[arXiv]

Compliant Persona Steering Drops Llama Refusal Rate from 97% to 2%

June 26, 2026

In Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, a linear "compliant persona" direction in activation space gates the refusal direction: steering toward compliance drops Llama's refusal rate from 97% to 2%. Refusal is computed in earlier layers but expressed in late layers, so interventions that target only a single refusal direction miss this dependency.
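To make "steering along a linear direction" concrete, here is a minimal sketch on toy data. The summary does not say how the paper extracts its directions, so the difference-of-means estimate, the function names, and the random stand-in activations are all assumptions for illustration. The sketch shows only the two generic operations (shifting along a direction and projecting a direction out), not the gating result itself.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)

def mean_diff_direction(acts_a, acts_b):
    """Unit direction from the mean of acts_b toward the mean of acts_a.
    Assumption: difference of means, a common way to estimate a direction."""
    return unit(acts_a.mean(axis=0) - acts_b.mean(axis=0))

def steer(h, direction, alpha):
    """Shift hidden states h of shape (n, d) by alpha along a unit direction."""
    return h + alpha * direction

def ablate(h, direction):
    """Remove the component of each row of h along a unit direction."""
    return h - np.outer(h @ direction, direction)

rng = np.random.default_rng(0)
d_model = 16

# Toy stand-ins for residual-stream activations (not real model data).
compliant_acts = rng.normal(size=(32, d_model)) + 2.0 * np.eye(d_model)[0]
neutral_acts = rng.normal(size=(32, d_model))

persona_dir = mean_diff_direction(compliant_acts, neutral_acts)

h = rng.normal(size=(4, d_model))
h_steered = steer(h, persona_dir, alpha=3.0)
h_ablated = ablate(h, persona_dir)

# Steering moves every row by exactly alpha along the direction;
# ablation leaves no component along it.
proj_shift = (h_steered - h) @ persona_dir
residual_proj = h_ablated @ persona_dir
```

In a real model the same arithmetic would be applied to hidden states at a chosen layer, and the refusal rate measured before and after. Which layer and what steering strength the paper uses are not stated in this summary.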

HOW THIS AFFECTS YOU

● builder: System prompts that establish compliant personas may structurally undermine refusal behavior in deployed 7–8B models, warranting review of persona framing in production prompts.

● researcher: This reframes refusal as a two-stage mechanism gated by persona, requiring multi-direction intervention models rather than single refusal-direction ablations.

● policy: Persona-based jailbreaks are mechanistically grounded here; safety evaluations that test refusal in isolation underestimate real-world bypass risk.

