Matching Ranks Over Probability Yields Truly Deep Safety Alignment
Published on arXiv (arXiv:2512.05518)
Prompt Injection
OWASP LLM Top 10 (LLM01)
Key Finding
PRESTO achieves up to a 4.7× improvement in mean StrongREJECT score under RAP attacks across three open-source LLMs (including Llama 2 7B Chat), with low impact on model utility.
Novel Techniques Introduced
PRESTO (PRefill attEntion STOpping) / RAP (Rank-Assisted Prefilling)
A frustratingly easy technique known as the prefilling attack has been shown to effectively circumvent the safety alignment of frontier LLMs by simply prefilling the assistant response with an affirmative prefix before decoding. In response, recent work proposed a supervised fine-tuning (SFT) defense using data augmentation to achieve a "deep" safety alignment, allowing the model to generate natural-language refusals immediately following harmful prefills. Unfortunately, we show in this work that the "deep" safety alignment produced by such an approach is in fact not very deep. A generalization of the prefilling attack, which we refer to as the Rank-Assisted Prefilling (RAP) attack, can effectively extract harmful content from models fine-tuned with the data augmentation defense by selecting low-probability "harmful" tokens from the top 20 predicted next tokens at each step (thus ignoring high-probability "refusal" tokens). We argue that this vulnerability is enabled by the "gaming" of the SFT objective when the target distribution entropies are low: low fine-tuning loss is achieved by shifting large probability mass to a small number of refusal tokens while neglecting the high ranks of harmful tokens. We then propose a new perspective on achieving deep safety alignment by matching the token ranks of the target distribution, rather than their probabilities. This perspective yields a surprisingly simple fix to the data augmentation defense based on regularizing the attention placed on harmful prefill tokens, an approach we call PRefill attEntion STOpping (PRESTO). Adding PRESTO yields up to a 4.7× improvement in the mean StrongREJECT score under RAP attacks across three popular open-source LLMs, with low impact on model utility.
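The core of the RAP attack is a decoding rule: at each step, inspect the top-k next-token candidates and pick a harmful continuation even when the model's probability mass sits on refusal tokens. The sketch below illustrates one step of that rule with a toy logit vector; the refusal-token blocklist is a simplified stand-in for however the actual attack identifies "harmful" versus "refusal" candidates, which the abstract does not specify.

```python
import numpy as np

# Hypothetical refusal markers; a stand-in for the attack's real selection criterion.
REFUSAL_TOKENS = {"Sorry", "cannot", "I'm", "As"}

def rap_decode_step(logits, vocab, k=20):
    """One Rank-Assisted Prefilling step: among the top-k candidates by logit,
    return the highest-ranked token that is NOT a refusal token, skipping the
    high-probability refusal mass that 'deep' alignment concentrates there."""
    topk = np.argsort(logits)[::-1][:k]   # token ids, descending logit
    for tid in topk:
        if vocab[tid] not in REFUSAL_TOKENS:
            return int(tid)               # first accessible non-refusal rank
    return int(topk[0])                   # fall back to argmax if all k refuse

# Toy example: the argmax is a refusal, but rank 2 is a compliant continuation.
vocab = ["Sorry", "Sure", "cannot", "step"]
logits = np.array([5.0, 3.0, 4.0, 1.0])
chosen = rap_decode_step(logits, vocab, k=20)
print(vocab[chosen])  # → Sure
```

This makes the paper's point concrete: the SFT defense can drive the refusal token's probability arbitrarily high without changing the fact that a harmful token survives at rank 2 or 3, well inside any top-20 window.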
Key Contributions
- Rank-Assisted Prefilling (RAP) attack that bypasses "deep" safety alignment by greedily selecting low-probability harmful tokens from the top-k next-token predictions, exploiting the fact that concentrating probability mass on refusal tokens still leaves harmful tokens at accessible ranks
- Theoretical analysis showing that SFT data-augmentation defenses "game" the training objective at low target-distribution entropy, achieving low loss without suppressing harmful token ranks
- PRESTO defense that regularizes attention on harmful prefill tokens so the model matches the token ranks of the target distribution, not just its probabilities, yielding up to a 4.7× improvement in StrongREJECT score under RAP attacks with minimal utility degradation
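PRESTO is described as regularizing the attention placed on harmful prefill tokens during fine-tuning. The paper's exact regularizer is not reproduced here; the sketch below shows one plausible reading, assuming a penalty on the attention mass that response positions place on prefill positions, which would be added to the SFT loss. The function name, `lam` weight, and mask conventions are all illustrative assumptions.

```python
import numpy as np

def presto_penalty(attn, prefill_mask, response_mask, lam=1.0):
    """Penalize attention from response positions onto harmful prefill tokens.

    attn:          (heads, query_len, key_len) attention probabilities
    prefill_mask:  boolean (key_len,), True at harmful prefill positions
    response_mask: boolean (query_len,), True at response positions
    Returns lam * mean over (head, response position) of the total attention
    mass directed at the prefill span; driving this toward zero "stops"
    attention to the prefill, in the spirit of attention stopping.
    """
    # Slice queries to response rows, then keys to prefill columns.
    mass = attn[:, response_mask][:, :, prefill_mask]  # (heads, n_resp, n_pref)
    return lam * mass.sum(axis=-1).mean()

# Toy example: 1 head, 4 positions, uniform attention of 0.25 everywhere;
# positions 0-1 are the prefill, positions 2-3 are the response.
attn = np.full((1, 4, 4), 0.25)
prefill = np.array([True, True, False, False])
response = np.array([False, False, True, True])
print(presto_penalty(attn, prefill, response))  # → 0.5
```

Under this reading, minimizing the penalty pushes the model to generate its refusal from the system prompt and user turn rather than conditioning on the injected prefill, which is one way rank structure (not just refusal probability) could change.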