Matching Ranks Over Probability Yields Truly Deep Safety Alignment
Published on arXiv (arXiv:2512.05518)
Prompt Injection
OWASP LLM Top 10 (LLM01)
Key Finding
PRESTO achieves up to a 4.7× improvement in mean StrongREJECT score under RAP attacks across three open-source LLMs (including Llama 2 7B Chat), with low impact on model utility.
Novel Techniques Introduced
PRESTO (PRefill attEntion STOpping) / RAP (Rank-Assisted Prefilling)
A frustratingly easy technique known as the prefilling attack has been shown to effectively circumvent the safety alignment of frontier LLMs by simply prefilling the assistant response with an affirmative prefix before decoding. In response, recent work proposed a supervised fine-tuning (SFT) defense using data augmentation to achieve a "deep" safety alignment, allowing the model to generate natural-language refusals immediately following harmful prefills. Unfortunately, we show in this work that the "deep" safety alignment produced by such an approach is in fact not very deep. A generalization of the prefilling attack, which we refer to as the Rank-Assisted Prefilling (RAP) attack, can effectively extract harmful content from models fine-tuned with the data augmentation defense by selecting low-probability "harmful" tokens from the top 20 predicted next tokens at each step (thus ignoring high-probability "refusal" tokens). We argue that this vulnerability is enabled by the "gaming" of the SFT objective when the target distribution entropies are low: low fine-tuning loss is achieved by shifting large probability mass to a small number of refusal tokens while neglecting the high ranks of harmful tokens. We then propose a new perspective on achieving deep safety alignment by matching the token ranks of the target distribution, rather than their probabilities. This perspective yields a surprisingly simple fix to the data augmentation defense based on regularizing the attention placed on harmful prefill tokens, an approach we call PRefill attEntion STOpping (PRESTO). Adding PRESTO yields up to a 4.7× improvement in the mean StrongREJECT score under RAP attacks across three popular open-source LLMs, with low impact on model utility.
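The core of the RAP attack is a decoding rule: at each step, inspect the top-k next-token candidates and pick a harmful continuation even when the model's probability mass sits on refusal tokens. The sketch below illustrates one step of that rule with a toy logit vector; the refusal-token blocklist is a simplified stand-in for however the actual attack identifies "harmful" versus "refusal" candidates, which the abstract does not specify.

```python
import numpy as np

# Hypothetical refusal markers; a stand-in for the attack's real selection criterion.
REFUSAL_TOKENS = {"Sorry", "cannot", "I'm", "As"}

def rap_decode_step(logits, vocab, k=20):
    """One Rank-Assisted Prefilling step: among the top-k candidates by logit,
    return the highest-ranked token that is NOT a refusal token, skipping the
    high-probability refusal mass that 'deep' alignment concentrates there."""
    topk = np.argsort(logits)[::-1][:k]   # token ids, descending logit
    for tid in topk:
        if vocab[tid] not in REFUSAL_TOKENS:
            return int(tid)               # first accessible non-refusal rank
    return int(topk[0])                   # fall back to argmax if all k refuse

# Toy example: the argmax is a refusal, but rank 2 is a compliant continuation.
vocab = ["Sorry", "Sure", "cannot", "step"]
logits = np.array([5.0, 3.0, 4.0, 1.0])
chosen = rap_decode_step(logits, vocab, k=20)
print(vocab[chosen])  # → Sure
```

This makes the paper's point concrete: the SFT defense can drive the refusal token's probability arbitrarily high without changing the fact that a harmful token survives at rank 2 or 3, well inside any top-20 window.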
Key Contributions
- Rank-Assisted Prefilling (RAP) attack that bypasses "deep" safety alignment by greedily selecting low-probability harmful tokens from the top-k next-token predictions, exploiting the fact that concentrating probability mass on refusal tokens still leaves harmful tokens at accessible ranks
- Theoretical analysis showing that SFT data-augmentation defenses "game" the training objective at low target-distribution entropy, achieving low loss without suppressing harmful token ranks
- PRESTO defense that regularizes attention on harmful prefill tokens so the model matches the token ranks of the target distribution, not just its probabilities, yielding up to a 4.7× improvement in StrongREJECT score under RAP attacks with minimal utility degradation
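PRESTO is described as regularizing the attention placed on harmful prefill tokens during fine-tuning. The paper's exact regularizer is not reproduced here; the sketch below shows one plausible reading, assuming a penalty on the attention mass that response positions place on prefill positions, which would be added to the SFT loss. The function name, `lam` weight, and mask conventions are all illustrative assumptions.

```python
import numpy as np

def presto_penalty(attn, prefill_mask, response_mask, lam=1.0):
    """Penalize attention from response positions onto harmful prefill tokens.

    attn:          (heads, query_len, key_len) attention probabilities
    prefill_mask:  boolean (key_len,), True at harmful prefill positions
    response_mask: boolean (query_len,), True at response positions
    Returns lam * mean over (head, response position) of the total attention
    mass directed at the prefill span; driving this toward zero "stops"
    attention to the prefill, in the spirit of attention stopping.
    """
    # Slice queries to response rows, then keys to prefill columns.
    mass = attn[:, response_mask][:, :, prefill_mask]  # (heads, n_resp, n_pref)
    return lam * mass.sum(axis=-1).mean()

# Toy example: 1 head, 4 positions, uniform attention of 0.25 everywhere;
# positions 0-1 are the prefill, positions 2-3 are the response.
attn = np.full((1, 4, 4), 0.25)
prefill = np.array([True, True, False, False])
response = np.array([False, False, True, True])
print(presto_penalty(attn, prefill, response))  # → 0.5
```

Under this reading, minimizing the penalty pushes the model to generate its refusal from the system prompt and user turn rather than conditioning on the injected prefill, which is one way rank structure (not just refusal probability) could change.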