defense arXiv Feb 19, 2026 · 6w ago
Zachary Coalson, Beth Sohler, Aiden Gabriel et al. · Oregon State University
Defends LLMs against jailbreaks by training multiple independent refusal pathways that attackers cannot simultaneously suppress
Prompt Injection nlp
We identify a structural weakness in current large language model (LLM) alignment: modern refusal mechanisms are fail-open. While existing approaches encode refusal behaviors across multiple latent features, suppressing a single dominant feature (via prompt-based jailbreaks) can cause alignment to collapse, leading to unsafe generation. Motivated by this, we propose fail-closed alignment as a design principle for robust LLM safety: refusal mechanisms should remain effective even under partial failures via redundant, independent causal pathways. We present a concrete instantiation of this principle: a progressive alignment framework that iteratively identifies and ablates previously learned refusal directions, forcing the model to reconstruct safety along new, independent subspaces. Across four jailbreak attacks, we achieve the strongest overall robustness while mitigating over-refusal and preserving generation quality, with small computational overhead. Our mechanistic analyses confirm that models trained with our method encode multiple, causally independent refusal directions that prompt-based jailbreaks cannot suppress simultaneously, providing empirical support for fail-closed alignment as a principled foundation for robust LLM safety.
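The core primitive in this framework, estimating a linear refusal direction and projecting it out of the residual stream before safety training is repeated, can be sketched with a difference-of-means estimate. The sketch below uses synthetic activations; the tensor shapes, function names, and data are illustrative assumptions, not the authors' released code.

import torch

def find_direction(pos_acts, neg_acts):
    # Difference-of-means direction (shape [d]) separating two activation sets.
    # pos_acts, neg_acts: [n, d] residual-stream activations (stand-in data here).
    diff = pos_acts.mean(0) - neg_acts.mean(0)
    return diff / diff.norm()

def ablate(acts, direction):
    # Remove the component of each activation vector along `direction`.
    return acts - (acts @ direction)[:, None] * direction

# Toy stand-in activations: harmful prompts shifted along a hidden "refusal" axis.
torch.manual_seed(0)
harmless = torch.randn(128, 512)
harmful = torch.randn(128, 512)
harmful[:, 0] += 3.0

refusal_dir = find_direction(harmful, harmless)
print("separation before ablation:", (harmful @ refusal_dir).mean().item())

harmful_ablated = ablate(harmful, refusal_dir)
print("separation after ablation: ", (harmful_ablated @ refusal_dir).mean().item())  # ~0

# In the progressive scheme as described, safety training is then re-run on the
# ablated model, forcing refusal to be re-encoded along a new, orthogonal subspace,
# and the identify-and-ablate step is repeated.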
llm transformer Oregon State University
attack arXiv Feb 19, 2026 · 6w ago
Zachary Coalson, Bo Fang, Sanghyun Hong · Oregon State University · University of Texas at Arlington
Discovers turn amplification as an LLM resource-exhaustion attack, using mechanistic activation analysis to enable persistent fine-tuning and parameter-corruption attack vectors
Model Poisoning Model Denial of Service nlp
Multi-turn interaction length is a dominant factor in the operational costs of conversational LLMs. In this work, we present a new failure mode in conversational LLMs: turn amplification, in which a model consistently prolongs multi-turn interactions without completing the underlying task. We show that an adversary can systematically exploit clarification-seeking behavior (commonly encouraged in multi-turn conversation settings) to scalably prolong interactions. Moving beyond prompt-level behaviors, we take a mechanistic perspective and identify a query-independent, universal activation subspace associated with clarification-seeking responses. Unlike prior cost-amplification attacks that rely on per-turn prompt optimization, our attack arises from conversational dynamics and persists across prompts and tasks. We show that this mechanism provides a scalable pathway to induce turn amplification: both supply-chain attacks via fine-tuning and runtime attacks through low-level parameter corruptions consistently shift models toward abstract, clarification-seeking behavior across prompts. Across multiple instruction-tuned LLMs and benchmarks, our attack substantially increases turn count while remaining compliant. We also show that existing defenses offer limited protection against this emerging class of failures.
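The mechanism described, shifting activations along a query-independent clarification direction, can be illustrated with a small PyTorch sketch. The direction, the steering scale, and the choice to fold the shift into a bias term are assumptions made for illustration; the paper's actual fine-tuning and parameter-corruption procedures may differ.

import torch
from torch import nn

hidden = 512
torch.manual_seed(0)
layer = nn.Linear(hidden, hidden)   # stand-in for one block's output projection
x = torch.randn(1, hidden)

# Hypothetical clarification direction; in the paper this is recovered from
# activations on clarification-seeking vs. task-completing responses.
clarify_dir = torch.nn.functional.normalize(torch.randn(hidden), dim=0)
alpha = 4.0

# (a) Runtime steering: shift the layer's output along the direction via a hook.
def steer(module, inputs, output):
    return output + alpha * clarify_dir

handle = layer.register_forward_hook(steer)
steered = layer(x)
handle.remove()

# (b) Persistent variant: fold the same shift into the bias so it survives in the
# checkpoint itself (one plausible reading of a low-level parameter corruption).
with torch.no_grad():
    layer.bias += alpha * clarify_dir
corrupted = layer(x)

print(torch.allclose(steered, corrupted))  # True: identical behavioural shift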
llm transformer Oregon State University · University of Texas at Arlington
attack arXiv Feb 19, 2026 · 6w ago
Leo Marchyok, Zachary Coalson, Sungho Keum et al. · Oregon State University · Korea Advanced Institute of Science & Technology
Discovers universal activation directions in LLM residual streams that reliably amplify PII leakage beyond existing prompt-based extraction attacks
Model Inversion Attack Sensitive Information Disclosure nlp
Modern language models exhibit rich internal structure, yet little is known about how privacy-sensitive behaviors, such as personally identifiable information (PII) leakage, are represented and modulated within their hidden states. We present UniLeak, a mechanistic-interpretability framework that identifies universal activation directions: latent directions in a model's residual stream whose linear addition at inference time consistently increases the likelihood of generating PII across prompts. These model-specific directions generalize across contexts and amplify PII generation probability, with minimal impact on generation quality. UniLeak recovers such directions without access to training data or ground-truth PII, relying only on self-generated text. Across multiple models and datasets, steering along these universal directions substantially increases PII leakage compared to existing prompt-based extraction methods. Our results offer a new perspective on PII leakage: the superposition of a latent signal in the model's representations, enabling both risk amplification and mitigation.
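A rough sketch of the pipeline as described: label the model's own generations for PII-like strings, take a difference of mean activations between the two groups, and add (to amplify leakage) or project out (to mitigate it) the resulting direction at inference time. The regex, sample texts, and synthetic activations below are hypothetical stand-ins, not the UniLeak implementation.

import re
import torch

PII_RE = re.compile(r"[\w.]+@[\w.]+|\b\d{3}-\d{2}-\d{4}\b")  # crude email / SSN matcher

def label_pii(samples):
    # Mark which self-generated texts contain PII-like strings.
    return [PII_RE.search(s) is not None for s in samples]

def leak_direction(acts, has_pii):
    # Difference of mean activations between PII and non-PII generations.
    mask = torch.tensor(has_pii)
    diff = acts[mask].mean(0) - acts[~mask].mean(0)
    return diff / diff.norm()

# Toy stand-ins: in the real pipeline `samples` would be the model's own generations
# and `acts` the residual-stream activations collected while producing them.
samples = ["contact me at jane@example.com", "the weather is nice today",
           "my ssn is 123-45-6789", "lorem ipsum dolor sit amet"]
torch.manual_seed(0)
acts = torch.randn(len(samples), 512)

d = leak_direction(acts, label_pii(samples))
h = torch.randn(1, 512)           # a hidden state at inference time
h_amplified = h + 8.0 * d         # steer toward leakage (risk amplification)
h_mitigated = h - (h @ d) * d     # or project the direction out (mitigation)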
llm transformer Oregon State University · Korea Advanced Institute of Science & Technology