Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment
Yavuz Bakman, Duygu Nur Yaldiz, Salman Avestimehr, Sai Praneeth Karimireddy
Published on arXiv (2601.22313)
Model Poisoning
OWASP ML Top 10 — ML10
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Shows, both theoretically and empirically, that static black-box alignment evaluation provides no guarantee of post-update alignment, and that the capacity to hide misalignment grows with model scale: models can pass every standard safety test yet become catastrophically misaligned after a single benign gradient update.
Hair-Trigger Alignment
Novel technique introduced
Large Language Models (LLMs) are rarely static and are frequently updated in practice. A growing body of alignment research has shown that models initially deemed "aligned" can exhibit misaligned behavior after fine-tuning, such as forgetting jailbreak safety features or re-surfacing knowledge that was intended to be forgotten. These works typically assume that the initial model is aligned based on static black-box evaluation, i.e., the absence of undesired responses to a fixed set of queries. In contrast, we formalize model alignment in both the static and post-update settings and uncover a fundamental limitation of black-box evaluation. We theoretically show that, due to overparameterization, static alignment provides no guarantee of post-update alignment for any update dataset. Moreover, we prove that static black-box probing cannot distinguish a model that is genuinely post-update robust from one that conceals an arbitrary amount of adversarial behavior which can be activated by even a single benign gradient update. We further validate these findings empirically in LLMs across three core alignment domains: privacy, jailbreak safety, and behavioral honesty. We demonstrate the existence of LLMs that pass all standard black-box alignment tests, yet become severely misaligned after a single benign update. Finally, we show that the capacity to hide such latent adversarial behavior increases with model scale, confirming our theoretical prediction that post-update misalignment grows with the number of parameters. Together, our results highlight the inadequacy of static evaluation protocols and emphasize the urgent need for post-update-robust alignment evaluation.
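The core mechanism can be illustrated with a toy sketch (ours, not the paper's actual construction): an overparameterized model plants a high-gain "dormant" feature whose output weight is exactly zero, so the model is black-box identical to an aligned reference on every probe, yet a single benign gradient step moves that weight off zero and activates the hidden behavior. All variable names below are illustrative assumptions.

```python
import numpy as np

# Aligned reference model: f(x) = w_ref . x
w_ref = np.array([1.0, 0.0])

# "Hair-trigger" model: two-layer linear f(x) = a @ (B @ x).
# Row 0 of B reproduces the aligned behavior; row 1 is a planted
# high-gain adversarial feature whose output weight a[1] is zero,
# so the two models agree on every input before any update.
B = np.array([[1.0, 0.0],       # benign feature
              [0.0, 1000.0]])   # hidden high-gain feature (dormant)
a = np.array([1.0, 0.0])        # a[1] = 0 conceals the second pathway

def f(a, x):
    return a @ (B @ x)

# Static black-box evaluation: outputs match the aligned model
# exactly on arbitrary probes, so no query-based test can object.
probes = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, -0.3])]
assert all(np.isclose(f(a, x), w_ref @ x) for x in probes)

# One benign SGD step on (x0, y0) with squared loss. The nonzero
# training residual couples into the hidden weight via B's row 1.
x0, y0, lr = np.array([1.0, 0.1]), 1.5, 0.01
resid = f(a, x0) - y0            # nonzero benign-task residual
grad_a = 2 * resid * (B @ x0)    # dL/da
a_new = a - lr * grad_a

# The dormant pathway is now active: on a trigger input the updated
# model's output diverges wildly from the aligned reference (0.0).
x_trig = np.array([0.0, 1.0])
print(f(a_new, x_trig), w_ref @ x_trig)
```

The high gain planted in `B` is what makes the trigger "hair": even a tiny learning rate and a mild residual produce a large jump in the hidden output weight, which no amount of pre-update querying could have revealed.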
Key Contributions
- Formal theory proving that due to overparameterization, static black-box alignment evaluation cannot distinguish genuinely aligned models from those concealing arbitrary adversarial behavior activatable by a single benign gradient update
- Construction of 'hair-trigger' LLMs that pass all standard alignment tests across jailbreak safety, privacy, and behavioral honesty but become severely misaligned after one benign update
- Empirical demonstration that the capacity to hide latent adversarial behavior scales with model parameter count, confirming the theoretical prediction
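The scale claim in the last contribution can be sketched in the same toy setting (again ours, under assumed names): widening the model lets an adversary plant more dormant high-gain features, and a single benign update activates all of them at once, so the number of concealable trigger behaviors grows with parameter count.

```python
import numpy as np

def hidden_capacity(n_hidden, gain=100.0, lr=0.01):
    """Toy measure of concealable behavior at a given width: plant
    n_hidden high-gain features with zero output weight, take one
    benign SGD step, and count how many distinct trigger inputs now
    elicit a large off-spec response."""
    d = n_hidden + 1
    B = np.zeros((d, d))
    B[0, 0] = 1.0                        # benign feature
    for i in range(1, d):
        B[i, i] = gain                   # dormant high-gain features
    a = np.zeros(d)
    a[0] = 1.0                           # all hidden weights start at 0

    x0 = np.full(d, 0.1)
    x0[0] = 1.0                          # benign update example
    resid = a @ (B @ x0) - 1.5           # nonzero benign residual
    a = a - lr * 2 * resid * (B @ x0)    # one benign gradient step

    # Count triggers (standard basis vectors) with large deviation
    # from the aligned output of 0.
    count = 0
    for i in range(1, d):
        x_trig = np.zeros(d)
        x_trig[i] = 1.0
        if abs(a @ (B @ x_trig)) > 5.0:
            count += 1
    return count

for n in [1, 4, 16]:
    print(n, hidden_capacity(n))
```

In this linear toy the count grows linearly with the number of planted directions, echoing the paper's prediction that post-update misalignment capacity increases with parameter count; the real constructions operate in trained LLMs rather than a hand-built linear model.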
🛡️ Threat Analysis
The paper constructs models with hidden adversarial behavior (latent misalignment) that pass all standard alignment tests. This is the core ML10 scenario: concealed malicious behavior embedded in an apparently well-behaved model. Although the trigger here is a benign gradient update rather than a test-time input pattern, the fundamental threat is the same: a model harboring hidden targeted behavior that normal evaluation cannot detect.