Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment
Yavuz Bakman, Duygu Nur Yaldiz, Salman Avestimehr, Sai Praneeth Karimireddy
Published on arXiv (2601.22313)
Model Poisoning
OWASP ML Top 10 — ML10
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Shows, both theoretically and empirically, that static black-box alignment evaluation provides no guarantee of post-update alignment, and that the capacity to hide misalignment grows with model scale: models can pass every standard safety test yet become catastrophically misaligned after a single benign gradient update.
Hair-Trigger Alignment
Novel technique introduced
Large Language Models (LLMs) are rarely static and are frequently updated in practice. A growing body of alignment research has shown that models initially deemed "aligned" can exhibit misaligned behavior after fine-tuning, such as forgetting jailbreak safety features or re-surfacing knowledge that was intended to be forgotten. These works typically assume that the initial model is aligned based on static black-box evaluation, i.e., the absence of undesired responses to a fixed set of queries. In contrast, we formalize model alignment in both the static and post-update settings and uncover a fundamental limitation of black-box evaluation. We theoretically show that, due to overparameterization, static alignment provides no guarantee of post-update alignment for any update dataset. Moreover, we prove that static black-box probing cannot distinguish a model that is genuinely post-update robust from one that conceals an arbitrary amount of adversarial behavior which can be activated by even a single benign gradient update. We further validate these findings empirically in LLMs across three core alignment domains: privacy, jailbreak safety, and behavioral honesty. We demonstrate the existence of LLMs that pass all standard black-box alignment tests, yet become severely misaligned after a single benign update. Finally, we show that the capacity to hide such latent adversarial behavior increases with model scale, confirming our theoretical prediction that post-update misalignment grows with the number of parameters. Together, our results highlight the inadequacy of static evaluation protocols and emphasize the urgent need for post-update-robust alignment evaluation.
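The core mechanism can be illustrated with a toy sketch (ours, not the paper's actual construction): an overparameterized model plants a high-gain "dormant" feature whose output weight is exactly zero, so the model is black-box identical to an aligned reference on every probe, yet a single benign gradient step moves that weight off zero and activates the hidden behavior. All variable names below are illustrative assumptions.

```python
import numpy as np

# Aligned reference model: f(x) = w_ref . x
w_ref = np.array([1.0, 0.0])

# "Hair-trigger" model: two-layer linear f(x) = a @ (B @ x).
# Row 0 of B reproduces the aligned behavior; row 1 is a planted
# high-gain adversarial feature whose output weight a[1] is zero,
# so the two models agree on every input before any update.
B = np.array([[1.0, 0.0],       # benign feature
              [0.0, 1000.0]])   # hidden high-gain feature (dormant)
a = np.array([1.0, 0.0])        # a[1] = 0 conceals the second pathway

def f(a, x):
    return a @ (B @ x)

# Static black-box evaluation: outputs match the aligned model
# exactly on arbitrary probes, so no query-based test can object.
probes = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, -0.3])]
assert all(np.isclose(f(a, x), w_ref @ x) for x in probes)

# One benign SGD step on (x0, y0) with squared loss. The nonzero
# training residual couples into the hidden weight via B's row 1.
x0, y0, lr = np.array([1.0, 0.1]), 1.5, 0.01
resid = f(a, x0) - y0            # nonzero benign-task residual
grad_a = 2 * resid * (B @ x0)    # dL/da
a_new = a - lr * grad_a

# The dormant pathway is now active: on a trigger input the updated
# model's output diverges wildly from the aligned reference (0.0).
x_trig = np.array([0.0, 1.0])
print(f(a_new, x_trig), w_ref @ x_trig)
```

The high gain planted in `B` is what makes the trigger "hair": even a tiny learning rate and a mild residual produce a large jump in the hidden output weight, which no amount of pre-update querying could have revealed.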
Key Contributions
- Formal theory proving that due to overparameterization, static black-box alignment evaluation cannot distinguish genuinely aligned models from those concealing arbitrary adversarial behavior activatable by a single benign gradient update
- Construction of 'hair-trigger' LLMs that pass all standard alignment tests across jailbreak safety, privacy, and behavioral honesty but become severely misaligned after one benign update
- Empirical demonstration that the capacity to hide latent adversarial behavior scales with model parameter count, confirming the theoretical prediction
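The scale claim in the last contribution can be sketched in the same toy setting (again ours, under assumed names): widening the model lets an adversary plant more dormant high-gain features, and a single benign update activates all of them at once, so the number of concealable trigger behaviors grows with parameter count.

```python
import numpy as np

def hidden_capacity(n_hidden, gain=100.0, lr=0.01):
    """Toy measure of concealable behavior at a given width: plant
    n_hidden high-gain features with zero output weight, take one
    benign SGD step, and count how many distinct trigger inputs now
    elicit a large off-spec response."""
    d = n_hidden + 1
    B = np.zeros((d, d))
    B[0, 0] = 1.0                        # benign feature
    for i in range(1, d):
        B[i, i] = gain                   # dormant high-gain features
    a = np.zeros(d)
    a[0] = 1.0                           # all hidden weights start at 0

    x0 = np.full(d, 0.1)
    x0[0] = 1.0                          # benign update example
    resid = a @ (B @ x0) - 1.5           # nonzero benign residual
    a = a - lr * 2 * resid * (B @ x0)    # one benign gradient step

    # Count triggers (standard basis vectors) with large deviation
    # from the aligned output of 0.
    count = 0
    for i in range(1, d):
        x_trig = np.zeros(d)
        x_trig[i] = 1.0
        if abs(a @ (B @ x_trig)) > 5.0:
            count += 1
    return count

for n in [1, 4, 16]:
    print(n, hidden_capacity(n))
```

In this linear toy the count grows linearly with the number of planted directions, echoing the paper's prediction that post-update misalignment capacity increases with parameter count; the real constructions operate in trained LLMs rather than a hand-built linear model.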
🛡️ Threat Analysis
The paper constructs models with hidden adversarial behavior (latent misalignment) that pass all standard alignment tests. This is the core ML10 scenario: concealed malicious behavior embedded in an apparently well-behaved model. Although the trigger here is a benign gradient update rather than a test-time input pattern, the fundamental threat is the same: a model harboring hidden targeted behavior that normal evaluation cannot detect.