
Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment

Yavuz Bakman, Duygu Nur Yaldiz, Salman Avestimehr, Sai Praneeth Karimireddy

1 citation · 33 references · arXiv


Published on arXiv · 2601.22313

Model Poisoning

OWASP ML Top 10 — ML10

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Proves theoretically and empirically that static black-box alignment provides zero guarantee of post-update alignment, with hidden misalignment capacity increasing with model scale — models can pass every standard safety test yet become catastrophically misaligned after a single benign gradient update.

Hair-Trigger Alignment

Novel technique introduced


Large Language Models (LLMs) are rarely static and are frequently updated in practice. A growing body of alignment research has shown that models initially deemed "aligned" can exhibit misaligned behavior after fine-tuning, such as forgetting jailbreak safety features or re-surfacing knowledge that was intended to be forgotten. These works typically assume that the initial model is aligned based on static black-box evaluation, i.e., the absence of undesired responses to a fixed set of queries. In contrast, we formalize model alignment in both the static and post-update settings and uncover a fundamental limitation of black-box evaluation. We theoretically show that, due to overparameterization, static alignment provides no guarantee of post-update alignment for any update dataset. Moreover, we prove that static black-box probing cannot distinguish a model that is genuinely post-update robust from one that conceals an arbitrary amount of adversarial behavior which can be activated by even a single benign gradient update. We further validate these findings empirically in LLMs across three core alignment domains: privacy, jailbreak safety, and behavioral honesty. We demonstrate the existence of LLMs that pass all standard black-box alignment tests, yet become severely misaligned after a single benign update. Finally, we show that the capacity to hide such latent adversarial behavior increases with model scale, confirming our theoretical prediction that post-update misalignment grows with the number of parameters. Together, our results highlight the inadequacy of static evaluation protocols and emphasize the urgent need for post-update-robust alignment evaluation.


Key Contributions

  • Formal theory proving that due to overparameterization, static black-box alignment evaluation cannot distinguish genuinely aligned models from those concealing arbitrary adversarial behavior activatable by a single benign gradient update
  • Construction of 'hair-trigger' LLMs that pass all standard alignment tests across jailbreak safety, privacy, and behavioral honesty but become severely misaligned after one benign update
  • Empirical demonstration that the capacity to hide latent adversarial behavior scales with model parameter count, confirming the theoretical prediction
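The overparameterization argument behind these contributions can be illustrated with a toy linear model (a hypothetical sketch for intuition only, not the paper's actual construction): when the parameter count exceeds the number of black-box evaluation queries, an arbitrarily large adversarial component can be hidden in the null space of the query matrix, so two models produce identical answers on every eval query yet diverge sharply after a single benign SGD step. The hidden capacity is the null-space dimension, `d - n_eval`, which grows with model size.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_eval = 64, 8                        # parameters far outnumber eval queries
X_eval = rng.normal(size=(n_eval, d))    # fixed static black-box evaluation set

w_aligned = rng.normal(size=d)           # a genuinely benign model

# Hide an adversarial component in the null space of the eval queries:
# any v with X_eval @ v == 0 is invisible to static black-box evaluation.
_, _, Vt = np.linalg.svd(X_eval)
null_basis = Vt[n_eval:]                 # d - n_eval hidden directions
v = 50.0 * null_basis[0]                 # arbitrarily large concealed behavior
w_trap = w_aligned + v

# Both models pass the same static evaluation with identical outputs...
assert np.allclose(X_eval @ w_aligned, X_eval @ w_trap)

# ...but one benign gradient step (MSE on a single benign example) exposes them.
x, y = rng.normal(size=d), 1.0
lr = 0.1
step = lambda w: w - lr * 2.0 * (w @ x - y) * x   # one SGD step
wa, wt = step(w_aligned), step(w_trap)

# Post-update, outputs on the *same* eval queries diverge.
gap = np.abs(X_eval @ wa - X_eval @ wt).max()
print(f"hidden directions: {null_basis.shape[0]}, post-update gap: {gap:.2f}")
```

Note the scaling behavior this toy makes concrete: the number of hidden directions is `d - n_eval`, so a larger model can conceal more latent behavior against the same fixed query set, mirroring the paper's prediction that post-update misalignment capacity grows with parameter count.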

🛡️ Threat Analysis

Model Poisoning

The paper constructs models with hidden adversarial behavior (latent misalignment) that passes all standard alignment tests — this is the core ML10 concept of embedding concealed malicious behavior inside an apparently well-behaved model. Although the 'trigger' is a benign gradient update rather than a test-time input pattern, the fundamental threat is a model harboring hidden targeted behavior undetectable through normal evaluation.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
black_box · training_time
Applications
large language model alignment · safety evaluation · jailbreak safety