I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift
Subramanyam Sahoo 1,2, Vinija Jain 3, Divya Chaudhary 4, Aman Chadha 5,3
2 Meta AI
Published on arXiv
2603.01297
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Safety classifiers trained on frozen LLM embeddings collapse to near-random performance (50% ROC-AUC) under perturbations of just 2% of the embedding norm, while classifier confidence remains nearly unchanged; Gaussian, directional, and subspace drift all produce the same mechanism-invariant failure.
Instruction-tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, under the assumption that representations remain stable across model updates. We systematically investigate this assumption and find that it fails: normalized perturbations of magnitude $\sigma=0.02$ (corresponding to $\approx 1^\circ$ of angular drift on the embedding sphere) reduce classifier performance from $85\%$ to $50\%$ ROC-AUC. Critically, mean confidence drops by only $14\%$, producing dangerous silent failures in which $72\%$ of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit $20\%$ worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.
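The correspondence between a 2% embedding-norm perturbation and roughly 1° of angular drift can be checked with a short sketch. This is an illustration, not the paper's code: the embedding dimension (4096) and the reading of "$\sigma=0.02$" as a perturbation scaled to 2% of the embedding norm are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096      # hypothetical embedding dimension (assumption)
sigma = 0.02  # relative perturbation magnitude from the abstract

# Unit-norm embedding on the sphere.
x = rng.standard_normal(d)
x /= np.linalg.norm(x)

# Random perturbation scaled to 2% of the embedding norm.
g = rng.standard_normal(d)
delta = sigma * g / np.linalg.norm(g)
x_pert = x + delta

# Angular drift between original and perturbed embedding.
cos = x @ x_pert / np.linalg.norm(x_pert)
angle_deg = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
print(f"angular drift: {angle_deg:.2f} degrees")
```

In high dimensions the noise is nearly orthogonal to the embedding, so the drift concentrates around arctan(0.02) ≈ 1.15°, consistent with the "$\approx 1^\circ$" figure above.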
Key Contributions
- Quantifies a precise failure threshold: σ=0.02 embedding perturbations (≈1° angular drift) collapse safety classifiers from 85% to 50% ROC-AUC, with a failure cliff between σ=0.01 and σ=0.028
- Characterizes dangerous silent failures: 72% of misclassifications occur with high confidence (>0.8) while mean confidence drops by only 14%, defeating standard monitoring
- Demonstrates an alignment paradox: instruction-tuned models exhibit 20% worse class separability in embedding space than base models, making aligned systems harder to safeguard
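The silent-failure mode described above, where mean confidence stays high while predictions decouple from labels, can be mimicked with a toy simulation. This is purely illustrative: the U-shaped Beta score distribution and all numbers below are assumptions chosen to reproduce the qualitative shape, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hypothetical post-drift behaviour: labels are effectively random,
# but predicted probabilities are pushed toward the extremes, so the
# classifier stays confident while being wrong.
labels = rng.integers(0, 2, n)
probs = rng.beta(0.3, 0.3, n)  # U-shaped: mostly extreme scores (assumption)
preds = (probs > 0.5).astype(int)
confidence = np.maximum(probs, 1.0 - probs)

wrong = preds != labels
frac_confident_errors = (confidence[wrong] > 0.8).mean()
mean_conf = confidence.mean()
print(f"misclassifications with confidence > 0.8: {frac_confident_errors:.0%}")
print(f"mean confidence: {mean_conf:.2f}")
```

A monitor that only tracks mean confidence sees nothing unusual here, which is exactly why the paper argues such monitoring is defeated by embedding drift.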