
I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

Subramanyam Sahoo 1,2, Vinija Jain 3, Divya Chaudhary 4, Aman Chadha 5,3


Published on arXiv

2603.01297

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Safety classifiers trained on frozen LLM embeddings collapse to near-random performance (50% ROC-AUC) under perturbations of just 2% of the embedding norm while retaining nearly unchanged confidence; Gaussian, directional, and subspace drift all produce this same mechanism-invariant failure.
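The three drift mechanisms named above can be sketched as follows. This is a minimal illustration, not the paper's code: the embedding matrix, dimensions, and the per-row rescaling convention (noise norm fixed to σ times the embedding norm) are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.standard_normal((100, 768))   # stand-in for frozen LLM embeddings
sigma = 0.02                            # relative perturbation scale from the paper

def scale_to(noise, emb, sigma):
    """Rescale each noise row to sigma times the matching embedding's norm."""
    target = sigma * np.linalg.norm(emb, axis=1, keepdims=True)
    return noise * target / np.linalg.norm(noise, axis=1, keepdims=True)

# Gaussian drift: isotropic random noise added to every embedding
gaussian = emb + scale_to(rng.standard_normal(emb.shape), emb, sigma)

# Directional drift: every embedding shifts along one fixed direction
d = rng.standard_normal(768)
directional = emb + scale_to(np.tile(d, (len(emb), 1)), emb, sigma)

# Subspace drift: noise confined to a random low-dimensional subspace
basis, _ = np.linalg.qr(rng.standard_normal((768, 16)))  # 16 orthonormal directions
subspace = emb + scale_to(rng.standard_normal((100, 16)) @ basis.T, emb, sigma)
```

All three variants perturb each embedding by exactly 2% of its norm, which is what makes the reported "mechanism-invariant" failure comparable across them.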


Instruction-tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, under the assumption that representations remain stable across model updates. We systematically investigate this assumption and find that it fails: normalized perturbations of magnitude $σ=0.02$ (corresponding to $\approx 1^\circ$ angular drift on the embedding sphere) reduce classifier performance from $85\%$ to $50\%$ ROC-AUC. Critically, mean confidence drops by only $14\%$, producing dangerous silent failures in which $72\%$ of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit $20\%$ worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.
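The "$σ=0.02$ ≈ 1° angular drift" equivalence follows from simple geometry: in high dimension a random noise direction is nearly orthogonal to the embedding, so a relative perturbation of size $σ$ rotates the vector by roughly $\arctan σ$ radians. A quick check (the dimension 4096 is an arbitrary stand-in):

```python
import numpy as np

sigma = 0.02
# Analytic approximation: near-orthogonal noise of relative size sigma
# rotates the vector by about arctan(sigma).
approx_deg = np.degrees(np.arctan(sigma))

# Empirical check on a random high-dimensional vector
rng = np.random.default_rng(1)
v = rng.standard_normal(4096)
n = rng.standard_normal(4096)
n *= sigma * np.linalg.norm(v) / np.linalg.norm(n)   # ||n|| = sigma * ||v||
w = v + n
cos = v @ w / (np.linalg.norm(v) * np.linalg.norm(w))
angle_deg = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```

Both values land near 1.1°, consistent with the paper's "approximately one degree" characterization of the failure threshold.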


Key Contributions

  • Quantifies precise failure threshold: σ=0.02 embedding perturbations (~1° angular drift) collapse safety classifiers from 85% to 50% ROC-AUC, with failure cliff at σ=0.01–0.028
  • Characterizes dangerous silent failures where 72% of misclassifications occur with high confidence (>0.8) while mean confidence drops only 14%, defeating standard monitoring
  • Demonstrates alignment paradox: instruction-tuned models exhibit 20% worse class separability in embedding space than base models, making aligned systems harder to safeguard
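The silent-failure statistic above (72% of errors at confidence > 0.8) can be monitored directly. Below is a hedged sketch of such a metric for a binary safety classifier; the function name, the 0.5 decision threshold, and the 0.8 confidence cutoff are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def silent_failure_rate(probs, labels, conf_threshold=0.8):
    """Fraction of misclassified examples on which the classifier is still
    highly confident -- errors that confidence-based monitoring cannot see."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    preds = (probs >= 0.5).astype(int)           # standard 0.5 decision rule
    conf = np.where(preds == 1, probs, 1 - probs)  # confidence in the prediction
    wrong = preds != labels
    if not wrong.any():
        return 0.0
    return float((conf[wrong] > conf_threshold).mean())
```

For example, with predicted probabilities `[0.9, 0.95, 0.6, 0.1]` against labels `[0, 1, 0, 0]`, two predictions are wrong and one of them carries confidence above 0.8, giving a silent-failure rate of 0.5.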

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, white_box
Datasets
Civil Comments
Applications
llm safety classifiers, toxicity detection, content moderation