I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift
Subramanyam Sahoo 1,2, Vinija Jain 3, Divya Chaudhary 4, Aman Chadha 5,3
2 Meta AI
Published on arXiv
2603.01297
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Safety classifiers trained on frozen LLM embeddings collapse to near-random performance (50% ROC-AUC) under perturbations of just 2% of the embedding norm, while classifier confidence remains nearly unchanged; Gaussian, directional, and subspace drift all produce the same mechanism-invariant failure.
Instruction-tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, under the assumption that representations remain stable across model updates. We systematically investigate this assumption and find that it fails: normalized perturbations of magnitude $\sigma=0.02$ (corresponding to $\approx 1^\circ$ of angular drift on the embedding sphere) reduce classifier performance from $85\%$ to $50\%$ ROC-AUC. Critically, mean confidence drops by only $14\%$, producing dangerous silent failures in which $72\%$ of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit $20\%$ worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.
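The correspondence between a 2% embedding-norm perturbation and roughly 1° of angular drift can be checked with a short sketch. This is an illustration, not the paper's code: the embedding dimension (4096) and the reading of "$\sigma=0.02$" as a perturbation scaled to 2% of the embedding norm are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096      # hypothetical embedding dimension (assumption)
sigma = 0.02  # relative perturbation magnitude from the abstract

# Unit-norm embedding on the sphere.
x = rng.standard_normal(d)
x /= np.linalg.norm(x)

# Random perturbation scaled to 2% of the embedding norm.
g = rng.standard_normal(d)
delta = sigma * g / np.linalg.norm(g)
x_pert = x + delta

# Angular drift between original and perturbed embedding.
cos = x @ x_pert / np.linalg.norm(x_pert)
angle_deg = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
print(f"angular drift: {angle_deg:.2f} degrees")
```

In high dimensions the noise is nearly orthogonal to the embedding, so the drift concentrates around arctan(0.02) ≈ 1.15°, consistent with the "$\approx 1^\circ$" figure above.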
Key Contributions
- Quantifies a precise failure threshold: σ=0.02 embedding perturbations (≈1° angular drift) collapse safety classifiers from 85% to 50% ROC-AUC, with a failure cliff between σ=0.01 and σ=0.028
- Characterizes dangerous silent failures: 72% of misclassifications occur with high confidence (>0.8) while mean confidence drops by only 14%, defeating standard monitoring
- Demonstrates an alignment paradox: instruction-tuned models exhibit 20% worse class separability in embedding space than base models, making aligned systems harder to safeguard
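The silent-failure mode described above, where mean confidence stays high while predictions decouple from labels, can be mimicked with a toy simulation. This is purely illustrative: the U-shaped Beta score distribution and all numbers below are assumptions chosen to reproduce the qualitative shape, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hypothetical post-drift behaviour: labels are effectively random,
# but predicted probabilities are pushed toward the extremes, so the
# classifier stays confident while being wrong.
labels = rng.integers(0, 2, n)
probs = rng.beta(0.3, 0.3, n)  # U-shaped: mostly extreme scores (assumption)
preds = (probs > 0.5).astype(int)
confidence = np.maximum(probs, 1.0 - probs)

wrong = preds != labels
frac_confident_errors = (confidence[wrong] > 0.8).mean()
mean_conf = confidence.mean()
print(f"misclassifications with confidence > 0.8: {frac_confident_errors:.0%}")
print(f"mean confidence: {mean_conf:.2f}")
```

A monitor that only tracks mean confidence sees nothing unusual here, which is exactly why the paper argues such monitoring is defeated by embedding drift.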