Latest papers

2 papers
benchmark · arXiv · Mar 1, 2026

I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

Subramanyam Sahoo, Vinija Jain, Divya Chaudhary et al. · Independent · Meta AI +3 more

Exposes catastrophic silent failure of LLM toxicity safety classifiers under tiny embedding drift, defeating standard confidence-based monitoring

Prompt Injection · nlp
PDF
defense · arXiv · Nov 26, 2025

Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs

Dongkyu Derek Cho, Huan Song, Arijit Ghosh Chowdhury et al. · Duke University · AWS Generative AI Innovation Center

Demonstrates RLVR fine-tuning maintains LLM safety guardrails while improving reasoning, breaking the assumed safety-capability tradeoff

Prompt Injection · nlp
1 citation · PDF