Latest papers

2 papers
benchmark · arXiv · Mar 1, 2026

I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

Subramanyam Sahoo, Vinija Jain, Divya Chaudhary et al. · Independent · Meta AI +3 more

Exposes catastrophic silent failure of LLM toxicity safety classifiers under tiny embedding drift, defeating standard confidence-based monitoring

Prompt Injection · nlp
PDF
defense · arXiv · Nov 26, 2025

Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs

Dongkyu Derek Cho, Huan Song, Arijit Ghosh Chowdhury et al. · Duke University · AWS Generative AI Innovation Center

Demonstrates RLVR fine-tuning maintains LLM safety guardrails while improving reasoning, breaking the assumed safety-capability tradeoff

Prompt Injection · nlp
1 citation · PDF