ML Security Papers

Latest papers

4 papers

benchmark arXiv Mar 1, 2026

I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

Subramanyam Sahoo, Vinija Jain, Divya Chaudhary et al. · Independent · Meta AI +3 more

Exposes catastrophic silent failure of LLM toxicity safety classifiers under tiny embedding drift, defeating standard confidence-based monitoring

Prompt Injection nlp

PDF

benchmark arXiv Jan 2, 2026 · Jan 2026

A Comprehensive Dataset for Human vs. AI Generated Image Detection

Rajarshi Roy, Nasrin Imanpour, Ashhar Aziz et al. · Kalyani Government Engineering College · AI Institute USC +12 more

Releases MS COCOAI, a 96K-image benchmark for detecting AI-generated images and attributing them to specific generative models

Output Integrity Attack vision generative

1 citation PDF Code

attack arXiv Dec 8, 2025 · Dec 2025

TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards

Xiqiao Xiong, Ouxiang Li, Zhuo Liu et al. · University of Science and Technology of China · National University of Singapore +1 more

RL-trained multi-turn jailbreak attacker using process rewards to guide trajectory-level LLM prompt optimization

Prompt Injection nlp reinforcement-learning

PDF Code

defense arXiv Aug 4, 2025 · Aug 2025

TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs

Amitava Das, Vinija Jain, Aman Chadha · BITS Pilani · Meta AI +1 more

Traces LLM alignment failures to training corpus sources and defends against jailbreaks via inference filters, DPO regularization, and provenance-aware decoding

Prompt Injection nlp

PDF Code
