Latest papers

4 papers
benchmark · arXiv · Mar 1, 2026

I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

Subramanyam Sahoo, Vinija Jain, Divya Chaudhary et al. · Independent · Meta AI +3 more

Exposes the catastrophic, silent failure of LLM toxicity safety classifiers under tiny embedding drift, which defeats standard confidence-based monitoring

Prompt Injection · nlp
PDF
benchmark · arXiv · Jan 2, 2026

A Comprehensive Dataset for Human vs. AI Generated Image Detection

Rajarshi Roy, Nasrin Imanpour, Ashhar Aziz et al. · Kalyani Government Engineering College · AI Institute USC +12 more

Releases MS COCOAI, a 96K-image benchmark for detecting AI-generated images and attributing them to specific generative models

Output Integrity Attack · vision · generative
1 citation · PDF · Code
attack · arXiv · Dec 8, 2025

TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards

Xiqiao Xiong, Ouxiang Li, Zhuo Liu et al. · University of Science and Technology of China · National University of Singapore +1 more

An RL-trained multi-turn jailbreak attacker that uses process rewards to guide trajectory-level optimization of LLM prompts

Prompt Injection · nlp · reinforcement-learning
PDF · Code
defense · arXiv · Aug 4, 2025

TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs

Amitava Das, Vinija Jain, Aman Chadha · BITS Pilani · Meta AI +1 more

Traces LLM alignment failures back to their training-corpus belief sources and defends against jailbreaks via inference-time filters, DPO regularization, and provenance-aware decoding

Prompt Injection · nlp
PDF · Code