Latest papers

6 papers
defense · arXiv · Apr 18, 2026

SafeDream: Safety World Model for Proactive Early Jailbreak Detection

Bo Yan, Weikai Lin, Yada Zhu et al. · University of Central Florida · University of Rochester +1 more

World-model-based early warning system that uses safety state prediction to detect multi-turn jailbreak attacks at least one turn before the LLM complies

Prompt Injection nlp
benchmark · arXiv · Apr 3, 2026

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

Zhangyun Tan, Zeliang Zhang, Susan Liang et al. · University of Rochester

Benchmark showing that training-free, prompt-based VLM unlearning achieves instruction compliance but not genuine visual concept erasure

Model Inversion Attack Sensitive Information Disclosure vision multimodal
attack · arXiv · Feb 6, 2026

SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

Mingqian Feng, Xiaodong Liu, Weiwei Yang et al. · University of Rochester · Microsoft Research

RL-trained multi-turn jailbreak attacker with intent-drift-aware reward achieves 80.1% ASR, beating SOTA by 33.9%

Prompt Injection nlp
1 citation · 1 influential
benchmark · arXiv · Jan 30, 2026

Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

Mingqian Feng, Xiaodong Liu, Weiwei Yang et al. · University of Rochester · Microsoft Research

Statistical scaling law using Beta distributions to predict LLM jailbreak success rates at large N from small-budget measurements

Prompt Injection nlp
defense · arXiv · Oct 2, 2025

AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

Zhenyu Pan, Yiting Zhang, Zhuo Liu et al. · Northwestern University · University of Illinois at Chicago +2 more

Adversarial co-evolution MARL framework that trains LLM agents to resist jailbreaks and prompt injection without external guard modules

Prompt Injection Excessive Agency nlp reinforcement-learning
1 citation
attack · arXiv · Sep 18, 2025

Discrete optimal transport is a strong audio adversarial attack

Anton Selitskiy, Akib Shahriyar, Jishnuraj Prakasan · University of Rochester · Rochester Institute of Technology

Gradient-free attack on audio anti-spoofing classifiers via optimal-transport distributional alignment

Input Manipulation Attack audio generative