Latest papers

4 papers
attack arXiv Feb 6, 2026

SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

Mingqian Feng, Xiaodong Liu, Weiwei Yang et al. · University of Rochester · Microsoft Research

RL-trained multi-turn jailbreak attacker with an intent-drift-aware reward achieves an 80.1% attack success rate (ASR), beating SOTA by 33.9%

Prompt Injection nlp
1 citation · 1 influential
benchmark arXiv Jan 30, 2026

Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

Mingqian Feng, Xiaodong Liu, Weiwei Yang et al. · University of Rochester · Microsoft Research

Statistical scaling law using Beta distributions to predict LLM jailbreak success rates at large N from small-budget measurements

Prompt Injection nlp
defense arXiv Oct 2, 2025

AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

Zhenyu Pan, Yiting Zhang, Zhuo Liu et al. · Northwestern University · University of Illinois at Chicago +2 more

Adversarial co-evolution MARL framework that trains LLM agents to resist jailbreaks and prompt injection without external guard modules

Prompt Injection Excessive Agency nlp reinforcement-learning
1 citation
attack arXiv Sep 18, 2025

Discrete optimal transport is a strong audio adversarial attack

Anton Selitskiy, Akib Shahriyar, Jishnuraj Prakasan · University of Rochester · Rochester Institute of Technology

Attacks audio anti-spoofing ML classifiers via optimal transport distributional alignment without gradient access

Input Manipulation Attack audio generative