Latest papers

17 papers
defense arXiv Mar 3, 2026

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

Aradhye Agarwal, Gurdit Siyan, Yash Pandya et al. · Microsoft Research

Post-training RL framework that teaches agentic LLMs to refuse harmful tool-use actions and resist prompt injection in multi-step settings

Prompt Injection Excessive Agency nlp
PDF
attack arXiv Feb 6, 2026

SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

Mingqian Feng, Xiaodong Liu, Weiwei Yang et al. · University of Rochester · Microsoft Research

RL-trained multi-turn jailbreak attacker with an intent-drift-aware reward achieves 80.1% ASR, beating SOTA by 33.9% (reward sketched below)

Prompt Injection nlp
1 citation 1 influential PDF Code
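A minimal sketch of what an intent-drift-aware reward could look like, assuming drift is measured as cosine distance between an embedding of the current attack turn and an embedding of the original harmful intent. The judge score, the cosine formulation, and the weight `lam` are illustrative assumptions, not SEMA's published reward.

```python
# Hypothetical intent-drift-aware reward for a multi-turn jailbreak attacker.
# Assumes unit-normalized sentence embeddings; the cosine penalty and the
# weight `lam` are illustrative guesses, not SEMA's actual formulation.
import numpy as np

def intent_drift_reward(judge_score: float,
                        turn_emb: np.ndarray,
                        intent_emb: np.ndarray,
                        lam: float = 0.5) -> float:
    """judge_score in [0, 1]; penalize turns that drift off the seed intent."""
    drift = 1.0 - float(np.dot(turn_emb, intent_emb))  # 0 when fully on-intent
    return judge_score - lam * drift
```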
benchmark arXiv Jan 30, 2026

Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

Mingqian Feng, Xiaodong Liu, Weiwei Yang et al. · University of Rochester · Microsoft Research

Statistical scaling law using Beta distributions to predict LLM jailbreak success rates at large N from small-budget measurements (worked sketch below)

Prompt Injection nlp
PDF
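The Beta-distribution idea admits a compact worked example: if each prompt's per-attempt success probability p is drawn from Beta(a, b), then the probability that at least one of N i.i.d. attempts succeeds is 1 - B(a, b+N)/B(a, b). A minimal sketch, assuming plain moment matching as the fitting step (the paper's estimator may differ):

```python
# Extrapolate Best-of-N attack success rates from a small sampling budget,
# assuming per-prompt success probabilities follow a Beta(a, b) distribution.
# Moment matching is used here for fitting; the paper's estimator may differ.
import numpy as np
from scipy.special import betaln

def fit_beta_moments(success_rates):
    """Moment-match Beta(a, b) to observed per-prompt success rates."""
    m, v = np.mean(success_rates), np.var(success_rates)
    common = m * (1 - m) / v - 1          # equals a + b under the Beta model
    return m * common, (1 - m) * common

def asr_at_n(a, b, n):
    """P(at least one success in n tries) = 1 - B(a, b+n) / B(a, b)."""
    return 1.0 - np.exp(betaln(a, b + n) - betaln(a, b))

# Small-budget measurement: empirical success rates per prompt at N=16 tries.
rates = np.array([0.0, 0.05, 0.10, 0.0, 0.25, 0.15, 0.0, 0.30])
a, b = fit_beta_moments(rates)
for n in (16, 256, 4096):
    print(f"predicted ASR at N={n}: {asr_at_n(a, b, n):.3f}")
```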
benchmark arXiv Jan 26, 2026

Comparison requires valid measurement: Rethinking attack success rate comparisons in AI red teaming

Alexandra Chouldechova, A. Feder Cooper, Solon Barocas et al. · Microsoft Research · Microsoft

Argues that common LLM jailbreak ASR comparisons are methodologically invalid, drawing on social-science measurement theory and inferential statistics

Prompt Injection nlp
1 citation PDF
defense arXiv Dec 12, 2025

Learning to Extract Context for Context-Aware LLM Inference

Minseon Kim, Lucas Caccia, Zhengyan Shi et al. · Microsoft Research

RL-trained context extractor reduces LLM harmful outputs and over-refusals by inferring user intent before generating responses

Prompt Injection nlp
PDF
defense IACR ePrint Dec 9, 2025

Improved Pseudorandom Codes from Permuted Puzzles

Miranda Christ, Noah Golowich, Sam Gunn et al. · Columbia University · Microsoft Research +5 more

Constructs provably robust LLM watermarks with subexponential security, surviving worst-case edits and detection-key-aware adversaries

Output Integrity Attack nlp
PDF
benchmark arXiv Nov 14, 2025

Exposing Weak Links in Multi-Agent Systems under Adversarial Prompting

Nirmit Arora, Sathvik Joel, Ishan Kavathekar et al. · Microsoft Research · International Institute of Information Technology +1 more

Benchmarks adversarial prompt vulnerabilities across five multi-agent LLM architectures using a new evaluation framework and diagnostic metric

Prompt Injection Excessive Agency nlp
2 citations PDF Code
benchmark arXiv Nov 7, 2025

TAMAS: Benchmarking Adversarial Risks in Multi-Agent LLM Systems

Ishan Kavathekar, Hemang Jain, Ameya Rathod et al. · International Institute of Information Technology · Microsoft Research

Benchmark evaluating six adversarial attack types against multi-agent LLM systems across 10 backbone LLMs and two agent frameworks

Prompt Injection Excessive Agency nlp
PDF Code
attack arXiv Nov 3, 2025

Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges

Hamin Koo, Minseon Kim, Jaehyung Kim · Yonsei University · Microsoft Research

Meta-optimized bi-level framework co-evolves jailbreak prompts and LLM judge templates to achieve SOTA attack success rates on Claude models

Prompt Injection nlp
1 citation PDF
defense arXiv Oct 31, 2025

BlurGuard: A Simple Approach for Robustifying Image Protection Against AI-Powered Editing

Jinsu Kim, Yunhun Nam, Minseon Kim et al. · Korea University · Microsoft Research

Defends adversarial image protections from reversal attacks by applying adaptive per-region Gaussian blur to adjust the noise frequency spectrum (sketch below)

Output Integrity Attack vision generative
PDF Code
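A rough sketch of the per-region idea, assuming a grayscale image, square tiles, and the Laplacian response as the high-frequency proxy; the tiling, sigma schedule, and normalization are invented for illustration and are not BlurGuard's exact procedure.

```python
# Adaptive per-region Gaussian blur: tile the image, estimate each tile's
# high-frequency energy, and blur noisier tiles more strongly. Grayscale
# input assumed; all heuristics here are illustrative, not the paper's.
import numpy as np
from scipy.ndimage import gaussian_filter, laplace

def adaptive_blur(img: np.ndarray, tile: int = 32, sigma_max: float = 2.0) -> np.ndarray:
    out = img.astype(np.float64).copy()
    hf = np.abs(laplace(out))                         # high-frequency proxy
    for y in range(0, out.shape[0], tile):
        for x in range(0, out.shape[1], tile):
            patch = out[y:y+tile, x:x+tile]
            # Scale sigma by this tile's share of high-frequency energy.
            w = hf[y:y+tile, x:x+tile].mean() / (hf.mean() + 1e-8)
            out[y:y+tile, x:x+tile] = gaussian_filter(patch, sigma=min(w, 1.0) * sigma_max)
    return out
```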
defense arXiv Oct 30, 2025

Reasoning Up the Instruction Ladder for Controllable Language Models

Zishuo Zheng, Vidhisha Balachandran, Chan Young Park et al. · The Ohio State University · Microsoft Research +1 more

Trains LLMs via RL on instruction-hierarchy data to resist jailbreaks and prompt injection, cutting attack success rates by 20%

Prompt Injection nlp
1 citation PDF Code
defense arXiv Oct 20, 2025

BlueCodeAgent: A Blue Teaming Agent Enabled by Automated Red Teaming for CodeGen AI

Chengquan Guo, Yuzhou Nie, Chulin Xie et al. · University of Chicago · UC Santa Barbara +3 more

Blue teaming agent for CodeGen LLMs using automated red teaming to detect malicious instructions and vulnerable code outputs

Prompt Injection nlp
PDF
defense arXiv Oct 13, 2025

Information-Preserving Reformulation of Reasoning Traces for Antidistillation

Jiayu Ding, Lei Cui, Li Dong et al. · Xi’an Jiaotong University · Microsoft Research

Defends LLM reasoning traces against distillation-based model theft by reformulating traces through self-talk removal and conclusion reordering (toy sketch below)

Model Theft nlp
1 citation PDF
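A toy sketch of the two operations the summary names, stripping conversational self-talk and moving the conclusion ahead of the derivation; the markers and the split heuristic are invented for illustration and are not the paper's method.

```python
# Toy reformulation of a reasoning trace: drop self-talk sentences and move
# the final conclusion to the front. Markers are hypothetical examples.
import re

SELF_TALK = re.compile(r"\b(Hmm|Wait|Let me think|Actually)[^.]*\.\s*", re.IGNORECASE)

def reformulate(trace: str, conclusion_marker: str = "Therefore") -> str:
    cleaned = SELF_TALK.sub("", trace)
    head, sep, conclusion = cleaned.rpartition(conclusion_marker)
    # If a conclusion is found, promote it; otherwise return the cleaned trace.
    return (sep + conclusion + " " + head).strip() if sep else cleaned
```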
tool arXiv Oct 2, 2025

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

Chengquan Guo, Chulin Xie, Yu Yang et al. · University of Chicago · University of Illinois Urbana-Champaign +5 more

Automated red-teaming agent that adaptively combines jailbreak tools to uncover safety vulnerabilities in LLM-based code agents

Prompt Injection nlp
4 citations PDF
tool arXiv Oct 2, 2025

VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL

Kyoungjun Park, Yifan Yang, Juheon Yi et al. · The University of Texas at Austin · Microsoft Research

Detects AI-generated videos with a GRPO-fine-tuned MLLM and temporal-artifact reward models, achieving >95% accuracy (GRPO advantage sketched below)

Output Integrity Attack vision multimodal generative
2 citations 1 influential PDF Code
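For reference, the group-relative advantage at the core of GRPO is compact enough to sketch: sample a group of responses per input, score each with the reward model, and normalize rewards within the group. The rewards below are stand-ins, not the paper's temporal-artifact reward models.

```python
# GRPO's group-relative advantage: normalize each sampled response's reward
# against its own group's mean and standard deviation.
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. rewards for four sampled explanations of the same video
print(grpo_advantages([1.0, 0.0, 0.5, 1.0]))
```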
attack EMNLP Sep 23, 2025

Anecdoctoring: Automated Red-Teaming Across Language and Place

Alejandro Cuevas, Saloni Dash, Bharat Kumar Nayak et al. · Carnegie Mellon University · Microsoft Research +2 more

Automated multilingual red-teaming attack elicits LLM disinformation using knowledge graph-augmented adversarial prompt generation

Prompt Injection nlp
2 citations 1 influential PDF
defense arXiv Aug 17, 2025

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective

Minseon Kim, Jin Myung Kwak, Lama Alssum et al. · Microsoft Research · KAIST +5 more

Preserves LLM safety during fine-tuning via hyperparameter tuning and EMA momentum, cutting harmful responses from 16% to 5% (EMA sketch below)

Transfer Learning Attack Prompt Injection nlp
PDF
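A minimal sketch of the EMA ingredient in PyTorch, assuming the averaged copy is what gets evaluated or served; the decay value and update placement are illustrative, not the paper's exact recipe.

```python
# Keep an exponential moving average of the weights during fine-tuning so the
# served model stays close to the safe initialization. Buffers are omitted.
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    for e, p in zip(ema_model.parameters(), model.parameters()):
        e.mul_(decay).add_(p, alpha=1.0 - decay)

model = torch.nn.Linear(8, 2)              # stand-in for the fine-tuned LLM
ema_model = copy.deepcopy(model)
# ... inside the fine-tuning loop, after each optimizer.step():
ema_update(ema_model, model)
```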