ML Security Papers

Latest papers

19 papers

tool arXiv Apr 21, 2026 · 4w ago

Reasoning-Aware AIGC Detection via Alignment and Reinforcement

Zhao Wang, Max Xiong, Jianxun Lian et al. · Renmin University of China · Duke University +1 more

Reasoning-driven AI text detector using reinforcement learning to generate interpretable explanations before classification across diverse LLM sources

Output Integrity Attack nlp

PDF Code

defense arXiv Apr 9, 2026 · 6w ago

$\oslash$ Source Models Leak What They Shouldn't $\nrightarrow$: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization

Arnav Devalapally, Poornima Jain, Kartik Srinivas et al. · Indian Institute of Technology Hyderabad · University of Michigan +2 more

Machine unlearning method that removes source-domain class knowledge during domain adaptation to prevent privacy leakage via zero-shot transfer

Model Inversion Attack vision

PDF Code

defense arXiv Mar 3, 2026 · 11w ago

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

Aradhye Agarwal, Gurdit Siyan, Yash Pandya et al. · Microsoft Research

Post-training RL framework that teaches agentic LLMs to refuse harmful tool-use actions and resist prompt injection in multi-step settings

Prompt Injection Excessive Agency nlp

PDF

attack arXiv Feb 6, 2026 · Feb 2026

SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

Mingqian Feng, Xiaodong Liu, Weiwei Yang et al. · University of Rochester · Microsoft Research

RL-trained multi-turn jailbreak attacker with intent-drift-aware reward achieves 80.1% ASR, beating SOTA by 33.9%

Prompt Injection nlp

1 citation 1 influential PDF Code

benchmark arXiv Jan 30, 2026 · Jan 2026

Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

Mingqian Feng, Xiaodong Liu, Weiwei Yang et al. · University of Rochester · Microsoft Research

Statistical scaling law using Beta distributions to predict LLM jailbreak success rates at large N from small-budget measurements

Prompt Injection nlp

PDF

benchmark arXiv Jan 26, 2026 · Jan 2026

Comparison requires valid measurement: Rethinking attack success rate comparisons in AI red teaming

Alexandra Chouldechova, A. Feder Cooper, Solon Barocas et al. · Microsoft Research · Microsoft

Critiques LLM jailbreak ASR comparisons as methodologically invalid using social science measurement theory and inferential statistics

Prompt Injection Benchmarks & Evaluation nlp

1 citation PDF

defense arXiv Dec 12, 2025 · Dec 2025

Learning to Extract Context for Context-Aware LLM Inference

Minseon Kim, Lucas Caccia, Zhengyan Shi et al. · Microsoft Research

RL-trained context extractor reduces LLM harmful outputs and over-refusals by inferring user intent before generating responses

Prompt Injection nlp

PDF

defense IACR ePrint Dec 9, 2025 · Dec 2025

Improved Pseudorandom Codes from Permuted Puzzles

Miranda Christ, Noah Golowich, Sam Gunn et al. · Columbia University · Microsoft Research +5 more

Constructs provably robust LLM watermarks with subexponential security, surviving worst-case edits and detection-key-aware adversaries

Output Integrity Attack nlp

PDF

benchmark arXiv Nov 14, 2025 · Nov 2025

Exposing Weak Links in Multi-Agent Systems under Adversarial Prompting

Nirmit Arora, Sathvik Joel, Ishan Kavathekar et al. · Microsoft Research · International Institute of Information Technology +1 more

Benchmarks adversarial prompt vulnerabilities across five multi-agent LLM architectures using a new evaluation framework and diagnostic metric

Prompt Injection Excessive Agency nlp

2 citations PDF Code

benchmark arXiv Nov 7, 2025 · Nov 2025

TAMAS: Benchmarking Adversarial Risks in Multi-Agent LLM Systems

Ishan Kavathekar, Hemang Jain, Ameya Rathod et al. · International Institute of Information Technology · Microsoft Research

Benchmark evaluating six adversarial attack types against multi-agent LLM systems across 10 backbone LLMs and two agent frameworks

Prompt Injection Excessive Agency Benchmarks & Evaluation nlp

PDF Code

attack arXiv Nov 3, 2025 · Nov 2025

Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges

Hamin Koo, Minseon Kim, Jaehyung Kim · Yonsei University · Microsoft Research

Meta-optimized bi-level framework co-evolves jailbreak prompts and LLM judge templates to achieve SOTA attack success rates on Claude models

Prompt Injection nlp

1 citation PDF

defense arXiv Oct 31, 2025 · Oct 2025

BlurGuard: A Simple Approach for Robustifying Image Protection Against AI-Powered Editing

Jinsu Kim, Yunhun Nam, Minseon Kim et al. · Korea University · Microsoft Research

Defends adversarial image protections against reversal attacks by applying adaptive per-region Gaussian blur to adjust the noise frequency spectrum

Output Integrity Attack vision generative

PDF Code

defense arXiv Oct 30, 2025 · Oct 2025

Reasoning Up the Instruction Ladder for Controllable Language Models

Zishuo Zheng, Vidhisha Balachandran, Chan Young Park et al. · The Ohio State University · Microsoft Research +1 more

Trains LLMs via RL on instruction-hierarchy data to resist jailbreaks and prompt injection, cutting attack success rates by 20%

Prompt Injection nlp

1 citation PDF Code

defense arXiv Oct 20, 2025 · Oct 2025

BlueCodeAgent: A Blue Teaming Agent Enabled by Automated Red Teaming for CodeGen AI

Chengquan Guo, Yuzhou Nie, Chulin Xie et al. · University of Chicago · UC Santa Barbara +3 more

Blue teaming agent for CodeGen LLMs using automated red teaming to detect malicious instructions and vulnerable code outputs

Prompt Injection Blue-Team Agents Vulnerability Discovery Red-Team Agents nlp

PDF

defense arXiv Oct 13, 2025 · Oct 2025

Information-Preserving Reformulation of Reasoning Traces for Antidistillation

Jiayu Ding, Lei Cui, Li Dong et al. · Xi’an Jiaotong University · Microsoft Research

Defends LLM reasoning traces against distillation-based model theft by reformulating them through self-talk removal and conclusion reordering

Model Theft nlp

1 citation PDF

tool arXiv Oct 2, 2025 · Oct 2025

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

Chengquan Guo, Chulin Xie, Yu Yang et al. · University of Chicago · University of Illinois Urbana-Champaign +5 more

Automated red-teaming agent that adaptively combines jailbreak tools to uncover safety vulnerabilities in LLM-based code agents

Prompt Injection nlp

4 citations PDF

tool arXiv Oct 2, 2025 · Oct 2025

VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL

Kyoungjun Park, Yifan Yang, Juheon Yi et al. · The University of Texas at Austin · Microsoft Research

Detects AI-generated videos via GRPO-fine-tuned MLLM with temporal artifact reward models, achieving >95% accuracy

Output Integrity Attack vision multimodal generative

2 citations 1 influential PDF Code

attack EMNLP Sep 23, 2025 · Sep 2025

Anecdoctoring: Automated Red-Teaming Across Language and Place

Alejandro Cuevas, Saloni Dash, Bharat Kumar Nayak et al. · Carnegie Mellon University · Microsoft Research +2 more

Automated multilingual red-teaming attack elicits LLM disinformation using knowledge graph-augmented adversarial prompt generation

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

2 citations 1 influential PDF

defense arXiv Aug 17, 2025 · Aug 2025

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective

Minseon Kim, Jin Myung Kwak, Lama Alssum et al. · Microsoft Research · KAIST +5 more

Defends LLM safety during fine-tuning via hyperparameter tuning and EMA momentum, cutting harmful responses from 16% to 5%

Transfer Learning Attack Prompt Injection nlp

PDF
