ML Security Papers

Latest papers

63 papers

tool arXiv Apr 30, 2026 · 21d ago

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

Yanting Wang, Chenlong Yin, Ying Chen et al. · The Pennsylvania State University

Efficient red-teaming framework achieving 2-7x speedup for optimization-based prompt injection and knowledge corruption attacks on long-context LLMs

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF Code

attack arXiv Apr 24, 2026 · 27d ago

Training a General Purpose Automated Red Teaming Model

Aishwarya Padmakumar, Leon Derczynski, Traian Rebedea et al. · NVIDIA

Trains general-purpose LLM red teaming models that generalize to arbitrary adversarial goals without pre-existing evaluators

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF

attack arXiv Apr 20, 2026 · 4w ago

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

Yuan Fang, Yiming Luo, Aimin Zhou et al. · East China Normal University · Shanghai Innovation Institute

Automated red-teaming framework generating diverse toxic datasets via inverted constitutional AI to test LLM safety mechanisms

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF Code

attack arXiv Mar 25, 2026 · 8w ago

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov, Peter Romov, Igor Shilov et al. · MATS · ELLIS Institute Tübingen +3 more

AI agent autonomously discovers novel white-box jailbreak attacks outperforming 30+ existing methods with 100% ASR on target models

Input Manipulation Attack Prompt Injection Red-Team Agents Exploit Generation nlp

PDF Code

attack arXiv Mar 19, 2026 · 9w ago

Automated Membership Inference Attacks: Discovering MIA Signal Computations using LLM Agents

Toan Tran, Olivera Kotevska, Li Xiong · Emory University · Oak Ridge National Laboratory

LLM-agent framework that automatically discovers novel membership inference attack strategies, achieving a 0.18 AUC improvement over existing MIAs

Membership Inference Attack Vulnerability Discovery Red-Team Agents

PDF

tool arXiv Mar 18, 2026 · 9w ago

VeriGrey: Greybox Agent Validation

Yuntong Zhang, Sungmin Kang, Ruijie Meng et al. · National University of Singapore · Max-Planck Institute of Security and Privacy

Greybox fuzzing framework that discovers indirect prompt injection vulnerabilities in LLM agents by mutating prompts and tracking tool invocations

Prompt Injection Excessive Agency Red-Team Agents Fuzzing & Test Generation nlp

PDF

benchmark arXiv Mar 15, 2026 · 9w ago

When Scanners Lie: Evaluator Instability in LLM Red-Teaming

Lidor Erez, Omer Hofman, Tamir Nizri et al.

Automated LLM red-teaming scanners show unstable vulnerability measurements due to unreliable evaluators, varying ASR by up to 33%

Prompt Injection Benchmarks & Evaluation Red-Team Agents nlp

PDF

attack arXiv Mar 13, 2026 · 9w ago

PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses

Chenlong Yin, Runpeng Geng, Yanting Wang et al. · The Pennsylvania State University

RL-based adaptive prompt injection attack that systematically breaks state-of-the-art LLM defenses using entropy regularization and advantage weighting

Prompt Injection Red-Team Agents nlp

PDF Code

benchmark arXiv Mar 11, 2026 · 10w ago

Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services

Fabrizio Dimino, Bhaskarjit Sarmah, Stefano Pasquali · Domyn

Proposes risk-adjusted jailbreak evaluation framework and metric for LLMs deployed in banking and financial services

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF

benchmark arXiv Mar 10, 2026 · 10w ago

ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

Harry Owiredu-Ashley

Automated multi-turn red-teaming framework measures LLM guardrail degradation as continuous compliance trajectories, not binary jailbreak events

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF

survey arXiv Feb 24, 2026 · 12w ago

A Systematic Review of Algorithmic Red Teaming Methodologies for Assurance and Security of AI Applications

Shruti Srivastava, Kiranmayee Janardhan, Shaurya Jauhari · Infosys Limited

Surveys algorithmic red teaming methodologies for AI systems, covering automated attack tools, limitations, and future research gaps

Input Manipulation Attack Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF

benchmark arXiv Feb 18, 2026 · Feb 2026

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar, Ayush K Tarun, Murari Mandal et al. · Independent Researcher · EPFL +4 more

Benchmarks multi-turn, multilingual jailbreaking of LLM agents using a step-by-step illicit planning framework and novel time-to-jailbreak metrics

Prompt Injection Excessive Agency Red-Team Agents Benchmarks & Evaluation nlp

PDF

tool arXiv Feb 9, 2026 · Feb 2026

MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks

Georgios Syros, Evan Rose, Brian Grinstead et al. · Northeastern University · Mozilla Corporation

Automated red-teaming framework that adaptively discovers indirect prompt injection attacks against LLM web agents via trajectory analysis

Prompt Injection Excessive Agency Red-Team Agents Benchmarks & Evaluation nlp

PDF

attack arXiv Feb 9, 2026 · Feb 2026

Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

Oliver Daniels, Perusha Moodley, Ben Marlin et al. · MATS · University of Massachusetts Amherst +1 more

Automated red-team pipeline generates system prompts that fool both black-box and white-box LLM alignment auditing methods via strategic deception

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF Code

tool arXiv Feb 7, 2026 · Feb 2026

NAAMSE: Framework for Evolutionary Security Evaluation of Agents

Kunal Pai, Parth Shah, Harshil Patel · University of California · Independent Researcher

Evolutionary framework auto-generates and mutates adversarial prompts to uncover LLM agent jailbreaks missed by static red-teaming

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF Code

benchmark arXiv Feb 7, 2026 · Feb 2026

Aegis: Towards Governance, Integrity, and Security of AI Voice Agents

Xiang Li, Pin-Yu Chen, Wenqi Wei · Fordham University · IBM Research

Red-teaming framework exposing behavioral vulnerabilities in AI voice agents via adversarial speech scenarios across banking, IT support, and logistics

Prompt Injection Excessive Agency Red-Team Agents Benchmarks & Evaluation audio multimodal nlp

PDF

attack arXiv Jan 30, 2026 · Jan 2026

Whispers of Wealth: Red-Teaming Google's Agent Payments Protocol via Prompt Injection

Tanusree Debi, Wentian Zhu · University of Georgia

Red-teams Google's AP2 payment protocol via prompt injection attacks that hijack agent purchasing decisions and extract sensitive user payment data

Prompt Injection Sensitive Information Disclosure Red-Team Agents nlp

PDF

defense arXiv Jan 27, 2026 · Jan 2026

RvB: Automating AI System Hardening via Iterative Red-Blue Games

Lige Huang, Zicheng Liu, Jie Zhang et al. · Shanghai Artificial Intelligence Laboratory · Institute of Information Engineering +1 more

Automates LLM jailbreak guardrail hardening via iterative red-blue adversarial game without model parameter updates

Prompt Injection Red-Team Agents Patch & Remediation Blue-Team Agents nlp

PDF

attack arXiv Jan 20, 2026 · Jan 2026

AgenticRed: Optimizing Agentic Systems for Automated Red-teaming

Jiayi Yuan, Jonathan Nöther, Natasha Jaques et al. · University of Washington · Max Planck Institute for Software Systems

Evolutionary meta-search automatically designs agentic jailbreak pipelines achieving 96-100% ASR on Llama, GPT-4o, and Claude

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF

tool arXiv Jan 16, 2026 · Jan 2026

AJAR: Adaptive Jailbreak Architecture for Red-teaming

Yipu Dou, Wang Yang · Southeast University

Modular agentic red-teaming framework using MCP to orchestrate multi-turn jailbreak algorithms against tool-using LLM agents

Prompt Injection Excessive Agency Red-Team Agents Benchmarks & Evaluation nlp

PDF Code
