ML Security Papers

Latest papers

53 papers

tool arXiv Apr 30, 2026 · 21d ago

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

Yanting Wang, Chenlong Yin, Ying Chen et al. · The Pennsylvania State University

Efficient red-teaming framework achieving 2-7x speedup for optimization-based prompt injection and knowledge corruption attacks on long-context LLMs

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF Code

attack arXiv Apr 24, 2026 · 27d ago

Training a General Purpose Automated Red Teaming Model

Aishwarya Padmakumar, Leon Derczynski, Traian Rebedea et al. · NVIDIA

Trains general-purpose LLM red teaming models that generalize to arbitrary adversarial goals without pre-existing evaluators

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF

tool arXiv Apr 23, 2026 · 28d ago

MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks

Run Hao, Zhuoran Tan · Aarhus University · University of Glasgow

Security testing framework for MCP tool servers detecting developer pitfalls through static analysis and trace-based validation

AI Supply Chain Attacks Insecure Plugin Design Prompt Injection Benchmarks & Evaluation Blue-Team Agents multimodalnlp

PDF

attack arXiv Apr 20, 2026 · 4w ago

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

Yuan Fang, Yiming Luo, Aimin Zhou et al. · East China Normal University · Shanghai Innovation Institute

Automated red-teaming framework generating diverse toxic datasets via inverted constitutional AI to test LLM safety mechanisms

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF Code

benchmark arXiv Apr 8, 2026 · 6w ago

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang et al. · CyCraft · National Taiwan University

Benchmark evaluating LLM safety guardrails on multi-step agent tool-calling trajectories across 12 risk categories including prompt injection

Prompt Injection Insecure Plugin Design Excessive Agency Benchmarks & Evaluation Blue-Team Agents nlp

PDF

attack arXiv Mar 26, 2026 · 8w ago

The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities

Ron Litvak · Columbia University

System prompt engineering creates exploitable phishing detection vulnerabilities in LLM email agents despite strong benchmark performance

Input Manipulation Attack Prompt Injection Excessive Agency Blue-Team Agents Benchmarks & Evaluation nlp

PDF

benchmark arXiv Mar 15, 2026 · 9w ago

When Scanners Lie: Evaluator Instability in LLM Red-Teaming

Lidor Erez, Omer Hofman, Tamir Nizri et al.

Automated LLM red-teaming scanners show unstable vulnerability measurements due to unreliable evaluators, varying ASR by up to 33%

Prompt Injection Benchmarks & Evaluation Red-Team Agents nlp

PDF

benchmark arXiv Mar 11, 2026 · 10w ago

Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services

Fabrizio Dimino, Bhaskarjit Sarmah, Stefano Pasquali · Domyn

Proposes risk-adjusted jailbreak evaluation framework and metric for LLMs deployed in banking and financial services

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF

benchmark arXiv Mar 10, 2026 · 10w ago

ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

Harry Owiredu-Ashley

Automated multi-turn red-teaming framework measures LLM guardrail degradation as continuous compliance trajectories, not binary jailbreak events

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF

survey arXiv Feb 24, 2026 · 12w ago

A Systematic Review of Algorithmic Red Teaming Methodologies for Assurance and Security of AI Applications

Shruti Srivastava, Kiranmayee Janardhan, Shaurya Jauhari · Infosys Limited

Surveys algorithmic red teaming methodologies for AI systems, covering automated attack tools, limitations, and future research gaps

Input Manipulation Attack Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF

benchmark arXiv Feb 23, 2026 · 12w ago

Red-Teaming Claude Opus and ChatGPT-based Security Advisors for Trusted Execution Environments

Kunal Mukherjee · Virginia Tech

Red-teams Claude Opus and ChatGPT as TEE security advisors, finding transferable prompt-induced failures and proposing an evaluation benchmark

Prompt Injection Benchmarks & Evaluation Triage & Prioritization nlp

1 citations PDF

benchmark arXiv Feb 18, 2026 · Feb 2026

AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks

Tanqiu Jiang, Yuhui Wang, Jiacheng Liang et al. · Stony Brook University

Benchmark evaluating LLM agent susceptibility to five long-horizon attack types across 28 agentic environments and 644 test cases

Prompt Injection Excessive Agency Benchmarks & Evaluation nlp

1 citations PDF Code

benchmark arXiv Feb 18, 2026 · Feb 2026

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar, Ayush K Tarun, Murari Mandal et al. · Independent Researcher · EPFL +4 more

Benchmarks multi-turn, multilingual jailbreaking of LLM agents using a step-by-step illicit planning framework and novel time-to-jailbreak metrics

Prompt Injection Excessive Agency Red-Team Agents Benchmarks & Evaluation nlp

PDF

benchmark arXiv Feb 18, 2026 · Feb 2026

Can Adversarial Code Comments Fool AI Security Reviewers -- Large-Scale Empirical Study of Comment-Based Attacks and Defenses Against LLM Code Analysis

Scott Thornton · Perfecxion.ai

Benchmark study finds adversarial code comments fail to meaningfully fool LLM vulnerability detectors across eight frontier models in 14,012 trials

Prompt Injection Vulnerability Discovery Benchmarks & Evaluation nlp

PDF

tool arXiv Feb 9, 2026 · Feb 2026

MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks

Georgios Syros, Evan Rose, Brian Grinstead et al. · Northeastern University · Mozilla Corporation

Automated red-teaming framework that adaptively discovers indirect prompt injection attacks against LLM web agents via trajectory analysis

Prompt Injection Excessive Agency Red-Team Agents Benchmarks & Evaluation nlp

PDF

attack arXiv Feb 9, 2026 · Feb 2026

Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

Oliver Daniels, Perusha Moodley, Ben Marlin et al. · MATS · University of Massachusetts Amherst +1 more

Automated red-team pipeline generates system prompts that fool both black-box and white-box LLM alignment auditing methods via strategic deception

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF Code

tool arXiv Feb 7, 2026 · Feb 2026

NAAMSE: Framework for Evolutionary Security Evaluation of Agents

Kunal Pai, Parth Shah, Harshil Patel · University of California · Independent Researcher

Evolutionary framework auto-generates and mutates adversarial prompts to uncover LLM agent jailbreaks missed by static red-teaming

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF Code

benchmark arXiv Feb 7, 2026 · Feb 2026

Aegis: Towards Governance, Integrity, and Security of AI Voice Agents

Xiang Li, Pin-Yu Chen, Wenqi Wei · Fordham University · IBM Research

Red-teaming framework exposing behavioral vulnerabilities in AI voice agents via adversarial speech scenarios across banking, IT support, and logistics

Prompt Injection Excessive Agency Red-Team Agents Benchmarks & Evaluation audiomultimodalnlp

PDF

defense arXiv Feb 5, 2026 · Feb 2026

Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening

Zhenxiong Yu, Zhi Yang, Zhiheng Jin et al. · SUFE · NUS +5 more

Event-driven LLM agent defense that selectively triggers hierarchical screening against prompt injection and multi-stage agent attacks

Prompt Injection Excessive Agency Blue-Team Agents Benchmarks & Evaluation nlp

PDF Code

attack arXiv Jan 30, 2026 · Jan 2026

Semantics-Preserving Evasion of LLM Vulnerability Detectors

Luze Sun, Alina Oprea, Eric Wong · Northeastern University · University of Pennsylvania

Carrier-constrained GCG attacks evade LLM-based code vulnerability detectors using behavior-preserving code transformations that transfer to black-box APIs

Input Manipulation Attack Vulnerability Discovery Benchmarks & Evaluation nlp

PDF Code

Loading more papers…

Latest papers

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

Training a General Purpose Automated Red Teaming Model

MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities

When Scanners Lie: Evaluator Instability in LLM Red-Teaming

Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services

ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

A Systematic Review of Algorithmic Red Teaming Methodologies for Assurance and Security of AI Applications

Red-Teaming Claude Opus and ChatGPT-based Security Advisors for Trusted Execution Environments

AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Can Adversarial Code Comments Fool AI Security Reviewers -- Large-Scale Empirical Study of Comment-Based Attacks and Defenses Against LLM Code Analysis

MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks

Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

NAAMSE: Framework for Evolutionary Security Evaluation of Agents

Aegis: Towards Governance, Integrity, and Security of AI Voice Agents

Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening

Semantics-Preserving Evasion of LLM Vulnerability Detectors

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue