ML Security Papers

Latest papers

100 papers

tool arXiv Apr 30, 2026 · 21d ago

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

Yanting Wang, Chenlong Yin, Ying Chen et al. · The Pennsylvania State University

Efficient red-teaming framework achieving 2-7x speedup for optimization-based prompt injection and knowledge corruption attacks on long-context LLMs

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF Code

defense arXiv Apr 27, 2026 · 24d ago

Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents

Jiaqi Li, Yang Zhao, Bin Sun et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +3 more

Self-play security training framework teaching AI agents to detect prompt injection, memory poisoning, and supply-chain attacks via role alternation

AI Supply Chain Attacks Prompt Injection Excessive Agency Blue-Team Agents nlp

PDF

defense arXiv Apr 25, 2026 · 26d ago

Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents

Yuandao Cai, Wensheng Tang, Cheng Wen et al. · The Hong Kong University of Science and Technology · Xidian University

Taint tracking framework that detects malicious data flows in LLM agents from untrusted sources to privileged actions

Prompt Injection Insecure Plugin Design Blue-Team Agents nlp

PDF

attack arXiv Apr 24, 2026 · 27d ago

Training a General Purpose Automated Red Teaming Model

Aishwarya Padmakumar, Leon Derczynski, Traian Rebedea et al. · NVIDIA

Trains general-purpose LLM red teaming models that generalize to arbitrary adversarial goals without pre-existing evaluators

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF

tool arXiv Apr 23, 2026 · 28d ago

MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks

Run Hao, Zhuoran Tan · Aarhus University · University of Glasgow

Security testing framework for MCP tool servers detecting developer pitfalls through static analysis and trace-based validation

AI Supply Chain Attacks Insecure Plugin Design Prompt Injection Benchmarks & Evaluation Blue-Team Agents multimodal nlp

PDF

attack arXiv Apr 20, 2026 · 4w ago

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

Yuan Fang, Yiming Luo, Aimin Zhou et al. · East China Normal University · Shanghai Innovation Institute

Automated red-teaming framework generating diverse toxic datasets via inverted constitutional AI to test LLM safety mechanisms

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF Code

benchmark arXiv Apr 8, 2026 · 6w ago

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang et al. · CyCraft · National Taiwan University

Benchmark evaluating LLM safety guardrails on multi-step agent tool-calling trajectories across 12 risk categories including prompt injection

Prompt Injection Insecure Plugin Design Excessive Agency Benchmarks & Evaluation Blue-Team Agents nlp

PDF

defense arXiv Mar 26, 2026 · 8w ago

Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

Hieu Xuan Le, Benjamin Goh, Quy Anh Tang · GovTech

Lightweight LLM judges with structured reasoning detect jailbreaks and prompt injections in production chatbots under strict latency constraints

Prompt Injection Blue-Team Agents nlp

PDF

attack arXiv Mar 26, 2026 · 8w ago

The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities

Ron Litvak · Columbia University

System prompt engineering creates exploitable phishing detection vulnerabilities in LLM email agents despite strong benchmark performance

Input Manipulation Attack Prompt Injection Excessive Agency Blue-Team Agents Benchmarks & Evaluation nlp

PDF

attack arXiv Mar 25, 2026 · 8w ago

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov, Peter Romov, Igor Shilov et al. · MATS · ELLIS Institute Tübingen +3 more

AI agent autonomously discovers novel white-box jailbreak attacks outperforming 30+ existing methods with 100% ASR on target models

Input Manipulation Attack Prompt Injection Red-Team Agents Exploit Generation nlp

PDF Code

defense arXiv Mar 22, 2026 · 8w ago

Emergent Formal Verification: How an Autonomous AI Ecosystem Independently Discovered SMT-Based Safety Across Six Domains

Octavian Untila · Aisophical SRL

Autonomous AI system independently discovers SMT-based formal verification for AI safety across six domains with 100% accuracy

Output Integrity Attack Insecure Plugin Design Excessive Agency Vulnerability Discovery Patch & Remediation nlp multimodal

PDF

attack arXiv Mar 19, 2026 · 9w ago

Automated Membership Inference Attacks: Discovering MIA Signal Computations using LLM Agents

Toan Tran, Olivera Kotevska, Li Xiong · Emory University · Oak Ridge National Laboratory

LLM-agent framework that automatically discovers novel membership inference attack strategies, achieving a 0.18 AUC improvement over existing MIAs

Membership Inference Attack Vulnerability Discovery Red-Team Agents

PDF

tool arXiv Mar 18, 2026 · 9w ago

VeriGrey: Greybox Agent Validation

Yuntong Zhang, Sungmin Kang, Ruijie Meng et al. · National University of Singapore · Max-Planck Institute of Security and Privacy

Greybox fuzzing framework that discovers indirect prompt injection vulnerabilities in LLM agents by mutating prompts and tracking tool invocations

Prompt Injection Excessive Agency Red-Team Agents Fuzzing & Test Generation nlp

PDF

defense arXiv Mar 18, 2026 · 9w ago

Caging the Agents: A Zero Trust Security Architecture for Autonomous AI in Healthcare

Saikat Maiti · Commure · nFactor Technologies

Zero-trust architecture for healthcare AI agents using kernel isolation, credential proxies, network policies, and a prompt integrity framework

AI Supply Chain Attacks Prompt Injection Excessive Agency Blue-Team Agents nlp

PDF

benchmark arXiv Mar 15, 2026 · 9w ago

When Scanners Lie: Evaluator Instability in LLM Red-Teaming

Lidor Erez, Omer Hofman, Tamir Nizri et al.

Automated LLM red-teaming scanners show unstable vulnerability measurements due to unreliable evaluators, varying ASR by up to 33%

Prompt Injection Benchmarks & Evaluation Red-Team Agents nlp

PDF

attack arXiv Mar 13, 2026 · 9w ago

PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses

Chenlong Yin, Runpeng Geng, Yanting Wang et al. · The Pennsylvania State University

RL-based adaptive prompt injection attack that systematically breaks state-of-the-art LLM defenses using entropy regularization and advantage weighting

Prompt Injection Red-Team Agents nlp

PDF Code

benchmark arXiv Mar 11, 2026 · 10w ago

Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services

Fabrizio Dimino, Bhaskarjit Sarmah, Stefano Pasquali · Domyn

Proposes risk-adjusted jailbreak evaluation framework and metric for LLMs deployed in banking and financial services

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF

attack arXiv Mar 10, 2026 · 10w ago

Compatibility at a Cost: Systematic Discovery and Exploitation of MCP Clause-Compliance Vulnerabilities

Nanzi Yang, Weiheng Bai, Kangjie Lu · University of Minnesota

Systematically exploits MCP SDK non-compliance vulnerabilities to launch silent prompt injection and DoS attacks against LLM agents

Insecure Plugin Design Prompt Injection Vulnerability Discovery nlp

PDF

benchmark arXiv Mar 10, 2026 · 10w ago

ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

Harry Owiredu-Ashley

Automated multi-turn red-teaming framework measures LLM guardrail degradation as continuous compliance trajectories, not binary jailbreak events

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF

survey arXiv Feb 24, 2026 · 12w ago

A Systematic Review of Algorithmic Red Teaming Methodologies for Assurance and Security of AI Applications

Shruti Srivastava, Kiranmayee Janardhan, Shaurya Jauhari · Infosys Limited

Surveys algorithmic red teaming methodologies for AI systems, covering automated attack tools, limitations, and future research gaps

Input Manipulation Attack Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF


