ML Security Papers

Latest papers

22 papers

defense arXiv Apr 27, 2026 · 24d ago

Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents

Jiaqi Li, Yang Zhao, Bin Sun et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +3 more

Self-play security training framework teaching AI agents to detect prompt injection, memory poisoning, and supply-chain attacks via role alternation

AI Supply Chain Attacks Prompt Injection Excessive Agency Blue-Team Agents nlp

PDF

defense arXiv Apr 25, 2026 · 26d ago

Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents

Yuandao Cai, Wensheng Tang, Cheng Wen et al. · The Hong Kong University of Science and Technology · Xidian University

Taint tracking framework that detects malicious data flows in LLM agents from untrusted sources to privileged actions

Prompt Injection Insecure Plugin Design Blue-Team Agents nlp

PDF

tool arXiv Apr 23, 2026 · 28d ago

MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks

Run Hao, Zhuoran Tan · Aarhus University · University of Glasgow

Security testing framework for MCP tool servers detecting developer pitfalls through static analysis and trace-based validation

AI Supply Chain Attacks Insecure Plugin Design Prompt Injection Benchmarks & Evaluation Blue-Team Agents multimodal nlp

PDF

benchmark arXiv Apr 8, 2026 · 6w ago

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang et al. · CyCraft · National Taiwan University

Benchmark evaluating LLM safety guardrails on multi-step agent tool-calling trajectories across 12 risk categories including prompt injection

Prompt Injection Insecure Plugin Design Excessive Agency Benchmarks & Evaluation Blue-Team Agents nlp

PDF

defense arXiv Mar 26, 2026 · 8w ago

Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

Hieu Xuan Le, Benjamin Goh, Quy Anh Tang · GovTech

Lightweight LLM judges with structured reasoning detect jailbreaks and prompt injections in production chatbots under strict latency constraints

Prompt Injection Blue-Team Agents nlp

PDF

attack arXiv Mar 26, 2026 · 8w ago

The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities

Ron Litvak · Columbia University

System prompt engineering creates exploitable phishing detection vulnerabilities in LLM email agents despite strong benchmark performance

Input Manipulation Attack Prompt Injection Excessive Agency Blue-Team Agents Benchmarks & Evaluation nlp

PDF

defense arXiv Mar 18, 2026 · 9w ago

Caging the Agents: A Zero Trust Security Architecture for Autonomous AI in Healthcare

Saikat Maiti · Commure · nFactor Technologies

Zero-trust architecture for healthcare AI agents using kernel isolation, credential proxies, network policies, and prompt integrity framework

AI Supply Chain Attacks Prompt Injection Excessive Agency Blue-Team Agents nlp

PDF

defense arXiv Feb 5, 2026 · Feb 2026

Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening

Zhenxiong Yu, Zhi Yang, Zhiheng Jin et al. · SUFE · NUS +5 more

Event-driven LLM agent defense that selectively triggers hierarchical screening against prompt injection and multi-stage agent attacks

Prompt Injection Excessive Agency Blue-Team Agents Benchmarks & Evaluation nlp

PDF Code

defense arXiv Jan 27, 2026 · Jan 2026

RvB: Automating AI System Hardening via Iterative Red-Blue Games

Lige Huang, Zicheng Liu, Jie Zhang et al. · Shanghai Artificial Intelligence Laboratory · Institute of Information Engineering +1 more

Automates LLM jailbreak guardrail hardening via iterative red-blue adversarial game without model parameter updates

Prompt Injection Red-Team Agents Patch & Remediation Blue-Team Agents nlp

PDF

defense arXiv Jan 15, 2026 · Jan 2026

ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

Yutao Mou, Zhangchi Xue, Lijun Li et al. · Peking University · Shanghai Artificial Intelligence Laboratory

Proactive step-level guardrail for LLM agent tool calls defends against malicious requests and prompt injection, cutting harmful invocations by 65%

Insecure Plugin Design Prompt Injection Blue-Team Agents nlp

2 citations PDF

defense arXiv Jan 12, 2026 · Jan 2026

SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations

Mohammed Himayath Ali, Mohammed Aqib Abdullah, Mohammed Mudassir Uddin et al. · Computer Science Department

Defends SOC-deployed LLMs against prompt injection in security artifacts using Constitutional AI, adaptive guardrails, and DPO unlearning

Prompt Injection Blue-Team Agents nlp

PDF

defense arXiv Dec 24, 2025 · Dec 2025

AegisAgent: An Autonomous Defense Agent Against Prompt Injection Attacks in LLM-HARs

Yihan Wang, Huanqi Yang, Shantanu Pal et al. · City University of Hong Kong · Deakin University

Autonomous agent defense against prompt injection in LLM-based wearable HAR systems, reducing attack success rate by 30%

Prompt Injection Blue-Team Agents nlp multimodal

1 citation PDF

defense arXiv Dec 19, 2025 · Dec 2025

Verifiability-First Agents: Provable Observability and Lightweight Audit Agents for Controlling Autonomous LLM Systems

Abhivansh Gupta · Indian Institute of Technology

Proposes cryptographic attestation architecture and benchmark to detect and remediate misaligned autonomous LLM agents

Excessive Agency Prompt Injection Blue-Team Agents Benchmarks & Evaluation nlp

PDF

defense arXiv Dec 15, 2025 · Dec 2025

Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

Asa Cooper Stickland, Jan Michelfeit, Arathi Mani et al. · UK AI Security Institute

Stress-tests asynchronous monitors for misaligned LLM coding agents via iterative red-blue team games in realistic SWE environments

Excessive Agency Prompt Injection Blue-Team Agents Benchmarks & Evaluation nlp

1 citation PDF Code

tool arXiv Oct 31, 2025 · Oct 2025

From Evidence to Verdict: An Agent-Based Forensic Framework for AI-Generated Image Detection

Mengfei Liang, Yiting Qu, Yukun Jiang et al. · CISPA Helmholtz Center for Information Security

Multi-agent forensic framework with LLM debate and memory module achieves 97% accuracy on AI-generated image detection

Output Integrity Attack Blue-Team Agents vision nlp

1 citation PDF

tool arXiv Oct 22, 2025 · Oct 2025

AegisMCP: Online Graph Intrusion Detection for Tool-Augmented LLMs on Edge Devices

Zhonghao Zhan, Amir Al Sadi, Krinos Li et al. · Imperial College London

Graph-based runtime intrusion detector for MCP tool-augmented LLM agents catching exfiltration and malicious server registration on edge hardware

Insecure Plugin Design Blue-Team Agents graph nlp

PDF

defense arXiv Oct 20, 2025 · Oct 2025

BlueCodeAgent: A Blue Teaming Agent Enabled by Automated Red Teaming for CodeGen AI

Chengquan Guo, Yuzhou Nie, Chulin Xie et al. · University of Chicago · UC Santa Barbara +3 more

Blue teaming agent for CodeGen LLMs using automated red teaming to detect malicious instructions and vulnerable code outputs

Prompt Injection Blue-Team Agents Vulnerability Discovery Red-Team Agents nlp

PDF

defense arXiv Sep 16, 2025 · Sep 2025

A Multi-Agent LLM Defense Pipeline Against Prompt Injection Attacks

S M Asif Hossain, Ruksat Khan Shayoni, Mohd Ruhul Ameen et al. · Wichita State University · Marshall University +3 more

Multi-agent LLM defense pipeline reduces prompt injection attack success rate from 30% to 0% across 400 attack instances

Prompt Injection Blue-Team Agents nlp

PDF

defense arXiv Sep 9, 2025 · Sep 2025

AgentSentinel: An End-to-End and Real-Time Security Defense Framework for Computer-Use Agents

Haitao Hu, Peng Chen, Yanpeng Zhao et al. · ShanghaiTech University

Defends LLM computer-use agents from harmful autonomous tool executions via real-time operation interception and context-aware security auditing

Excessive Agency Prompt Injection Blue-Team Agents Benchmarks & Evaluation nlp

PDF

benchmark arXiv Aug 26, 2025 · Aug 2025

Reliable Weak-to-Strong Monitoring of LLM Agents

Neil Kale, Chen Bo Calvin Zhang, Kevin Zhu et al. · Scale AI · Carnegie Mellon University +1 more

Stress-tests LLM agent monitors via red-teaming and proposes hybrid scaffolding enabling weak-to-strong reliable monitoring

Excessive Agency Prompt Injection Blue-Team Agents Benchmarks & Evaluation nlp

PDF
