New axis

LLMs for Security

Research on using LLMs as offensive or defensive security tools — autonomous pentesters, exploit synthesis, LLM-driven fuzzing, SOC copilots, RE assistants. Orthogonal to OWASP: a paper here is about using LLMs to find or fix vulnerabilities, not about attacking LLMs themselves.

100 papers tagged

Browse all →

Recent

See all →

tool arXiv Apr 30, 2026 · 21d ago

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

Yanting Wang, Chenlong Yin, Ying Chen et al. · The Pennsylvania State University

Efficient red-teaming framework achieving 2-7x speedup for optimization-based prompt injection and knowledge corruption attacks on long-context LLMs

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF Code

defense arXiv Apr 27, 2026 · 24d ago

Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents

Jiaqi Li, Yang Zhao, Bin Sun et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +3 more

Self-play security training framework teaching AI agents to detect prompt injection, memory poisoning, and supply-chain attacks via role alternation

AI Supply Chain Attacks Prompt Injection Excessive Agency Blue-Team Agents nlp

PDF

defense arXiv Apr 25, 2026 · 26d ago

Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents

Yuandao Cai, Wensheng Tang, Cheng Wen et al. · The Hong Kong University of Science and Technology · Xidian University

Taint tracking framework that detects malicious data flows in LLM agents from untrusted sources to privileged actions

Prompt Injection Insecure Plugin Design Blue-Team Agents nlp

PDF

attack arXiv Apr 24, 2026 · 27d ago

Training a General Purpose Automated Red Teaming Model

Aishwarya Padmakumar, Leon Derczynski, Traian Rebedea et al. · NVIDIA

Trains general-purpose LLM red teaming models that generalize to arbitrary adversarial goals without pre-existing evaluators

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF

tool arXiv Apr 23, 2026 · 28d ago

MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks

Run Hao, Zhuoran Tan · Aarhus University · University of Glasgow

Security testing framework for MCP tool servers detecting developer pitfalls through static analysis and trace-based validation

AI Supply Chain Attacks Insecure Plugin Design Prompt Injection Benchmarks & Evaluation Blue-Team Agents multimodalnlp

PDF

attack arXiv Apr 20, 2026 · 4w ago

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

Yuan Fang, Yiming Luo, Aimin Zhou et al. · East China Normal University · Shanghai Innovation Institute

Automated red-teaming framework generating diverse toxic datasets via inverted constitutional AI to test LLM safety mechanisms

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF Code

benchmark arXiv Apr 8, 2026 · 6w ago

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang et al. · CyCraft · National Taiwan University

Benchmark evaluating LLM safety guardrails on multi-step agent tool-calling trajectories across 12 risk categories including prompt injection

Prompt Injection Insecure Plugin Design Excessive Agency Benchmarks & Evaluation Blue-Team Agents nlp

PDF

defense arXiv Mar 26, 2026 · 8w ago

Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

Hieu Xuan Le, Benjamin Goh, Quy Anh Tang · GovTech

Lightweight LLM judges with structured reasoning detect jailbreaks and prompt injections in production chatbots under strict latency constraints

Prompt Injection Blue-Team Agents nlp

PDF

attack arXiv Mar 26, 2026 · 8w ago

The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities

Ron Litvak · Columbia University

System prompt engineering creates exploitable phishing detection vulnerabilities in LLM email agents despite strong benchmark performance

Input Manipulation Attack Prompt Injection Excessive Agency Blue-Team Agents Benchmarks & Evaluation nlp

PDF

attack arXiv Mar 25, 2026 · 8w ago

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov, Peter Romov, Igor Shilov et al. · MATS · ELLIS Institute Tübingen +3 more

AI agent autonomously discovers novel white-box jailbreak attacks outperforming 30+ existing methods with 100% ASR on target models

Input Manipulation Attack Prompt Injection Red-Team Agents Exploit Generation nlp

PDF Code

defense arXiv Mar 22, 2026 · 8w ago

Emergent Formal Verification: How an Autonomous AI Ecosystem Independently Discovered SMT-Based Safety Across Six Domains

Octavian Untila · Aisophical SRL

Autonomous AI system independently discovers SMT-based formal verification for AI safety across six domains with 100% accuracy

Output Integrity Attack Insecure Plugin Design Excessive Agency Vulnerability Discovery Patch & Remediation nlpmultimodal

PDF

attack arXiv Mar 19, 2026 · 9w ago

Automated Membership Inference Attacks: Discovering MIA Signal Computations using LLM Agents

Toan Tran, Olivera Kotevska, Li Xiong · Emory University · Oak Ridge National Laboratory

LLM-agent framework that automatically discovers novel membership inference attack strategies, achieving 0.18 AUC improvement over existing MIAs

Membership Inference Attack Vulnerability Discovery Red-Team Agents

PDF

Most cited

See all →

attack arXiv Oct 6, 2025 · Oct 2025

RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov et al. · University of Maryland · Meta

Trains RL attacker from scratch to perform prompt injection, achieving 98% ASR against GPT-4o and bypassing Instruction Hierarchy and SecAlign defenses

Prompt Injection Red-Team Agents nlp

9 citations PDF Code

attack arXiv Sep 25, 2025 · Sep 2025

Automatic Red Teaming LLM-based Agents with Model Context Protocol Tools

Ping He, Changjiang Li, Binbin Zhao et al. · Zhejiang University · Palo Alto Networks

Automates generation of malicious MCP tools that manipulate LLM agent behavior while evading current detection mechanisms

Insecure Plugin Design Prompt Injection Red-Team Agents nlp

6 citations PDF

tool arXiv Jan 4, 2026 · Jan 2026

OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs

Xin Wang, Yunhao Chen, Juncheng Li et al. · Shanghai Artificial Intelligence Laboratory

Open-source MLLM red-teaming framework integrating 37 attacks, revealing up to 49% ASR on frontier models including GPT-5.2 and Claude 4.5

Input Manipulation Attack Prompt Injection Red-Team Agents Benchmarks & Evaluation nlpmultimodalvision

4 citations 1 influentialPDF Code

attack arXiv Sep 29, 2025 · Sep 2025

Takedown: How It's Done in Modern Coding Agent Exploits

Eunkyu Lee, Donghyeon Kim, Wonyoung Kim et al. · KAIST

Exploits 15 vulnerabilities in 8 real coding agents via insecure tool design, achieving command execution and data exfiltration without user interaction

Insecure Plugin Design Excessive Agency Red-Team Agents Vulnerability Discovery nlp

3 citations PDF

attack arXiv Oct 31, 2025 · Oct 2025

Diffusion LLMs are Natural Adversaries for any LLM

David Lüdke, Tom Wollschläger, Paul Ungermann et al. · Technical University of Munich

Uses Diffusion LLMs as amortized jailbreak generators, producing low-perplexity transferable harmful prompts against black-box and proprietary LLMs

Prompt Injection Red-Team Agents nlpgenerative

3 citations PDF Code

attack arXiv Nov 20, 2025 · Nov 2025

AutoBackdoor: Automating Backdoor Attacks via LLM Agents

Yige Li, Zhe Li, Wei Zhao et al. · Singapore Management University · The University of Melbourne +1 more

Automates LLM backdoor injection via LLM agents generating semantic triggers, achieving 90%+ success rate while evading state-of-the-art defenses

Model Poisoning Training Data Poisoning Red-Team Agents nlp

2 citations PDF Code

LLMs for Security

Categories

Recent

Most cited