New axis

LLMs for Security

Research on using LLMs as offensive or defensive security tools: autonomous pentesters, exploit synthesis, LLM-driven fuzzing, SOC copilots, and reverse-engineering (RE) assistants. Orthogonal to OWASP: a paper here is about using LLMs to find or fix vulnerabilities, not about attacking LLMs themselves.

100 papers tagged

Categories

LS01 Vulnerability Discovery
10

LLM-driven bug finding in source code, binaries, or systems

Top paper

Takedown: How It's Done in Modern Coding Agent Exploits

3 citations

LS02 Exploit Generation
5

LLMs synthesizing PoCs, exploits, and payloads

Top paper

Red Teaming Program Repair Agents: When Correct Patches can Hide Vulnerabilities

2 citations

LS03 Reconnaissance & OSINT
1

Target enumeration and attack-surface mapping with LLMs

Top paper

Red-Teaming Coding Agents from a Tool-Invocation Perspective: An Empirical Security Assessment

LS04 Patch & Remediation
2

LLM-suggested security fixes and program repair

Top paper

Emergent Formal Verification: How an Autonomous AI Ecosystem Independently Discovered SMT-Based Safety Across Six Domains

LS05 Triage & Prioritization
2

LLMs for severity scoring, dedup, advisory drafting

Top paper

Red-Teaming Claude Opus and ChatGPT-based Security Advisors for Trusted Execution Environments

1 citation

LS06 Red-Team Agents
63

Autonomous LLM offensive agents (PentestGPT-class)

Top paper

RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

9 citations

LS07 Blue-Team Agents
22

Defensive LLM agents — SOC copilots, IR, threat hunting

Top paper

ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

2 citations

LS08 Reverse Engineering
1

LLM-assisted RE, decompilation, malware analysis

Top paper

NeuroDeX: Unlocking Diverse Support in Decompiling Deep Neural Network Executables

LS09 Fuzzing & Test Generation
2

LLM-driven fuzzers, harness and seed generation

Top paper

HFuzzer: Testing Large Language Models for Package Hallucinations via Phrase-based Fuzzing

1 citation

LS10 Benchmarks & Evaluation
53

Datasets and benchmarks for LLM4SEC

Top paper

OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs

4 citations

tool arXiv Apr 30, 2026 · 21d ago

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

Yanting Wang, Chenlong Yin, Ying Chen et al. · The Pennsylvania State University

Efficient red-teaming framework achieving 2-7x speedup for optimization-based prompt injection and knowledge corruption attacks on long-context LLMs

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp
PDF Code
defense arXiv Apr 27, 2026 · 24d ago

Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents

Jiaqi Li, Yang Zhao, Bin Sun et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +3 more

Self-play security training framework teaching AI agents to detect prompt injection, memory poisoning, and supply-chain attacks via role alternation

AI Supply Chain Attacks Prompt Injection Excessive Agency Blue-Team Agents nlp
PDF
defense arXiv Apr 25, 2026 · 26d ago

Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents

Yuandao Cai, Wensheng Tang, Cheng Wen et al. · The Hong Kong University of Science and Technology · Xidian University

Taint tracking framework that detects malicious data flows in LLM agents from untrusted sources to privileged actions

Prompt Injection Insecure Plugin Design Blue-Team Agents nlp
PDF
attack arXiv Apr 24, 2026 · 27d ago

Training a General Purpose Automated Red Teaming Model

Aishwarya Padmakumar, Leon Derczynski, Traian Rebedea et al. · NVIDIA

Trains general-purpose LLM red teaming models that generalize to arbitrary adversarial goals without pre-existing evaluators

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp
PDF
tool arXiv Apr 23, 2026 · 28d ago

MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks

Run Hao, Zhuoran Tan · Aarhus University · University of Glasgow

Security testing framework for MCP tool servers detecting developer pitfalls through static analysis and trace-based validation

AI Supply Chain Attacks Insecure Plugin Design Prompt Injection Benchmarks & Evaluation Blue-Team Agents multimodal nlp
PDF
attack arXiv Apr 20, 2026 · 4w ago

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

Yuan Fang, Yiming Luo, Aimin Zhou et al. · East China Normal University · Shanghai Innovation Institute

Automated red-teaming framework generating diverse toxic datasets via inverted constitutional AI to test LLM safety mechanisms

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp
PDF Code
benchmark arXiv Apr 8, 2026 · 6w ago

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang et al. · CyCraft · National Taiwan University

Benchmark evaluating LLM safety guardrails on multi-step agent tool-calling trajectories across 12 risk categories including prompt injection

Prompt Injection Insecure Plugin Design Excessive Agency Benchmarks & Evaluation Blue-Team Agents nlp
PDF
defense arXiv Mar 26, 2026 · 8w ago

Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

Hieu Xuan Le, Benjamin Goh, Quy Anh Tang · GovTech

Lightweight LLM judges with structured reasoning detect jailbreaks and prompt injections in production chatbots under strict latency constraints

Prompt Injection Blue-Team Agents nlp
PDF
attack arXiv Mar 26, 2026 · 8w ago

The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities

Ron Litvak · Columbia University

System prompt engineering creates exploitable phishing detection vulnerabilities in LLM email agents despite strong benchmark performance

Input Manipulation Attack Prompt Injection Excessive Agency Blue-Team Agents Benchmarks & Evaluation nlp
PDF
attack arXiv Mar 25, 2026 · 8w ago

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov, Peter Romov, Igor Shilov et al. · MATS · ELLIS Institute Tübingen +3 more

AI agent autonomously discovers novel white-box jailbreak attacks outperforming 30+ existing methods with 100% ASR on target models

Input Manipulation Attack Prompt Injection Red-Team Agents Exploit Generation nlp
PDF Code
defense arXiv Mar 22, 2026 · 8w ago

Emergent Formal Verification: How an Autonomous AI Ecosystem Independently Discovered SMT-Based Safety Across Six Domains

Octavian Untila · Aisophical SRL

Autonomous AI system independently discovers SMT-based formal verification for AI safety across six domains with 100% accuracy

Output Integrity Attack Insecure Plugin Design Excessive Agency Vulnerability Discovery Patch & Remediation nlp multimodal
PDF
attack arXiv Mar 19, 2026 · 9w ago

Automated Membership Inference Attacks: Discovering MIA Signal Computations using LLM Agents

Toan Tran, Olivera Kotevska, Li Xiong · Emory University · Oak Ridge National Laboratory

LLM-agent framework that automatically discovers novel membership inference attack strategies, achieving 0.18 AUC improvement over existing MIAs

Membership Inference Attack Vulnerability Discovery Red-Team Agents
PDF

Most cited

attack arXiv Oct 6, 2025 · Oct 2025

RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov et al. · University of Maryland · Meta

Trains RL attacker from scratch to perform prompt injection, achieving 98% ASR against GPT-4o and bypassing Instruction Hierarchy and SecAlign defenses

Prompt Injection Red-Team Agents nlp
9 citations PDF Code
attack arXiv Sep 25, 2025 · Sep 2025

Automatic Red Teaming LLM-based Agents with Model Context Protocol Tools

Ping He, Changjiang Li, Binbin Zhao et al. · Zhejiang University · Palo Alto Networks

Automates generation of malicious MCP tools that manipulate LLM agent behavior while evading current detection mechanisms

Insecure Plugin Design Prompt Injection Red-Team Agents nlp
6 citations PDF
tool arXiv Jan 4, 2026 · Jan 2026

OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs

Xin Wang, Yunhao Chen, Juncheng Li et al. · Shanghai Artificial Intelligence Laboratory

Open-source MLLM red-teaming framework integrating 37 attacks, revealing up to 49% ASR on frontier models including GPT-5.2 and Claude 4.5

Input Manipulation Attack Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp multimodal vision
4 citations 1 influential PDF Code
attack arXiv Sep 29, 2025 · Sep 2025

Takedown: How It's Done in Modern Coding Agent Exploits

Eunkyu Lee, Donghyeon Kim, Wonyoung Kim et al. · KAIST

Exploits 15 vulnerabilities in 8 real coding agents via insecure tool design, achieving command execution and data exfiltration without user interaction

Insecure Plugin Design Excessive Agency Red-Team Agents Vulnerability Discovery nlp
3 citations PDF
attack arXiv Oct 31, 2025 · Oct 2025

Diffusion LLMs are Natural Adversaries for any LLM

David Lüdke, Tom Wollschläger, Paul Ungermann et al. · Technical University of Munich

Uses Diffusion LLMs as amortized jailbreak generators, producing low-perplexity transferable harmful prompts against black-box and proprietary LLMs

Prompt Injection Red-Team Agents nlp generative
3 citations PDF Code
attack arXiv Nov 20, 2025 · Nov 2025

AutoBackdoor: Automating Backdoor Attacks via LLM Agents

Yige Li, Zhe Li, Wei Zhao et al. · Singapore Management University · The University of Melbourne +1 more

Automates LLM backdoor injection via LLM agents generating semantic triggers, achieving 90%+ success rate while evading state-of-the-art defenses

Model Poisoning Training Data Poisoning Red-Team Agents nlp
2 citations PDF Code