ML Security Papers

Latest papers

1,367 papers

survey arXiv Apr 30, 2026 · 21d ago

Security Attack and Defense Strategies for Autonomous Agent Frameworks: A Layered Review with OpenClaw as a Case Study

Luyao Xu, Xiang Chen · Nantong University · Nanjing University

Layered security review of LLM agent frameworks covering prompt injection, tool misuse, state persistence attacks, and ecosystem vulnerabilities

Prompt Injection Insecure Plugin Design Excessive Agency nlp

PDF

tool arXiv Apr 30, 2026 · 21d ago

Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

Hongliang Liu, Tung-Ling Li, Yuhao Wu · Palo Alto Networks

Two-pass perturbation probing identifies 50-neuron safety refusal circuits in aligned LLMs, enabling precision ablation interventions

Prompt Injection nlp

PDF

defense arXiv Apr 30, 2026 · 21d ago

TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

Bowen Sun, Chaozhuo Li, Yaodong Yang et al. · Johns Hopkins University · Microsoft Research Asia +2 more

Dual-encoder defense that clusters fragmented malicious prompts across anonymous LLM requests using asymmetric contrastive learning

Prompt Injection nlp

PDF

tool arXiv Apr 30, 2026 · 21d ago

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

Yanting Wang, Chenlong Yin, Ying Chen et al. · The Pennsylvania State University

Efficient red-teaming framework achieving 2-7x speedup for optimization-based prompt injection and knowledge corruption attacks on long-context LLMs

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF Code

defense arXiv Apr 30, 2026 · 21d ago

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

Jona te Lintelo, Lichao Wu, Marina Krček et al. · Radboud University · University of Bristol +2 more

Reconfigures MoE LLM safety behavior by steering expert routing at inference time without retraining, defending against jailbreaks

Prompt Injection nlp

PDF

benchmark arXiv Apr 29, 2026 · 22d ago

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

Wenhao Lan, Shan Li, Junbin Yang et al. · University of Chinese Academy of Sciences · Inner Mongolia University of Technology +1 more

Mechanistic analysis showing adversarial fine-tuning reorganizes LLM refusal representations across layers while navigating robustness-utility tradeoffs

Prompt Injection nlp

PDF

benchmark arXiv Apr 29, 2026 · 22d ago

Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

Matteo Leonesi, Francesco Belardinelli, Flavio Corradini et al. · University of Camerino · Imperial College London

Detects LLM alignment faking via tool selection mismatches between monitored and unmonitored contexts in enterprise IT scenarios

Prompt Injection Excessive Agency nlp

PDF Code

defense arXiv Apr 29, 2026 · 22d ago

SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

Yuan Xin, Yixuan Weng, Minjun Zhu et al. · CISPA · Westlake University +3 more

GAN-inspired co-evolutionary framework training attack generators and defenders to protect LLM review systems from hidden prompt injection

Prompt Injection nlp

PDF

benchmark arXiv Apr 29, 2026 · 22d ago

Indirect Prompt Injection in the Wild: An Empirical Study of Prevalence, Techniques, and Objectives

Soheil Khodayari, Xuenan Zhang, Bhupendra Acharya et al. · Independent Researcher · CISPA Helmholtz Center for Information Security +1 more

Discovers 15.3K real-world indirect prompt injections across 1.2B URLs targeting LLM crawlers, agents, and automation systems

Prompt Injection nlpmultimodal

PDF

defense arXiv Apr 28, 2026 · 23d ago

SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

Mengyao Du, Han Fang, Haokai Ma et al. · National University of Defense Technology · University of Science and Technology of China +2 more

Lightweight detector that identifies prompt injection attacks in web agent screenshots using visual gradient analysis and text recovery

Prompt Injection Excessive Agency multimodalnlp

PDF

defense arXiv Apr 28, 2026 · 23d ago

Cross-Lingual Jailbreak Detection via Semantic Codebooks

Shirin Alanova, Bogdan Minko, Sabrina Sadiekh et al. · ITMO University · HiveTraceLab

Training-free jailbreak detector using multilingual embeddings matched against English codebook, effective on templates but degrades on diverse attacks

Prompt Injection nlpmultimodal

PDF

attack arXiv Apr 28, 2026 · 23d ago

One Perturbation, Two Failure Modes: Probing VLM Safety via Embedding-Guided Typographic Perturbations

Ravikumar Balakrishnan, Sanket Mendapara · Cisco Systems

Adversarial visual perturbations that bypass VLM safety filters via embedding-guided typographic optimization, exploiting both readability and alignment weaknesses

Input Manipulation Attack Prompt Injection visionnlpmultimodal

PDF

attack arXiv Apr 28, 2026 · 23d ago

Test-Time Safety Alignment

Baturay Saglam, Dionysis Kalogerias · Yale University

Gradient-based embedding optimization that bypasses LLM safety alignment to neutralize refusals on harmful queries

Input Manipulation Attack Prompt Injection nlp

PDF

benchmark arXiv Apr 27, 2026 · 24d ago

A Comparative Evaluation of AI Agent Security Guardrails

Qi Li, Jiu Li, Pingtao Wei et al. · Beijing Caizhi Tech

Benchmarks four commercial AI agent security guardrails on detecting prompt injection, instruction override, and harmful content requests

Prompt Injection Insecure Plugin Design nlp

PDF

defense arXiv Apr 27, 2026 · 24d ago

Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents

Jiaqi Li, Yang Zhao, Bin Sun et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +3 more

Self-play security training framework teaching AI agents to detect prompt injection, memory poisoning, and supply-chain attacks via role alternation

AI Supply Chain Attacks Prompt Injection Excessive Agency Blue-Team Agents nlp

PDF

defense arXiv Apr 27, 2026 · 24d ago

AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization

Zonghao Ying, Haozheng Wang, Jiangfan Liu et al. · Beihang University · 360 AI Security Lab +1 more

OS-inspired defense framework that intercepts LLM agent tool calls and enforces privilege separation to block prompt injection attacks

Prompt Injection Excessive Agency nlp

PDF

benchmark arXiv Apr 27, 2026 · 24d ago

GAMMAF: A Common Framework for Graph-Based Anomaly Monitoring Benchmarking in LLM Multi-Agent Systems

Pablo Mateo-Torrejón, Alfonso Sánchez-Macián · University Carlos III of Madrid

Benchmarking framework for evaluating graph-based defenses against prompt injection and adversarial agents in LLM multi-agent systems

Prompt Injection Excessive Agency nlpgraph

PDF

attack arXiv Apr 27, 2026 · 24d ago

Jailbreaking Frontier Foundation Models Through Intention Deception

Xinhe Wang, Katia Sycara, Yaqi Xie · Carnegie Mellon University

Multi-turn jailbreaking attack that deceives LLM safety by simulating benign intent across conversations to elicit harmful outputs

Prompt Injection nlpmultimodal

PDF

attack arXiv Apr 27, 2026 · 24d ago

Adaptive Prompt Embedding Optimization for LLM Jailbreaking

Miles Q. Li, Benjamin C. M. Fung, Boyang Li et al. · McGill University · Kean University +2 more

White-box jailbreak optimizing prompt embeddings directly instead of appending adversarial tokens, achieving higher success rates

Input Manipulation Attack Prompt Injection nlp

PDF

defense arXiv Apr 27, 2026 · 24d ago

Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models

Nay Myat Min, Long H. Pham, Jun Sun · Singapore Management University

Tuning-free runtime monitor detecting backdoors, jailbreaks, and prompt injection by analyzing hidden-state convergence patterns across LLM layers

Model Poisoning Prompt Injection nlp

PDF

Loading more papers…

Latest papers

Security Attack and Defense Strategies for Autonomous Agent Frameworks: A Layered Review with OpenClaw as a Case Study

Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

Indirect Prompt Injection in the Wild: An Empirical Study of Prevalence, Techniques, and Objectives

SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

Cross-Lingual Jailbreak Detection via Semantic Codebooks

One Perturbation, Two Failure Modes: Probing VLM Safety via Embedding-Guided Typographic Perturbations

Test-Time Safety Alignment

A Comparative Evaluation of AI Agent Security Guardrails

Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents

AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization

GAMMAF: A Common Framework for Graph-Based Anomaly Monitoring Benchmarking in LLM Multi-Agent Systems

Jailbreaking Frontier Foundation Models Through Intention Deception

Adaptive Prompt Embedding Optimization for LLM Jailbreaking

Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue