Latest papers

21 papers
defense arXiv Apr 6, 2026 · 2d ago

ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems

Zhuowen Yuan, Zhaorun Chen, Zhen Xiang et al. · University of Illinois Urbana-Champaign · Virtue AI +6 more

Network-level guardrail detecting supply-chain poisoning in LLM agent MCP tools via MITM proxy monitoring network behaviors

AI Supply Chain Attacks Insecure Plugin Design nlp
PDF
benchmark arXiv Apr 1, 2026 · 7d ago

Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models

Weidi Luo, Xiaofei Wen, Tenghao Huang et al. · University of Georgia · University of California +3 more

Benchmark and guardrail for detecting jailbreak attacks that bypass LLM safety alignment in food safety domain

Prompt Injection nlp
PDF Code
defense arXiv Mar 12, 2026 · 27d ago

Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment

Zhiyu Xue, Zimo Qi, Guangliang Liu et al. · University of California · Johns Hopkins University +2 more

Analyzes refusal trigger mechanisms in LLM safety alignment to reduce overrefusal while maintaining jailbreak defenses

Prompt Injection nlp
PDF
defense arXiv Feb 7, 2026 · 8w ago

AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management

Ruoyao Wen, Hao Li, Chaowei Xiao et al. · Washington University in St. Louis · Johns Hopkins University

Defends LLM agents against indirect prompt injection using OS-inspired hierarchical memory isolation and schema-validated context boundaries

Prompt Injection Excessive Agency nlp
PDF Code
benchmark arXiv Feb 3, 2026 · 9w ago

AgentDyn: A Dynamic Open-Ended Benchmark for Evaluating Prompt Injection Attacks of Real-World Agent Security System

Hao Li, Ruoyao Wen, Shanghao Shi et al. · Washington University in St. Louis · Johns Hopkins University

New dynamic benchmark exposing that all existing indirect prompt injection defenses fail real-world agent deployment requirements

Prompt Injection nlp
PDF Code
attack arXiv Jan 29, 2026 · 9w ago

ReasoningBomb: A Stealthy Denial-of-Service Attack by Inducing Pathologically Long Reasoning in Large Reasoning Models

Xiaogeng Liu, Xinyan Wang, Yechao Zhang et al. · Johns Hopkins University · NVIDIA +4 more

RL-trained attacker generates short natural prompts that force LRMs into pathologically long reasoning, achieving 286x amplification and >98% detection bypass

Model Denial of Service nlpreinforcement-learning
PDF
defense arXiv Jan 15, 2026 · 11w ago

ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack

Hao Li, Yankai Yang, G. Edward Suh et al. · Washington University in St. Louis · University of Wisconsin–Madison +2 more

Defends LLM agents against indirect prompt injection using structured reasoning to detect conflicting injected instructions

Prompt Injection nlp
1 citations PDF Code
benchmark arXiv Jan 12, 2026 · 12w ago

Defenses Against Prompt Attacks Learn Surface Heuristics

Shawn Li, Chenxiao Yu, Zhiyu Ni et al. · University of Southern California · University of California +3 more

Exposes three shortcut biases in LLM prompt-injection defenses: position, token-trigger, and topic generalization—causing up to 90% false rejection rates

Prompt Injection nlp
PDF Code
benchmark arXiv Dec 25, 2025 · Dec 2025

The Deepfake Detective: Interpreting Neural Forensics Through Sparse Features and Manifolds

Subramanyam Sahoo, Jared Junkin · University of California · Johns Hopkins University

Interprets deepfake detector internals using sparse autoencoders and forensic manifold analysis on a 2B-parameter VLM

Output Integrity Attack visionmultimodal
PDF Code
defense arXiv Nov 30, 2025 · Nov 2025

Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis

Mintong Kang, Chong Xiang, Sanjay Kariyappa et al. · NVIDIA · University of Illinois Urbana-Champaign +1 more

Defends LLM agents against indirect prompt injection by analyzing whether the model intends to follow untrusted instructions, cutting attack success from 100% to 8.5%

Prompt Injection nlp
1 citations PDF
defense arXiv Nov 8, 2025 · Nov 2025

CGCE: Classifier-Guided Concept Erasure in Generative Models

Viet Nguyen, Vishal M. Patel · Johns Hopkins University

Plug-and-play classifier-guided guardrail for T2I/T2V models that resists red-teaming attacks bypassing erased unsafe concepts

Input Manipulation Attack Prompt Injection visiongenerativemultimodal
PDF
benchmark arXiv Oct 8, 2025 · Oct 2025

Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent

Weidi Luo, Qiming Zhang, Tianyu Lu et al. · University of Georgia · University of Wisconsin–Madison +6 more

Benchmarks LLM-powered agents' ability to execute end-to-end enterprise intrusions aligned with MITRE ATT&CK TTPs

Excessive Agency Prompt Injection nlpmultimodal
4 citations PDF Code
attack arXiv Oct 6, 2025 · Oct 2025

AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling

Xiaogeng Liu, Chaowei Xiao · Johns Hopkins University

Scales AutoDAN-Turbo jailbreaks via Best-of-N and Beam Search strategy search, boosting LLM attack success by up to 15.6 pp

Prompt Injection nlp
PDF Code
attack arXiv Oct 1, 2025 · Oct 2025

Backdoor Attacks Against Speech Language Models

Alexandrine Fortier, Thomas Thebaud, Jesús Villalba et al. · École de technologie supérieure · Johns Hopkins University

First systematic backdoor attack study on speech LLMs, achieving 90–99% success across four encoders, with component-level propagation analysis

Model Poisoning Transfer Learning Attack audiomultimodalnlp
1 citations PDF
attack arXiv Sep 30, 2025 · Sep 2025

CHAI: Command Hijacking against embodied AI

Luis Burbano, Diego Ortiz, Qi Sun et al. · University of California · Johns Hopkins University

Embeds deceptive adversarial text signs in physical environments to hijack LVLM-controlled robotic vehicles and drones

Input Manipulation Attack Prompt Injection visionmultimodalnlp
PDF
defense arXiv Sep 25, 2025 · Sep 2025

HuLA: Prosody-Aware Anti-Spoofing with Multi-Task Learning for Expressive and Emotional Synthetic Speech

Aurosweta Mahapatra, Ismail Rasim Ulgen, Berrak Sisman · Johns Hopkins University

Proposes prosody-aware multi-task SSL framework to detect expressive and emotional synthetic speech via F0 and voiced/unvoiced cues

Output Integrity Attack audio
PDF
attack arXiv Sep 24, 2025 · Sep 2025

Improving Generalizability and Undetectability for Targeted Adversarial Attacks on Multimodal Pre-trained Models

Zhifang Zhang, Jiahan Zhang, Shengjie Zhou et al. · Southeast University · Johns Hopkins University +3 more

Proposes Proxy Targeted Attack to craft generalizable, anomaly-evasive adversarial examples against multimodal encoders like ImageBind

Input Manipulation Attack visionmultimodalnlp
2 citations PDF
defense arXiv Sep 4, 2025 · Sep 2025

NE-PADD: Leveraging Named Entity Knowledge for Robust Partial Audio Deepfake Detection via Attention Aggregation

Huhong Xian, Rui Liu, Berrak Sisman et al. · Inner Mongolia University · Johns Hopkins University +1 more

Detects frame-level synthetic speech segments in partial audio deepfakes using named entity recognition and attention aggregation

Output Integrity Attack audio
PDF Code
attack arXiv Aug 12, 2025 · Aug 2025

Multi-Target Backdoor Attacks Against Speaker Recognition

Alexandrine Fortier, Sonal Joshi, Thomas Thebaud et al. · École de technologie supérieure · Johns Hopkins University

Multi-target backdoor attack on speaker recognition using clicking-sound triggers, poisoning up to 50 speakers at 95% success rate

Model Poisoning audio
PDF
defense arXiv Aug 4, 2025 · Aug 2025

Localizing Audio-Visual Deepfakes via Hierarchical Boundary Modeling

Xuanjun Chen, Shih-Peng Cheng, Jiawei Du et al. · National Taiwan University · Johns Hopkins University +1 more

Novel hierarchical boundary modeling network that temporally localizes manipulated segments in audio-visual deepfake content

Output Integrity Attack multimodalaudiovision
PDF
Loading more papers…