Latest papers

48 papers
defense arXiv Apr 12, 2026 · 5w ago

Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation

Zeqian Long, Ozgur Kara, Haotian Xue et al. · University of Illinois Urbana-Champaign · Georgia Institute of Technology

Adversarial immunization that corrupts image-to-video generation by enforcing temporal latent divergence and trajectory misalignment across frames

Input Manipulation Attack vision multimodal generative
PDF Code
attack arXiv Apr 11, 2026 · 5w ago

When Can You Poison Rewards? A Tight Characterization of Reward Poisoning in Linear MDPs

Jose Efraim Aguilar Escamilla, Haoyang Hong, Jiawei Li et al. · Oregon State University · University of Illinois Urbana-Champaign +2 more

Characterizes when reward poisoning attacks can force RL agents to adopt attacker-chosen policies in linear MDPs

Model Skewing reinforcement-learning
PDF
defense arXiv Apr 10, 2026 · 5w ago

AudioGuard: Toward Comprehensive Audio Safety Protection Across Diverse Threat Models

Mintong Kang, Chen Fang, Bo Li · University of Illinois Urbana-Champaign

Comprehensive audio safety guardrail detecting harmful sounds, voice impersonation, child voice misuse, and risky voice-content combinations

Input Manipulation Attack Output Integrity Attack Prompt Injection audio nlp multimodal
PDF
defense arXiv Apr 6, 2026 · 6w ago

ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems

Zhuowen Yuan, Zhaorun Chen, Zhen Xiang et al. · University of Illinois Urbana-Champaign · Virtue AI +6 more

Network-level guardrail detecting supply-chain poisoning in LLM agent MCP tools via MITM proxy monitoring network behaviors

AI Supply Chain Attacks Insecure Plugin Design nlp
PDF
defense arXiv Apr 5, 2026 · 6w ago

CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks

Siyuan Li, Zehao Liu, Xi Lin et al. · Shanghai Jiao Tong University · University of Illinois Urbana-Champaign +1 more

Multi-agent cooperative defense system that adapts across rounds to counter evolving LLM jailbreak attacks through deception and forensic analysis

Prompt Injection Excessive Agency nlp
PDF
attack arXiv Apr 3, 2026 · 6w ago

Beyond Semantic Manipulation: Token-Space Attacks on Reward Models

Yuheng Zhang, Mingyue Huo, Minghao Zhu et al. · University of Illinois Urbana-Champaign · University of Massachusetts Amherst

Token-space adversarial attack on RLHF reward models that bypasses semantic constraints to generate nonsensical high-reward outputs

Input Manipulation Attack nlp
PDF
attack arXiv Mar 23, 2026 · 8w ago

Adversarial Vulnerabilities in Neural Operator Digital Twins: Gradient-Free Attacks on Nuclear Thermal-Hydraulic Surrogates

Samrendra Roy, Kazuma Kobayashi, Souvik Chakraborty et al. · University of Illinois Urbana-Champaign · Indian Institute of Technology Delhi +1 more

Gradient-free adversarial attacks on neural operator digital twins causing catastrophic field prediction failures through sparse physically-plausible perturbations

Input Manipulation Attack vision
PDF
benchmark arXiv Mar 11, 2026 · 10w ago

Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models

Xiangwen Wang, Ananth Balashankar, Varun Chandrasekaran · Google DeepMind · University of Illinois Urbana-Champaign

Scaling-law framework comparing four LLM jailbreak paradigms by FLOPs budget, finding prompt-based attacks dominate compute efficiency

Input Manipulation Attack Prompt Injection nlp
PDF
survey arXiv Mar 11, 2026 · 10w ago

The Attack and Defense Landscape of Agentic AI: A Comprehensive Survey

Juhee Kim, Xiaoyuan Liu, Zhun Wang et al. · University of California · Seoul National University +1 more

Surveys attacks and defenses across agentic LLM systems, covering prompt injection, insecure tool use, and excessive agency risks

Prompt Injection Insecure Plugin Design Excessive Agency nlp multimodal
PDF
benchmark arXiv Feb 13, 2026 · Feb 2026

Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

Xu Li, Simon Yu, Minzhou Pan et al. · Northeastern University · Virtue AI +2 more

Benchmarks multi-turn jailbreaks in tool-using LLM agents and proposes ToolShield, a self-exploration defense reducing ASR by 30%

Prompt Injection Insecure Plugin Design nlp
PDF Code
benchmark arXiv Feb 7, 2026 · Feb 2026

Agent-Fence: Mapping Security Vulnerabilities Across Deep Research Agents

Sai Puppala, Ismail Hossain, Md Jahangir Alam et al. · Southern Illinois University · University of Texas +2 more

Benchmarks LLM agent architectures across 14 attack classes, exposing authorization confusion and tool hijacking as dominant structural risks

Excessive Agency Insecure Plugin Design Prompt Injection nlp
PDF
defense arXiv Feb 4, 2026 · Feb 2026

E-Globe: Scalable $ε$-Global Verification of Neural Networks via Tight Upper Bounds and Pattern-Aware Branching

Wenting Li, Saif R. Kazi, Russell Bent et al. · University of Texas at Austin · Los Alamos National Laboratory +1 more

Branch-and-bound neural network verifier using NLP-CC upper bounds to certify or disprove adversarial robustness more efficiently than MIP methods

Input Manipulation Attack vision
PDF
attack arXiv Jan 30, 2026 · Jan 2026

Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models

Ye Yu, Haibo Jin, Yaoning Yu et al. · University of Illinois Urbana-Champaign · Boise State University

Audio narrative jailbreak using TTS achieves 98.26% success rate against safety-aligned audio-language models like Gemini 2.0 Flash

Prompt Injection audio multimodal nlp
1 citation PDF
attack arXiv Jan 22, 2026 · Jan 2026

Connect the Dots: Knowledge Graph-Guided Crawler Attack on Retrieval-Augmented Generation Systems

Mengyu Yao, Ziqi Zhang, Ning Luo et al. · Peking University · University of Illinois Urbana-Champaign

Attacks RAG systems to steal private knowledge bases via knowledge-graph-guided adaptive queries, achieving 84.4% corpus coverage in 1,000 queries

Sensitive Information Disclosure nlp
PDF
attack arXiv Jan 21, 2026 · Jan 2026

Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation

Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim et al. · University of Michigan · LG AI Research +1 more

Crafted agent chain-of-thought reasoning inflates LLM/VLM judge false positives by up to 90% across 800 web-task trajectories

Prompt Injection nlp multimodal
1 citation PDF
benchmark arXiv Jan 19, 2026 · Jan 2026

Verifying Local Robustness of Pruned Safety-Critical Networks

Minh Le, Phuong Cao · Georgia Institute of Technology · University of Illinois Urbana-Champaign

Empirically shows pruning ratio non-linearly affects formal L∞ adversarial robustness certificates in safety-critical vision models

Input Manipulation Attack vision
PDF
attack arXiv Jan 16, 2026 · Jan 2026

Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents

Kaiyu Zhou, Yongsen Zheng, Yicheng He et al. · Nanyang Technological University · University of Illinois Urbana-Champaign +2 more

Stealthy multi-turn economic DoS attack manipulates MCP tool servers to inflate LLM agent costs 658x while keeping task outputs correct

Model Denial of Service Insecure Plugin Design nlp
2 citations 1 influential PDF
defense arXiv Jan 7, 2026 · Jan 2026

ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification

Xiao Lin, Philip Li, Zhichen Zeng et al. · University of Illinois Urbana-Champaign · Visa

Defends LLMs against jailbreaks by amplifying internal layer/module/token feature discrepancies to detect attacks without training examples

Prompt Injection nlp
2 citations PDF
attack arXiv Jan 6, 2026 · Jan 2026

Multi-Turn Jailbreaking of Aligned LLMs via Lexical Anchor Tree Search

Devang Kulshreshtha, Hang Su, Chinmay Hegde et al. · Amazon · New York University +1 more

Attacker-LLM-free multi-turn jailbreak via lexical anchor injection achieves 97-100% ASR on GPT/Claude/Llama in ~6.4 queries

Prompt Injection nlp
PDF
attack arXiv Jan 5, 2026 · Jan 2026

Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization

Jiwei Guan, Haibo Jin, Haohan Wang · Macquarie University · University of Illinois Urbana-Champaign

Black-box gradient-free attack crafts adversarial images to jailbreak vision-language models with 83% ASR

Input Manipulation Attack Prompt Injection vision nlp multimodal
PDF