ML Security Papers

Latest papers

48 papers

defense arXiv Apr 12, 2026 · 5w ago

Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation

Zeqian Long, Ozgur Kara, Haotian Xue et al. · University of Illinois Urbana-Champaign · Georgia Institute of Technology

Adversarial immunization that corrupts image-to-video generation by enforcing temporal latent divergence and trajectory misalignment across frames

Input Manipulation Attack vision multimodal generative

PDF Code

attack arXiv Apr 11, 2026 · 5w ago

When Can You Poison Rewards? A Tight Characterization of Reward Poisoning in Linear MDPs

Jose Efraim Aguilar Escamilla, Haoyang Hong, Jiawei Li et al. · Oregon State University · University of Illinois Urbana-Champaign +2 more

Characterizes when reward poisoning attacks can force RL agents to adopt attacker-chosen policies in linear MDPs

Model Skewing reinforcement-learning

PDF

defense arXiv Apr 10, 2026 · 5w ago

AudioGuard: Toward Comprehensive Audio Safety Protection Across Diverse Threat Models

Mintong Kang, Chen Fang, Bo Li · University of Illinois Urbana-Champaign

Comprehensive audio safety guardrail detecting harmful sounds, voice impersonation, child voice misuse, and risky voice-content combinations

Input Manipulation Attack Output Integrity Attack Prompt Injection audio nlp multimodal

PDF

defense arXiv Apr 6, 2026 · 6w ago

ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems

Zhuowen Yuan, Zhaorun Chen, Zhen Xiang et al. · University of Illinois Urbana-Champaign · Virtue AI +6 more

Network-level guardrail detecting supply-chain poisoning in LLM agent MCP tools via MITM proxy monitoring network behaviors

AI Supply Chain Attacks Insecure Plugin Design nlp

PDF

defense arXiv Apr 5, 2026 · 6w ago

CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks

Siyuan Li, Zehao Liu, Xi Lin et al. · Shanghai Jiao Tong University · University of Illinois Urbana-Champaign +1 more

Multi-agent cooperative defense system that adapts across rounds to counter evolving LLM jailbreak attacks through deception and forensic analysis

Prompt Injection Excessive Agency nlp

PDF

attack arXiv Apr 3, 2026 · 6w ago

Beyond Semantic Manipulation: Token-Space Attacks on Reward Models

Yuheng Zhang, Mingyue Huo, Minghao Zhu et al. · University of Illinois Urbana-Champaign · University of Massachusetts Amherst

Token-space adversarial attack on RLHF reward models that bypasses semantic constraints to generate nonsensical high-reward outputs

Input Manipulation Attack nlp

PDF

attack arXiv Mar 23, 2026 · 8w ago

Adversarial Vulnerabilities in Neural Operator Digital Twins: Gradient-Free Attacks on Nuclear Thermal-Hydraulic Surrogates

Samrendra Roy, Kazuma Kobayashi, Souvik Chakraborty et al. · University of Illinois Urbana-Champaign · Indian Institute of Technology Delhi +1 more

Gradient-free adversarial attacks on neural operator digital twins causing catastrophic field prediction failures through sparse physically-plausible perturbations

Input Manipulation Attack vision

PDF

Operator learning models are rapidly emerging as the predictive core of digital twins for nuclear and energy systems, promising real-time field reconstruction from sparse sensor measurements. Yet their robustness to adversarial perturbations remains uncharacterized, a critical gap for deployment in safety-critical systems. Here we show that neural operators are acutely vulnerable to extremely sparse (fewer than 1% of inputs), physically plausible perturbations that exploit their sensitivity to boundary conditions. Using gradient-free differential evolution across four operator architectures, we demonstrate that minimal modifications trigger catastrophic prediction failures, increasing relative $L_2$ error from $\sim$1.5% (validated accuracy) to 37-63% while remaining completely undetectable by standard validation metrics. Notably, 100% of successful single-point attacks pass z-score anomaly detection. We introduce the effective perturbation dimension $d_{\text{eff}}$, a Jacobian-based diagnostic that, together with sensitivity magnitude, yields a two-factor vulnerability model explaining why architectures with extreme sensitivity concentration (POD-DeepONet, $d_{\text{eff}} \approx 1$) are not necessarily the most exploitable, since low-rank output projections cap maximum error, while moderate concentration with sufficient amplification (S-DeepONet, $d_{\text{eff}} \approx 4$) produces the highest attack success. Gradient-free search outperforms gradient-based alternatives (PGD) on architectures with gradient pathologies, while random perturbations of equal magnitude achieve near-zero success rates, confirming that the discovered vulnerabilities are structural. Our findings expose a previously overlooked attack surface in operator learning models and establish that these models require robustness guarantees beyond standard validation before deployment.
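The two metrics named in the abstract above, relative $L_2$ error and a z-score anomaly check, can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's method: `toy_surrogate` is a hypothetical linear stand-in for a neural operator, the sensor statistics are invented, and a brute-force search over single-sensor changes replaces the differential evolution the authors use.

```python
import math


def relative_l2_error(pred, ref):
    """Relative L2 error: ||pred - ref||_2 / ||ref||_2."""
    num = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)))
    den = math.sqrt(sum(r ** 2 for r in ref))
    return num / den


def max_abs_z_score(x, mean, std):
    """Largest per-sensor |z| against assumed historical statistics."""
    return max(abs(xi - m) / s for xi, m, s in zip(x, mean, std))


def toy_surrogate(x):
    """Hypothetical surrogate: sensor 0 plays the role of a boundary
    condition and carries 100x the weight of every other sensor."""
    base = 10.0 * x[0] + 0.1 * sum(x[1:])
    return [base + 0.01 * j for j in range(5)]


def single_point_search(model, x, mean, std, deltas, z_limit=3.0):
    """Gradient-free brute force: change one sensor at a time, discard
    candidates a z-score detector would flag, keep the largest error."""
    ref = model(x)
    best = (0.0, None, 0.0)  # (relative L2 error, sensor index, delta)
    for i in range(len(x)):
        for d in deltas:
            x_adv = list(x)
            x_adv[i] += d
            if max_abs_z_score(x_adv, mean, std) >= z_limit:
                continue  # this perturbation would be detected
            err = relative_l2_error(model(x_adv), ref)
            if err > best[0]:
                best = (err, i, d)
    return best


# Assumed clean reading and sensor statistics (illustrative values only).
X_CLEAN = [1.0] * 8
MEAN = [1.0] * 8
STD = [0.2] * 8

if __name__ == "__main__":
    err, idx, delta = single_point_search(
        toy_surrogate, X_CLEAN, MEAN, STD, deltas=(-0.5, 0.5)
    )
    print(f"sensor {idx}, delta {delta:+.2f}, relative L2 error {err:.1%}")
```

In this toy setup the search settles on the heavily weighted sensor: a change of 2.5 standard deviations stays under the z-score limit of 3 yet shifts the whole output field, while the same change on any other sensor barely moves it. That mirrors, in miniature, the abstract's point that sensitivity concentrated on a few inputs matters more than perturbation size.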

traditional_ml University of Illinois Urbana-Champaign · Indian Institute of Technology Delhi · National Center for Supercomputing Applications

PDF arXiv

benchmark arXiv Mar 11, 2026 · 10w ago

Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models

Xiangwen Wang, Ananth Balashankar, Varun Chandrasekaran · Google DeepMind · University of Illinois Urbana-Champaign

Scaling-law framework comparing four LLM jailbreak paradigms by FLOPs budget, finding prompt-based attacks dominate compute efficiency

Input Manipulation Attack Prompt Injection nlp

PDF

survey arXiv Mar 11, 2026 · 10w ago

The Attack and Defense Landscape of Agentic AI: A Comprehensive Survey

Juhee Kim, Xiaoyuan Liu, Zhun Wang et al. · University of California · Seoul National University +1 more

Surveys attacks and defenses across agentic LLM systems, covering prompt injection, insecure tool use, and excessive agency risks

Prompt Injection Insecure Plugin Design Excessive Agency nlp multimodal

PDF

benchmark arXiv Feb 13, 2026 · Feb 2026

Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

Xu Li, Simon Yu, Minzhou Pan et al. · Northeastern University · Virtue AI +2 more

Benchmarks multi-turn jailbreaks in tool-using LLM agents and proposes ToolShield, a self-exploration defense reducing ASR by 30%

Prompt Injection Insecure Plugin Design nlp

PDF Code

benchmark arXiv Feb 7, 2026 · Feb 2026

Agent-Fence: Mapping Security Vulnerabilities Across Deep Research Agents

Sai Puppala, Ismail Hossain, Md Jahangir Alam et al. · Southern Illinois University · University of Texas +2 more

Benchmarks LLM agent architectures across 14 attack classes, exposing authorization confusion and tool hijacking as dominant structural risks

Excessive Agency Insecure Plugin Design Prompt Injection nlp

PDF

defense arXiv Feb 4, 2026 · Feb 2026

E-Globe: Scalable $ε$-Global Verification of Neural Networks via Tight Upper Bounds and Pattern-Aware Branching

Wenting Li, Saif R. Kazi, Russell Bent et al. · University of Texas at Austin · Los Alamos National Laboratory +1 more

Branch-and-bound neural network verifier using NLP-CC upper bounds to certify or disprove adversarial robustness more efficiently than MIP methods

Input Manipulation Attack vision

PDF

attack arXiv Jan 30, 2026 · Jan 2026

Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models

Ye Yu, Haibo Jin, Yaoning Yu et al. · University of Illinois Urbana-Champaign · Boise State University

Audio narrative jailbreak using TTS achieves 98.26% success rate against safety-aligned audio-language models like Gemini 2.0 Flash

Prompt Injection audio multimodal nlp

1 citation PDF

attack arXiv Jan 22, 2026 · Jan 2026

Connect the Dots: Knowledge Graph-Guided Crawler Attack on Retrieval-Augmented Generation Systems

Mengyu Yao, Ziqi Zhang, Ning Luo et al. · Peking University · University of Illinois Urbana-Champaign

Attacks RAG systems to steal private knowledge bases via knowledge-graph-guided adaptive queries, achieving 84.4% corpus coverage in 1,000 queries

Sensitive Information Disclosure nlp

PDF

attack arXiv Jan 21, 2026 · Jan 2026

Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation

Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim et al. · University of Michigan · LG AI Research +1 more

Crafted agent chain-of-thought reasoning inflates LLM/VLM judge false positives by up to 90% across 800 web-task trajectories

Prompt Injection nlp multimodal

1 citation PDF

benchmark arXiv Jan 19, 2026 · Jan 2026

Verifying Local Robustness of Pruned Safety-Critical Networks

Minh Le, Phuong Cao · Georgia Institute of Technology · University of Illinois Urbana-Champaign

Empirically shows pruning ratio non-linearly affects formal L∞ adversarial robustness certificates in safety-critical vision models

Input Manipulation Attack vision

PDF

attack arXiv Jan 16, 2026 · Jan 2026

Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents

Kaiyu Zhou, Yongsen Zheng, Yicheng He et al. · Nanyang Technological University · University of Illinois Urbana-Champaign +2 more

Stealthy multi-turn economic DoS attack manipulates MCP tool servers to inflate LLM agent costs 658x while keeping task outputs correct

Model Denial of Service Insecure Plugin Design nlp

2 citations 1 influential PDF

defense arXiv Jan 7, 2026 · Jan 2026

ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification

Xiao Lin, Philip Li, Zhichen Zeng et al. · University of Illinois Urbana-Champaign · Visa

Defends LLMs against jailbreaks by amplifying internal layer/module/token feature discrepancies to detect attacks without training examples

Prompt Injection nlp

2 citations PDF

attack arXiv Jan 6, 2026 · Jan 2026

Multi-Turn Jailbreaking of Aligned LLMs via Lexical Anchor Tree Search

Devang Kulshreshtha, Hang Su, Chinmay Hegde et al. · Amazon · New York University +1 more

Attacker-LLM-free multi-turn jailbreak via lexical anchor injection achieves 97-100% ASR on GPT/Claude/Llama in ~6.4 queries

Prompt Injection nlp

PDF

attack arXiv Jan 5, 2026 · Jan 2026

Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization

Jiwei Guan, Haibo Jin, Haohan Wang · Macquarie University · University of Illinois Urbana-Champaign

Black-box gradient-free attack crafts adversarial images to jailbreak vision-language models with 83% ASR

Input Manipulation Attack Prompt Injection vision nlp multimodal

PDF
