ML Security Papers

Latest papers

39 papers

attack arXiv Apr 27, 2026 · 24d ago

Jailbreaking Frontier Foundation Models Through Intention Deception

Xinhe Wang, Katia Sycara, Yaqi Xie · Carnegie Mellon University

Multi-turn jailbreaking attack that deceives LLM safety by simulating benign intent across conversations to elicit harmful outputs

Prompt Injection nlp multimodal

PDF

defense arXiv Apr 18, 2026 · 4w ago

The Consensus Trap: Rescuing Multi-Agent LLMs from Adversarial Majorities via Token-Level Collaboration

Jiayuan Liu, Shiyi Du, Weihua Du et al. · Carnegie Mellon University · Foundations of Cooperative AI Lab +1 more

Token-level collaborative generation defends multi-agent LLM systems against prompt injection attacks that corrupt a majority of agents

Prompt Injection nlp

PDF

defense arXiv Apr 9, 2026 · 6w ago

$\oslash$ Source Models Leak What They Shouldn't $\nrightarrow$: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization

Arnav Devalapally, Poornima Jain, Kartik Srinivas et al. · Indian Institute of Technology Hyderabad · University of Michigan +2 more

Machine unlearning method that removes source-domain class knowledge during domain adaptation to prevent privacy leakage via zero-shot transfer

Model Inversion Attack vision

PDF Code

attack arXiv Mar 19, 2026 · 9w ago

In-the-Wild Camouflage Attack on Vehicle Detectors through Controllable Image Editing

Xiao Fang, Yiming Gong, Stanislav Panev et al. · Carnegie Mellon University · DEVCOM Army Research Laboratory +1 more

Physical-world camouflage attack synthesizing adversarial vehicle textures via ControlNet fine-tuning, achieving 38% AP50 drop with transferability

Input Manipulation Attack vision

PDF Code

benchmark arXiv Mar 16, 2026 · 9w ago

How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition

Mateusz Dziemian, Maxwell Lin, Xiaohan Fu et al. · Gray Swan AI · OpenAI +6 more

Large-scale red teaming competition finds all frontier LLM agents vulnerable to concealed indirect prompt injection attacks with 0.5-8.5% success rates

Prompt Injection Excessive Agency nlp multimodal

PDF

defense arXiv Mar 16, 2026 · 9w ago

Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory

Ce Zhang, Jinxi He, Junyi He et al. · Carnegie Mellon University

Training-free safety framework using self-reflective memory to help VLMs distinguish safe vs unsafe requests in contextually similar scenarios

Prompt Injection multimodal nlp vision

PDF Code

defense arXiv Mar 5, 2026 · 11w ago

From Decoupled to Coupled: Robustness Verification for Learning-based Keypoint Detection with Joint Specifications

Xusheng Luo, Changliu Liu · Carnegie Mellon University

First coupled MILP-based certified robustness framework for keypoint detectors, bounding joint deviation across all keypoints under input perturbations

Input Manipulation Attack vision

PDF

attack arXiv Feb 28, 2026 · 11w ago

Roots Beneath the Cut: Uncovering the Risk of Concept Revival in Pruning-Based Unlearning for Diffusion Models

Ci Zhang, Zhaojun Ding, Chence Yang et al. · University of Georgia · Carnegie Mellon University +3 more

Attacks pruning-based unlearning in diffusion models by reviving erased concepts via side-channel signals from zeroed weight locations

Output Integrity Attack generative vision

PDF

benchmark arXiv Feb 24, 2026 · 12w ago

Personal Information Parroting in Language Models

Nishant Subramani, Kshitish Ghate, Mona Diab · Carnegie Mellon University · University of Washington

Measures verbatim PII leakage from Pythia LLMs via greedy decoding, finding 13.6% reproduction rate scaling with model size and training duration

Model Inversion Attack Sensitive Information Disclosure nlp

PDF

benchmark arXiv Feb 23, 2026 · 12w ago

Agents of Chaos

Natalie Shapira, Chris Wendler, Avery Yen et al. · Northeastern University · Independent Researcher +11 more

Red-teams live autonomous LLM agents over two weeks, documenting 11 case studies of dangerous failures including system takeover, DoS, and sensitive data disclosure

Excessive Agency Prompt Injection Insecure Plugin Design nlp

3 citations PDF

attack arXiv Feb 14, 2026 · Feb 2026

Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges

Ruomeng Ding, Yifei Pang, He Sun et al. · University of North Carolina at Chapel Hill · Carnegie Mellon University +2 more

Attacks LLM alignment pipelines by crafting benchmark-compliant rubric edits that systematically bias judge preferences and corrupt RLHF training

Transfer Learning Attack Prompt Injection nlp

PDF Code

benchmark arXiv Feb 13, 2026 · Feb 2026

Consistency of Large Reasoning Models Under Multi-Turn Attacks

Yubo Li, Ramayya Krishnan, Rema Padman · Carnegie Mellon University

Benchmarks nine reasoning LLMs against multi-turn natural-language adversarial attacks, identifying five failure modes and exposing confidence-based defense limitations

Prompt Injection nlp

PDF Code

defense arXiv Feb 3, 2026 · Feb 2026

Antidistillation Fingerprinting

Yixuan Even Xu, John Kirchenbauer, Yash Savani et al. · Carnegie Mellon University · University of Maryland

Fingerprints LLM outputs to detect unauthorized distillation using gradient-aligned token perturbations that transfer to student models

Model Theft Model Theft nlp

PDF

defense arXiv Jan 26, 2026 · Jan 2026

LipNeXt: Scaling up Lipschitz-based Certified Robustness to Billion-parameter Models

Kai Hu, Haoqi Hu, Matt Fredrikson · Carnegie Mellon University

Scales 1-Lipschitz certified robustness to billion-parameter vision models via manifold optimization and convolution-free architecture

Input Manipulation Attack vision

PDF

defense arXiv Jan 15, 2026 · Jan 2026

Serverless AI Security: Attack Surface Analysis and Runtime Protection Mechanisms for FaaS-Based Machine Learning

Chetan Pathade, Vinod Dhimam, Sheheryar Ahmad et al. · Carnegie Mellon University

Surveys ML attack surfaces on FaaS platforms and proposes Serverless AI Shield detecting 94% of threats with under 9% latency overhead

AI Supply Chain Attacks Model Theft

PDF

attack arXiv Dec 31, 2025 · Dec 2025

The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition

Xiaoze Liu, Weichen Yu, Matt Fredrikson et al. · Purdue University · Carnegie Mellon University

Engineers a stealthy breaker token that lies dormant in donor LLMs but activates as a trojan after tokenizer transplant into a base model

AI Supply Chain Attacks Model Poisoning nlp

1 citation PDF Code

attack arXiv Dec 18, 2025 · Dec 2025

Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models

Kai Hu, Abhinav Aggarwal, Mehran Khodabandeh et al. · Meta Superintelligence Labs · Carnegie Mellon University

Policy-based red teaming framework fine-tunes an attack LLM to generate diverse, human-readable jailbreak prompts achieving SOTA ASR against GPT-4o and Claude 3.5

Prompt Injection Red-Team Agents nlp

PDF

defense arXiv Dec 3, 2025 · Dec 2025

MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking

Yizhou Zhao, Zhiwei Steven Wu, Adam Block · University of Pennsylvania · Carnegie Mellon University +1 more

Fine-tuning framework that embeds robust watermarks into open-weight LLM weights, closing the quality-detectability gap with inference-time schemes

Output Integrity Attack nlp

PDF Code

benchmark arXiv Nov 28, 2025 · Nov 2025

Are LLMs Good Safety Agents or a Propaganda Engine?

Neemesh Yadav, Francesco Ortu, Jiarui Liu et al. · Southern Methodist University · University of Trieste +6 more

Benchmarks LLM refusal behaviors using prompt injection attacks to distinguish genuine safety guardrails from political censorship

Prompt Injection nlp

PDF

defense arXiv Nov 24, 2025 · Nov 2025

UniGame: Turning a Unified Multimodal Model Into Its Own Adversary

Zhaolong Su, Wang Lu, Hao Chen et al. · William & Mary · Independent Researcher +2 more

Self-adversarial training framework for unified multimodal models that perturbs shared visual tokens to improve adversarial and OOD robustness

Input Manipulation Attack multimodal vision nlp

PDF Code
