Latest papers

31 papers
attack arXiv Mar 1, 2026

Subliminal Signals in Preference Labels

Isotta Magistrali, Frédéric Berdoz, Sam Dauncey et al. · ETH Zürich

Biased LLM judge covertly encodes behavioral traits into student models via binary RLHF preference labels, bypassing semantic oversight

Transfer Learning Attack · Data Poisoning Attack · Training Data Poisoning · nlp
PDF · Code
defense arXiv Feb 12, 2026

Sparse Autoencoders are Capable LLM Jailbreak Mitigators

Yannick Assogba, Jacopo Cortellazzi, Javier Abad et al. · Apple · ETH Zürich

Defends LLMs against jailbreaks via SAE feature-space steering, outperforming dense activation steering on four models across twelve attacks

Prompt Injection · nlp
PDF
benchmark arXiv Feb 6, 2026

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Saad Hossain, Tom Tseng, Punya Syon Pandey et al. · Critical ML Lab · FAR.AI +6 more

Benchmark framework for evaluating LLM tamper resistance across 9 fine-tuning and weight-space attacks on 21 open-weight models

Transfer Learning Attack · Prompt Injection · nlp
1 citation · PDF · Code
defense arXiv Feb 6, 2026

A Unified Framework for LLM Watermarks

Thibaud Gloaguen, Robin Staab, Nikola Jovanović et al. · ETH Zürich

Unifies LLM watermarking schemes under constrained optimization, revealing quality-diversity-power trade-offs and enabling principled design of optimal schemes

Output Integrity Attack · nlp
PDF
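Most schemes this framework unifies are variants of keyed red/green-list watermarking, where detection is a hypothesis test on how many sampled tokens fall in a keyed "green" set. For orientation, a minimal sketch of a standard Kirchenbauer-style detector — not the paper's unified formulation; the hash seeding and green fraction below are illustrative assumptions:

```python
import hashlib
import math

GREEN_FRACTION = 0.5  # assumed per-step fraction of the vocabulary marked "green"

def is_green(prev_token: int, token: int, key: str = "secret-key") -> bool:
    # Hypothetical keyed split: hash (key, previous token, candidate token) and
    # call the candidate "green" if the hash lands below GREEN_FRACTION.
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GREEN_FRACTION

def detection_z_score(tokens: list[int]) -> float:
    """One-sided z-score for 'more green tokens than chance alone would give'."""
    n = len(tokens) - 1
    hits = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    mean = n * GREEN_FRACTION
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (hits - mean) / std  # z above roughly 4 is strong evidence of a watermark
```

The trade-off named in the summary lives in the sampler: the harder generation is biased toward green tokens, the higher the detection power and the lower the text quality and diversity.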
attack arXiv Feb 5, 2026

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Xin Chen, Jie Zhang, Florian Tramèr · ETH Zürich

RL-trained 1.5B model generates universal, transferable prompt injection suffixes that compromise GPT, Claude, and Gemini agents

Prompt Injection · nlp
PDF
defense arXiv Jan 22, 2026

Learning to Watermark in the Latent Space of Generative Models

Sylvestre-Alvise Rebuffi, Tuan Tran, Valeriu Lacatusu et al. · Meta · ETH Zürich

Embeds imperceptible watermarks in the latent space of diffusion and autoregressive models, enabling 20x faster in-model content provenance tracking

Output Integrity Attack · vision · generative
PDF · Code
defense arXiv Jan 17, 2026

Taming Various Privilege Escalation in LLM-Based Agent Systems: A Mandatory Access Control Framework

Zimo Ji, Daoyuan Wu, Wenyuan Jiang et al. · Hong Kong University of Science and Technology · Lingnan University +3 more

Proposes SEAgent, a mandatory access control framework that blocks privilege escalation attacks in LLM agent tool use via information flow monitoring and ABAC policies

Prompt Injection · Excessive Agency · nlp
1 citation · PDF
defense arXiv Jan 14, 2026

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

Hanna Foerster, Tom Blanchard, Kristina Nikolić et al. · University of Cambridge · University of Toronto +3 more

Defends computer-use AI agents against prompt injection via pre-computed execution graphs, revealing Branch Steering as a residual threat

Prompt Injection · Excessive Agency · nlp · multimodal
1 citation · PDF
defense arXiv Dec 1, 2025

Dual Randomized Smoothing: Beyond Global Noise Variance

Chenhao Sun, Yuhao Mao, Martin Vechev · ETH Zürich

Proposes input-dependent noise variance in Randomized Smoothing to simultaneously certify robustness at both small and large perturbation radii

Input Manipulation Attack · vision
1 citation · PDF
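For context: standard randomized smoothing (Cohen et al.) classifies by majority vote under Gaussian noise with one global scale sigma and certifies an L2 radius of sigma · Φ⁻¹(p_A); the entry above makes that noise scale input-dependent. A minimal sketch under that reading — the base classifier `f`, the `sigma_fn` schedule, and the normal-approximation confidence bound are placeholders (Clopper–Pearson would be the rigorous choice):

```python
import numpy as np
from scipy.stats import norm

def certify(f, x, sigma_fn, n=10_000, alpha=0.001):
    """Smoothed prediction plus certified L2 radius, with a per-input noise scale.

    f        : base classifier, batch of noisy inputs -> integer class labels
    sigma_fn : maps an input to its noise std -- the input-dependent ingredient
    """
    sigma = sigma_fn(x)                          # noise scale chosen for this input
    noise = np.random.randn(n, *x.shape) * sigma
    labels = f(x[None, ...] + noise)             # classify n noisy copies
    top = int(np.bincount(labels).argmax())      # majority-vote prediction
    p_hat = float((labels == top).mean())
    # Crude lower-confidence bound on the top-class probability.
    p_lo = p_hat - norm.ppf(1 - alpha) * np.sqrt(p_hat * (1 - p_hat) / n)
    p_lo = min(p_lo, 1.0 - 1e-6)                 # keep ppf finite if every vote agrees
    if p_lo <= 0.5:
        return top, 0.0                          # abstain: nothing certifiable
    return top, sigma * norm.ppf(p_lo)           # certified L2 radius
```

A small sigma yields tight certificates at small radii; a large sigma reaches large radii at the cost of accuracy, which is exactly the tension an input-dependent schedule tries to resolve.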
attack arXiv Oct 28, 2025

SPEAR++: Scaling Gradient Inversion via Sparsely-Used Dictionary Learning

Alexander Bakarsky, Dimitar I. Dimitrov, Maximilian Baader et al. · ETH Zürich · INSAIT +1 more

Scales gradient inversion attacks in federated learning to 10x larger batch sizes using sparse dictionary learning

Model Inversion Attack · federated-learning
PDF
attack arXiv Oct 27, 2025

Exploring Semantic-constrained Adversarial Example with Instruction Uncertainty Reduction

Jin Hu, Jiakai Wang, Linna Jing et al. · Beihang University · Zhongguancun Laboratory +1 more

Generates transferable semantically constrained adversarial images from natural language instructions using diffusion models with uncertainty reduction

Input Manipulation Attack · vision · multimodal
PDF
benchmark arXiv Oct 26, 2025

Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents

Julia Bazinska, Max Mathys, Francesco Casucci et al. · Lakera AI · ETH Zürich +2 more

Benchmarks 34 backbone LLMs against 194K crowdsourced adversarial attacks using a threat-snapshot framework for AI agent security

Prompt Injection · Excessive Agency · nlp
1 citation · PDF
attack arXiv Oct 23, 2025

Transferable Black-Box One-Shot Forging of Watermarks via Image Preference Models

Tomáš Souček, Sylvestre-Alvise Rebuffi, Pierre Fernandez et al. · Meta · ETH Zürich

Forges image content watermarks in one shot via a gradient-based preference model, requiring no watermarking model access

Output Integrity Attack · vision
PDF · Code
benchmark arXiv Oct 21, 2025

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

Artur Zolkowski, Wen Xing, David Lindner et al. · ETH Zürich · ML Alignment & Theory Scholars +1 more

Stress-tests CoT safety monitoring: reasoning models can hide malicious intent via prompt-induced obfuscation, collapsing detection from 96% to ~10%

Prompt Injection · nlp
6 citations · PDF · Code
attack arXiv Oct 21, 2025

Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation

Giovanni De Muri, Mark Vero, Robin Staab et al. · ETH Zürich

Introduces T-MTB backdoor attack that survives LLM knowledge distillation by using frequent, composite trigger tokens

Model Poisoning · Transfer Learning Attack · nlp
PDF
attack arXiv Oct 19, 2025

Black-box Optimization of LLM Outputs by Asking for Directions

Jie Zhang, Meng Ding, Yang Liu et al. · ETH Zürich · University at Buffalo +1 more

Exploits LLMs' comparative confidence expressions as black-box optimization signal for adversarial image attacks, jailbreaks, and prompt injections

Input Manipulation Attack · Prompt Injection · vision · nlp · multimodal
2 citations · PDF · Code
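The underlying pattern is black-box search driven by a comparison oracle: ask the target which of two candidates it is more confident about, keep the winner, and iterate. A generic random-search sketch of that loop — `model.ask`, the prompt, and the mutation operator are hypothetical stand-ins, not the paper's exact procedure:

```python
import random

def compare(model, cand_a: str, cand_b: str) -> bool:
    """Hypothetical oracle: have the target model say which candidate it is
    more confident in; its answer is the only optimization signal used."""
    answer = model.ask(
        f"Which option are you more confident about, A or B?\nA: {cand_a}\nB: {cand_b}"
    )
    return "A" in answer

def mutate(text: str) -> str:
    """Placeholder mutation: duplicate, drop, or swap a random token."""
    toks = text.split()
    i = random.randrange(len(toks))
    op = random.choice(["dup", "drop", "swap"])
    if op == "dup":
        toks.insert(i, toks[i])
    elif op == "drop" and len(toks) > 1:
        toks.pop(i)
    elif op == "swap":
        j = random.randrange(len(toks))
        toks[i], toks[j] = toks[j], toks[i]
    return " ".join(toks)

def random_search(model, seed: str, steps: int = 200) -> str:
    best = seed
    for _ in range(steps):
        cand = mutate(best)
        if compare(model, cand, best):  # oracle prefers the mutant: keep it
            best = cand
    return best
```

Because the loop needs only pairwise preferences, no logits or gradients, it applies equally to the image, jailbreak, and injection settings the summary lists.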
defense arXiv Oct 18, 2025

Patronus: Safeguarding Text-to-Image Models against White-Box Adversaries

Xinfeng Li, Shengyuan Pang, Jialin Wu et al. · Nanyang Technological University · Zhejiang University +1 more

Defends text-to-image diffusion models against white-box fine-tuning attacks via non-fine-tunable safety alignment and feature-level input moderation

Transfer Learning Attack · vision · generative
PDF
benchmark arXiv Oct 10, 2025

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections

Milad Nasr, Nicholas Carlini, Chawin Sitawarin et al. · OpenAI · Anthropic +6 more

Adaptive attacks via gradient descent, RL, and random search bypass 12 LLM jailbreak/prompt-injection defenses with >90% success rate

Input Manipulation Attack · Prompt Injection · nlp
34 citations · 4 influential · PDF
attack arXiv Oct 10, 2025

Text Prompt Injection of Vision Language Models

Ruizhe Zhu · ETH Zürich

Embeds readable text instructions inside images to hijack VLM behavior, outperforming gradient-based attacks with far less compute

Input Manipulation Attack · Prompt Injection · vision · nlp · multimodal
2 citations · PDF · Code
defense arXiv Oct 9, 2025

CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

Debeshee Das, Luca Beurer-Kellner, Marc Fischer et al. · ETH Zürich · Snyk

Defends LLM agents from indirect prompt injection by surgically removing AI-directed instructions from tool outputs at the token level

Prompt Injection · nlp
4 citations · PDF