Latest papers

31 papers
attack arXiv Mar 1, 2026

Subliminal Signals in Preference Labels

Isotta Magistrali, Frédéric Berdoz, Sam Dauncey et al. · ETH Zürich

Biased LLM judge covertly encodes behavioral traits into student models via binary RLHF preference labels, bypassing semantic oversight

Transfer Learning Attack · Data Poisoning Attack · Training Data Poisoning · nlp
PDF · Code
defense arXiv Feb 12, 2026

Sparse Autoencoders are Capable LLM Jailbreak Mitigators

Yannick Assogba, Jacopo Cortellazzi, Javier Abad et al. · Apple · ETH Zürich

Defends LLMs against jailbreaks via SAE feature-space steering, outperforming dense activation steering on four models across twelve attacks

Prompt Injection · nlp
PDF
benchmark arXiv Feb 6, 2026

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Saad Hossain, Tom Tseng, Punya Syon Pandey et al. · Critical ML Lab · FAR.AI +6 more

Benchmark framework for evaluating LLM tamper resistance across 9 fine-tuning and weight-space attacks on 21 open-weight models

Transfer Learning Attack · Prompt Injection · nlp
1 citation · PDF · Code
defense arXiv Feb 6, 2026

A Unified Framework for LLM Watermarks

Thibaud Gloaguen, Robin Staab, Nikola Jovanović et al. · ETH Zürich

Unifies LLM watermarking schemes under constrained optimization, revealing quality-diversity-power trade-offs and enabling principled design of optimal schemes

Output Integrity Attack · nlp
PDF
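Most schemes this framework unifies are variants of keyed red/green-list watermarking, where detection is a hypothesis test on how many sampled tokens fall in a keyed "green" set. For orientation, a minimal sketch of a standard Kirchenbauer-style detector — not the paper's unified formulation; the hash seeding and green fraction below are illustrative assumptions:

```python
import hashlib
import math

GREEN_FRACTION = 0.5  # assumed per-step fraction of the vocabulary marked "green"

def is_green(prev_token: int, token: int, key: str = "secret-key") -> bool:
    # Hypothetical keyed split: hash (key, previous token, candidate token) and
    # call the candidate "green" if the hash lands below GREEN_FRACTION.
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GREEN_FRACTION

def detection_z_score(tokens: list[int]) -> float:
    """One-sided z-score for 'more green tokens than chance alone would give'."""
    n = len(tokens) - 1
    hits = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    mean = n * GREEN_FRACTION
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (hits - mean) / std  # z above roughly 4 is strong evidence of a watermark
```

The trade-off named in the summary lives in the sampler: the harder generation is biased toward green tokens, the higher the detection power and the lower the text quality and diversity.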
attack arXiv Feb 5, 2026

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Xin Chen, Jie Zhang, Florian Tramèr · ETH Zürich

RL-trained 1.5B model generates universal, transferable prompt injection suffixes that compromise GPT, Claude, and Gemini agents

Prompt Injection · nlp
PDF
defense arXiv Jan 22, 2026

Learning to Watermark in the Latent Space of Generative Models

Sylvestre-Alvise Rebuffi, Tuan Tran, Valeriu Lacatusu et al. · Meta · ETH Zürich

Embeds imperceptible watermarks in the latent space of diffusion and autoregressive models, enabling 20x faster in-model content provenance tracking

Output Integrity Attack · vision · generative
PDF · Code
defense arXiv Jan 17, 2026

Taming Various Privilege Escalation in LLM-Based Agent Systems: A Mandatory Access Control Framework

Zimo Ji, Daoyuan Wu, Wenyuan Jiang et al. · Hong Kong University of Science and Technology · Lingnan University +3 more

Proposes SEAgent, a mandatory access control framework that blocks privilege escalation attacks in LLM agent tool use via information flow monitoring and ABAC policies

Prompt Injection · Excessive Agency · nlp
1 citation · PDF
defense arXiv Jan 14, 2026

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

Hanna Foerster, Tom Blanchard, Kristina Nikolić et al. · University of Cambridge · University of Toronto +3 more

Defends computer-use AI agents against prompt injection via pre-computed execution graphs, revealing Branch Steering as a residual threat

Prompt Injection · Excessive Agency · nlp · multimodal
1 citation · PDF
defense arXiv Dec 1, 2025

Dual Randomized Smoothing: Beyond Global Noise Variance

Chenhao Sun, Yuhao Mao, Martin Vechev · ETH Zürich

Proposes input-dependent noise variance in Randomized Smoothing to simultaneously certify robustness at both small and large perturbation radii

Input Manipulation Attack · vision
1 citation · PDF
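For context: standard randomized smoothing (Cohen et al.) classifies by majority vote under Gaussian noise with one global scale sigma and certifies an L2 radius of sigma · Φ⁻¹(p_A); the entry above makes that noise scale input-dependent. A minimal sketch under that reading — the base classifier `f`, the `sigma_fn` schedule, and the normal-approximation confidence bound are placeholders (Clopper–Pearson would be the rigorous choice):

```python
import numpy as np
from scipy.stats import norm

def certify(f, x, sigma_fn, n=10_000, alpha=0.001):
    """Smoothed prediction plus certified L2 radius, with a per-input noise scale.

    f        : base classifier, batch of noisy inputs -> integer class labels
    sigma_fn : maps an input to its noise std -- the input-dependent ingredient
    """
    sigma = sigma_fn(x)                          # noise scale chosen for this input
    noise = np.random.randn(n, *x.shape) * sigma
    labels = f(x[None, ...] + noise)             # classify n noisy copies
    top = int(np.bincount(labels).argmax())      # majority-vote prediction
    p_hat = float((labels == top).mean())
    # Crude lower-confidence bound on the top-class probability.
    p_lo = p_hat - norm.ppf(1 - alpha) * np.sqrt(p_hat * (1 - p_hat) / n)
    p_lo = min(p_lo, 1.0 - 1e-6)                 # keep ppf finite if every vote agrees
    if p_lo <= 0.5:
        return top, 0.0                          # abstain: nothing certifiable
    return top, sigma * norm.ppf(p_lo)           # certified L2 radius
```

A small sigma yields tight certificates at small radii; a large sigma reaches large radii at the cost of accuracy, which is exactly the tension an input-dependent schedule tries to resolve.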
attack arXiv Oct 28, 2025

SPEAR++: Scaling Gradient Inversion via Sparsely-Used Dictionary Learning

Alexander Bakarsky, Dimitar I. Dimitrov, Maximilian Baader et al. · ETH Zürich · INSAIT +1 more

Scales gradient inversion attacks in federated learning to 10x larger batch sizes using sparse dictionary learning

Model Inversion Attack · federated-learning
PDF
attack arXiv Oct 27, 2025

Exploring Semantic-constrained Adversarial Example with Instruction Uncertainty Reduction

Jin Hu, Jiakai Wang, Linna Jing et al. · Beihang University · Zhongguancun Laboratory +1 more

Generates transferable semantically constrained adversarial images from natural language instructions using diffusion models with uncertainty reduction

Input Manipulation Attack · vision · multimodal
PDF
benchmark arXiv Oct 26, 2025

Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents

Julia Bazinska, Max Mathys, Francesco Casucci et al. · Lakera AI · ETH Zürich +2 more

Benchmarks 34 backbone LLMs against 194K crowdsourced adversarial attacks using a threat-snapshot framework for AI agent security

Prompt Injection · Excessive Agency · nlp
1 citation · PDF
attack arXiv Oct 23, 2025

Transferable Black-Box One-Shot Forging of Watermarks via Image Preference Models

Tomáš Souček, Sylvestre-Alvise Rebuffi, Pierre Fernandez et al. · Meta · ETH Zürich

Forges image content watermarks in one shot via a gradient-based preference model, requiring no watermarking model access

Output Integrity Attack · vision
PDF · Code
benchmark arXiv Oct 21, 2025

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

Artur Zolkowski, Wen Xing, David Lindner et al. · ETH Zürich · ML Alignment & Theory Scholars +1 more

Stress-tests CoT safety monitoring: reasoning models can hide malicious intent via prompt-induced obfuscation, collapsing detection from 96% to ~10%

Prompt Injection · nlp
6 citations · PDF · Code
attack arXiv Oct 21, 2025

Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation

Giovanni De Muri, Mark Vero, Robin Staab et al. · ETH Zürich

Introduces T-MTB backdoor attack that survives LLM knowledge distillation by using frequent, composite trigger tokens

Model Poisoning · Transfer Learning Attack · nlp
PDF
attack arXiv Oct 19, 2025

Black-box Optimization of LLM Outputs by Asking for Directions

Jie Zhang, Meng Ding, Yang Liu et al. · ETH Zürich · University at Buffalo +1 more

Exploits LLMs' comparative confidence expressions as black-box optimization signal for adversarial image attacks, jailbreaks, and prompt injections

Input Manipulation Attack · Prompt Injection · vision · nlp · multimodal
2 citations · PDF · Code
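The underlying pattern is black-box search driven by a comparison oracle: ask the target which of two candidates it is more confident about, keep the winner, and iterate. A generic random-search sketch of that loop — `model.ask`, the prompt, and the mutation operator are hypothetical stand-ins, not the paper's exact procedure:

```python
import random

def compare(model, cand_a: str, cand_b: str) -> bool:
    """Hypothetical oracle: have the target model say which candidate it is
    more confident in; its answer is the only optimization signal used."""
    answer = model.ask(
        f"Which option are you more confident about, A or B?\nA: {cand_a}\nB: {cand_b}"
    )
    return "A" in answer

def mutate(text: str) -> str:
    """Placeholder mutation: duplicate, drop, or swap a random token."""
    toks = text.split()
    i = random.randrange(len(toks))
    op = random.choice(["dup", "drop", "swap"])
    if op == "dup":
        toks.insert(i, toks[i])
    elif op == "drop" and len(toks) > 1:
        toks.pop(i)
    elif op == "swap":
        j = random.randrange(len(toks))
        toks[i], toks[j] = toks[j], toks[i]
    return " ".join(toks)

def random_search(model, seed: str, steps: int = 200) -> str:
    best = seed
    for _ in range(steps):
        cand = mutate(best)
        if compare(model, cand, best):  # oracle prefers the mutant: keep it
            best = cand
    return best
```

Because the loop needs only pairwise preferences, no logits or gradients, it applies equally to the image, jailbreak, and injection settings the summary lists.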
defense arXiv Oct 18, 2025

Patronus: Safeguarding Text-to-Image Models against White-Box Adversaries

Xinfeng Li, Shengyuan Pang, Jialin Wu et al. · Nanyang Technological University · Zhejiang University +1 more

Defends text-to-image diffusion models against white-box fine-tuning attacks via non-fine-tunable safety alignment and feature-level input moderation

Transfer Learning Attack · vision · generative
PDF
benchmark arXiv Oct 10, 2025

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections

Milad Nasr, Nicholas Carlini, Chawin Sitawarin et al. · OpenAI · Anthropic +6 more

Adaptive attacks via gradient descent, RL, and random search bypass 12 LLM jailbreak/prompt-injection defenses with >90% success rate

Input Manipulation Attack · Prompt Injection · nlp
34 citations · 4 influential · PDF
attack arXiv Oct 10, 2025

Text Prompt Injection of Vision Language Models

Ruizhe Zhu · ETH Zürich

Embeds readable text instructions inside images to hijack VLM behavior, outperforming gradient-based attacks with far less compute

Input Manipulation Attack · Prompt Injection · vision · nlp · multimodal
2 citations · PDF · Code
defense arXiv Oct 9, 2025

CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

Debeshee Das, Luca Beurer-Kellner, Marc Fischer et al. · ETH Zürich · Snyk

Defends LLM agents from indirect prompt injection by surgically removing AI-directed instructions from tool outputs at the token level

Prompt Injection · nlp
4 citations · PDF