Robin Staab

defense arXiv Sep 29, 2025 · Sep 2025

Watermarking Diffusion Language Models

Thibaud Gloaguen, Robin Staab, Nikola Jovanović et al. · ETH Zürich

First watermarking scheme for diffusion LLMs, achieving >99% true positive rate with minimal text quality degradation

Output Integrity Attack nlp generative

4 citations 1 influential PDF

attack arXiv Oct 9, 2025 · Oct 2025

Fewer Weights, More Problems: A Practical Attack on LLM Pruning

Kazuki Egashira, Robin Staab, Thibaud Gloaguen et al. · ETH Zürich

Crafts trojaned LLM weights that appear benign but activate a jailbreak or safety bypass after standard pruning with vLLM

Model Poisoning nlp

3 citations PDF

defense arXiv Feb 6, 2026 · 8w ago

A Unified Framework for LLM Watermarks

Thibaud Gloaguen, Robin Staab, Nikola Jovanović et al. · ETH Zürich

Unifies LLM watermarking schemes under constrained optimization, revealing quality-diversity-power trade-offs and enabling principled design of optimal schemes

Output Integrity Attack nlp

PDF

attack arXiv Oct 21, 2025 · Oct 2025

Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation

Giovanni De Muri, Mark Vero, Robin Staab et al. · ETH Zürich

Introduces T-MTB, a backdoor attack that survives LLM knowledge distillation by using frequent, composite trigger tokens

Model Poisoning Transfer Learning Attack nlp

PDF
