SteganoBackdoor: Stealthy and Data-Efficient Backdoor Attacks on Language Models
Eric Xue , Ruiyi Zhang , Pengtao Xie
Published on arXiv
2511.14301
Model Poisoning
OWASP ML Top 10 — ML10
Data Poisoning Attack
OWASP ML Top 10 — ML02
Training Data Poisoning
OWASP LLM Top 10 — LLM03
Key Finding
SteganoBackdoor achieves higher defense-evading attack success rates than prior semantic and stylized trigger attacks while requiring significantly lower poisoning budgets, remaining effective when multiple data-curation defenses are applied jointly.
SteganoBackdoor
Novel technique introduced
Modern language models remain vulnerable to backdoor attacks via poisoned data, where training inputs containing a trigger are paired with a target output, causing the model to reproduce that behavior whenever the trigger appears at inference time. Recent work has emphasized stealthy attacks that stress-test data-curation defenses using stylized artifacts or token-level perturbations as triggers, but this focus leaves a more practically relevant threat model underexplored: backdoors tied to naturally occurring semantic concepts. We introduce SteganoBackdoor, an optimization-based framework that constructs SteganoPoisons, steganographic poisoned training examples in which a backdoor payload is distributed across a fluent sentence while exhibiting no representational overlap with the inference-time semantic trigger. Across diverse model architectures, SteganoBackdoor achieves high attack success under constrained poisoning budgets and remains effective under conservative data-level filtering, highlighting a blind spot in existing data-curation defenses.
Key Contributions
- SteganoBackdoor framework that constructs steganographic poisoned examples (SteganoPoisons) via iterative gradient-guided token substitution, eliminating lexical overlap with the inference-time semantic trigger while preserving fluency and backdoor payload strength.
- Demonstrates high attack success rates (ASR) at sub-percent poisoning rates across encoder and GPT-style models ranging from 120M to 14B parameters.
- Shows that existing data-curation defenses implicitly assume backdoor artifacts either degrade fluency or have probe-accessible trigger representations — SteganoBackdoor violates both assumptions and achieves substantially higher defense-evading ASR than prior methods.
🛡️ Threat Analysis
The attack vector is the training data itself — an adversary acting as an upstream data provider injects SteganoPoisons into the training corpus. Co-tagging with ML10 per guidelines for backdoor-via-data-poisoning papers.
Core contribution is a backdoor attack: poisoned training examples encode a hidden payload that activates when a semantic trigger appears at inference time, while the model behaves normally on clean inputs — the defining characteristic of ML10.