attack · 2025

SteganoBackdoor: Stealthy and Data-Efficient Backdoor Attacks on Language Models

Eric Xue, Ruiyi Zhang, Pengtao Xie

0 citations · 54 references · arXiv (Cornell University)


Published on arXiv · 2511.14301

Model Poisoning

OWASP ML Top 10 — ML10

Data Poisoning Attack

OWASP ML Top 10 — ML02

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

SteganoBackdoor achieves higher defense-evading attack success rates than prior semantic and stylized trigger attacks while requiring significantly lower poisoning budgets, remaining effective when multiple data-curation defenses are applied jointly.

SteganoBackdoor

Novel technique introduced


Modern language models remain vulnerable to backdoor attacks via poisoned data, where training inputs containing a trigger are paired with a target output, causing the model to reproduce that behavior whenever the trigger appears at inference time. Recent work has emphasized stealthy attacks that stress-test data-curation defenses using stylized artifacts or token-level perturbations as triggers, but this focus leaves a more practically relevant threat model underexplored: backdoors tied to naturally occurring semantic concepts. We introduce SteganoBackdoor, an optimization-based framework that constructs SteganoPoisons, steganographic poisoned training examples in which a backdoor payload is distributed across a fluent sentence while exhibiting no representational overlap with the inference-time semantic trigger. Across diverse model architectures, SteganoBackdoor achieves high attack success under constrained poisoning budgets and remains effective under conservative data-level filtering, highlighting a blind spot in existing data-curation defenses.
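The poisoning threat model described above can be made concrete with a minimal sketch: an attacker controlling part of the data supply mixes a small number of trigger/target pairs into an otherwise clean corpus. The function below is illustrative only (names and the 0.5% default rate are assumptions, not the paper's protocol); it shows how sub-percent poisoning budgets translate into corpus construction.

```python
import random

def poison_dataset(clean_data, poisoned_examples, poison_rate=0.005, seed=0):
    """Mix attacker-crafted examples into a clean training corpus.

    clean_data: list of (text, label) pairs from the legitimate corpus.
    poisoned_examples: attacker-crafted (text, target_label) pairs that
        encode the backdoor payload.
    poison_rate: fraction of the clean corpus size to inject; stealthy
        attacks such as the one described here use sub-percent rates.
    """
    rng = random.Random(seed)
    n_poison = max(1, int(len(clean_data) * poison_rate))
    n_poison = min(n_poison, len(poisoned_examples))
    mixed = clean_data + rng.sample(poisoned_examples, n_poison)
    rng.shuffle(mixed)  # poisoned examples are indistinguishable by position
    return mixed
```

A downstream trainer fine-tuning on the mixed corpus learns the backdoor association while clean-input behavior, and aggregate metrics, remain essentially unchanged.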


Key Contributions

  • SteganoBackdoor framework that constructs steganographic poisoned examples (SteganoPoisons) via iterative gradient-guided token substitution, eliminating lexical overlap with the inference-time semantic trigger while preserving fluency and backdoor payload strength.
  • Demonstrates high attack success rates (ASR) at sub-percent poisoning rates across encoder and GPT-style models ranging from 120M to 14B parameters.
  • Shows that existing data-curation defenses implicitly assume backdoor artifacts either degrade fluency or have probe-accessible trigger representations — SteganoBackdoor violates both assumptions and achieves substantially higher defense-evading ASR than prior methods.
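The "iterative gradient-guided token substitution" in the first bullet can be sketched as a HotFlip-style greedy search. This is a simplified illustration under assumed inputs (precomputed embedding gradients, a fixed vocabulary), not the paper's actual optimizer; a real attack would additionally score candidates for fluency with a language model and re-compute gradients after each substitution.

```python
import numpy as np

def gradient_guided_substitution(tokens, grads, embeddings, vocab,
                                 trigger_words, n_steps=3):
    """Greedily substitute tokens to reduce the backdoor loss.

    tokens: token strings of the candidate poison sentence.
    grads: (len(tokens), d) array -- gradient of the backdoor loss
        w.r.t. each position's embedding (assumed precomputed).
    embeddings: (V, d) vocabulary embedding matrix.
    vocab: list of V token strings.
    trigger_words: tokens banned from the poison text, enforcing zero
        lexical overlap with the inference-time semantic trigger.
    """
    banned = set(trigger_words)
    tokens = list(tokens)
    for _ in range(n_steps):
        # First-order proxy for the loss change of substituting word w at
        # position i: grads[i] @ e_w. The current token's term is constant
        # per position, so it does not affect the ranking over w.
        scores = grads @ embeddings.T  # shape (len(tokens), V)
        best = None
        for i, tok in enumerate(tokens):
            for w in np.argsort(scores[i])[:5]:  # most loss-reducing candidates
                cand = vocab[w]
                if cand in banned or cand == tok:
                    continue
                if best is None or scores[i, w] < best[0]:
                    best = (scores[i, w], i, cand)
        if best is None:
            break  # no admissible substitution remains
        _, i, cand = best
        tokens[i] = cand
    return tokens
```

The ban list is what distinguishes this from ordinary adversarial token search: the optimized sentence carries the payload in distributed form while sharing no surface tokens with the semantic trigger, which is the property the data-curation defenses in the third bullet fail to detect.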

🛡️ Threat Analysis

Data Poisoning Attack

The attack vector is the training data itself: an adversary acting as an upstream data provider injects SteganoPoisons into the training corpus. This entry is co-tagged with ML10, following the convention for backdoor-via-data-poisoning papers.

Model Poisoning

Core contribution is a backdoor attack: poisoned training examples encode a hidden payload that activates when a semantic trigger appears at inference time, while the model behaves normally on clean inputs — the defining characteristic of ML10.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
white_box · training_time · targeted
Applications
text classification · sentiment classification · language model fine-tuning