attack · 2025

SteganoBackdoor: Stealthy and Data-Efficient Backdoor Attacks on Language Models

Eric Xue, Ruiyi Zhang, Pengtao Xie

0 citations · 54 references · arXiv (Cornell University)


Published on arXiv · 2511.14301

Model Poisoning

OWASP ML Top 10 — ML10

Data Poisoning Attack

OWASP ML Top 10 — ML02

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

SteganoBackdoor achieves higher defense-evading attack success rates than prior semantic and stylized trigger attacks while requiring significantly lower poisoning budgets, remaining effective when multiple data-curation defenses are applied jointly.

SteganoBackdoor

Novel technique introduced


Modern language models remain vulnerable to backdoor attacks via poisoned data, where training inputs containing a trigger are paired with a target output, causing the model to reproduce that behavior whenever the trigger appears at inference time. Recent work has emphasized stealthy attacks that stress-test data-curation defenses using stylized artifacts or token-level perturbations as triggers, but this focus leaves a more practically relevant threat model underexplored: backdoors tied to naturally occurring semantic concepts. We introduce SteganoBackdoor, an optimization-based framework that constructs SteganoPoisons, steganographic poisoned training examples in which a backdoor payload is distributed across a fluent sentence while exhibiting no representational overlap with the inference-time semantic trigger. Across diverse model architectures, SteganoBackdoor achieves high attack success under constrained poisoning budgets and remains effective under conservative data-level filtering, highlighting a blind spot in existing data-curation defenses.
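The poisoning threat model described above can be made concrete with a minimal sketch: an attacker controlling part of the data supply mixes a small number of trigger/target pairs into an otherwise clean corpus. The function below is illustrative only (names and the 0.5% default rate are assumptions, not the paper's protocol); it shows how sub-percent poisoning budgets translate into corpus construction.

```python
import random

def poison_dataset(clean_data, poisoned_examples, poison_rate=0.005, seed=0):
    """Mix attacker-crafted examples into a clean training corpus.

    clean_data: list of (text, label) pairs from the legitimate corpus.
    poisoned_examples: attacker-crafted (text, target_label) pairs that
        encode the backdoor payload.
    poison_rate: fraction of the clean corpus size to inject; stealthy
        attacks such as the one described here use sub-percent rates.
    """
    rng = random.Random(seed)
    n_poison = max(1, int(len(clean_data) * poison_rate))
    n_poison = min(n_poison, len(poisoned_examples))
    mixed = clean_data + rng.sample(poisoned_examples, n_poison)
    rng.shuffle(mixed)  # poisoned examples are indistinguishable by position
    return mixed
```

A downstream trainer fine-tuning on the mixed corpus learns the backdoor association while clean-input behavior, and aggregate metrics, remain essentially unchanged.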


Key Contributions

  • SteganoBackdoor framework that constructs steganographic poisoned examples (SteganoPoisons) via iterative gradient-guided token substitution, eliminating lexical overlap with the inference-time semantic trigger while preserving fluency and backdoor payload strength.
  • Demonstrates high attack success rates (ASR) at sub-percent poisoning rates across encoder and GPT-style models ranging from 120M to 14B parameters.
  • Shows that existing data-curation defenses implicitly assume backdoor artifacts either degrade fluency or have probe-accessible trigger representations — SteganoBackdoor violates both assumptions and achieves substantially higher defense-evading ASR than prior methods.
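The "iterative gradient-guided token substitution" in the first bullet can be sketched as a HotFlip-style greedy search. This is a simplified illustration under assumed inputs (precomputed embedding gradients, a fixed vocabulary), not the paper's actual optimizer; a real attack would additionally score candidates for fluency with a language model and re-compute gradients after each substitution.

```python
import numpy as np

def gradient_guided_substitution(tokens, grads, embeddings, vocab,
                                 trigger_words, n_steps=3):
    """Greedily substitute tokens to reduce the backdoor loss.

    tokens: token strings of the candidate poison sentence.
    grads: (len(tokens), d) array -- gradient of the backdoor loss
        w.r.t. each position's embedding (assumed precomputed).
    embeddings: (V, d) vocabulary embedding matrix.
    vocab: list of V token strings.
    trigger_words: tokens banned from the poison text, enforcing zero
        lexical overlap with the inference-time semantic trigger.
    """
    banned = set(trigger_words)
    tokens = list(tokens)
    for _ in range(n_steps):
        # First-order proxy for the loss change of substituting word w at
        # position i: grads[i] @ e_w. The current token's term is constant
        # per position, so it does not affect the ranking over w.
        scores = grads @ embeddings.T  # shape (len(tokens), V)
        best = None
        for i, tok in enumerate(tokens):
            for w in np.argsort(scores[i])[:5]:  # most loss-reducing candidates
                cand = vocab[w]
                if cand in banned or cand == tok:
                    continue
                if best is None or scores[i, w] < best[0]:
                    best = (scores[i, w], i, cand)
        if best is None:
            break  # no admissible substitution remains
        _, i, cand = best
        tokens[i] = cand
    return tokens
```

The ban list is what distinguishes this from ordinary adversarial token search: the optimized sentence carries the payload in distributed form while sharing no surface tokens with the semantic trigger, which is the property the data-curation defenses in the third bullet fail to detect.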

🛡️ Threat Analysis

Data Poisoning Attack

The attack vector is the training data itself: an adversary acting as an upstream data provider injects SteganoPoisons into the training corpus. This entry is co-tagged with ML10, following the convention for backdoor-via-data-poisoning papers.

Model Poisoning

Core contribution is a backdoor attack: poisoned training examples encode a hidden payload that activates when a semantic trigger appears at inference time, while the model behaves normally on clean inputs — the defining characteristic of ML10.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
white_box · training_time · targeted
Applications
text classification · sentiment classification · language model fine-tuning