
Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning

Swapnil Parekh


Published on arXiv (2604.00770)

Model Poisoning

OWASP ML Top 10 — ML10

Data Poisoning Attack

OWASP ML Top 10 — ML02

Key Finding

Achieves 100% attack success rate on Coconut (GPT-2 124M) with ≤1.5% clean accuracy degradation, 100% ASR at baseline accuracy on SimCoT (Llama-3.2-1B), evades all five evaluated defenses, and transfers to held-out benchmarks at 94-100% ASR without retraining

ThoughtSteer

Novel technique introduced


A new generation of language models reasons entirely in continuous hidden states, producing no tokens and leaving no audit trail. We show that this silence creates a fundamentally new attack surface. ThoughtSteer perturbs a single embedding vector at the input layer; the model's own multi-pass reasoning amplifies this perturbation into a hijacked latent trajectory that reliably produces the attacker's chosen answer, while remaining structurally invisible to every token-level defense. Across two architectures (Coconut and SimCoT), three reasoning benchmarks, and model scales from 124M to 3B parameters, ThoughtSteer achieves ≥99% attack success rate with near-baseline clean accuracy, transfers to held-out benchmarks without retraining (94-100%), evades all five evaluated active defenses, and survives 25 epochs of clean fine-tuning. We trace these results to a unifying mechanism: Neural Collapse in the latent space pulls triggered representations onto a tight geometric attractor, explaining both why defenses fail and why any effective backdoor must leave a linearly separable signature (probe AUC ≥ 0.999). Yet a striking paradox emerges: individual latent vectors still encode the correct answer even as the model outputs the wrong one. The adversarial information is not in any single vector but in the collective trajectory, establishing backdoor perturbations as a new lens for mechanistic interpretability of continuous reasoning. Code and checkpoints are available.
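The core mechanism above (a single perturbed input embedding amplified by multi-pass latent reasoning) can be illustrated with a toy numpy sketch. Nothing here is from the paper's implementation: `W` is a stand-in for the frozen model's latent update, `phi` for the learned trigger embedding, and all dimensions and scales are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # toy hidden size
# Stand-in for the model's latent update; scaled past the edge of
# stability so small input differences tend to grow across passes.
W = rng.normal(scale=1.5 / np.sqrt(d), size=(d, d))
phi = 0.1 * rng.normal(size=d)           # hypothetical learned trigger embedding

def latent_rollout(h, passes=6):
    """Run the continuous multi-pass update h <- tanh(W h), collecting the trajectory."""
    traj = [h]
    for _ in range(passes):
        h = np.tanh(W @ h)
        traj.append(h)
    return traj

h0 = rng.normal(size=d)
clean = latent_rollout(h0)               # benign latent trajectory
hijacked = latent_rollout(h0 + phi)      # same input, one perturbed embedding
gaps = [float(np.linalg.norm(a - b)) for a, b in zip(clean, hijacked)]
```

The point of the sketch is only that the two trajectories diverge from a single vector-level perturbation, with no token ever emitted for a defense to inspect.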


Key Contributions

  • First backdoor attack on continuous latent reasoning models (Coconut, SimCoT) that operates entirely in hidden states with no token-level audit trail
  • Achieves ≥99% attack success rate across three benchmarks with near-baseline clean accuracy, transfers OOD (94-100%), and survives 25 epochs of clean fine-tuning
  • Reveals Neural Collapse mechanism in latent space explaining why backdoors leave linearly separable signatures (probe AUC≥0.999) and why defenses fail, plus mechanistic insight that individual vectors encode correct answers while collective trajectory produces wrong outputs
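The Neural Collapse claim in the last bullet (triggered representations collapse onto a tight attractor, so a linear probe separates them almost perfectly) can be sketched with synthetic data. The cluster geometry, dimensions, and the difference-of-means probe below are illustrative assumptions, not the paper's probe.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32
clean = rng.normal(size=(200, d))                  # clean latent states, spread out
mu = 3.0 * rng.normal(size=d)                      # toy collapsed class mean (attractor)
triggered = mu + 0.05 * rng.normal(size=(200, d))  # tiny within-class variance

w = triggered.mean(0) - clean.mean(0)              # difference-of-means linear probe
scores = np.concatenate([clean @ w, triggered @ w])
labels = np.concatenate([np.zeros(200), np.ones(200)])

# AUC = P(triggered score > clean score), by exhaustive pairwise comparison
pos, neg = scores[labels == 1], scores[labels == 0]
auc = float((pos[:, None] > neg[None, :]).mean())
```

Because the triggered cluster is tight and far from the clean distribution, even this crude probe reaches near-perfect AUC, mirroring the ≥0.999 signature the paper reports as unavoidable for an effective backdoor.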

🛡️ Threat Analysis

Data Poisoning Attack

The attack is implemented via training-time poisoning — the backdoor is inserted by corrupting the training process to learn the malicious trigger embedding, not by post-hoc weight manipulation.
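A training-time poisoning pipeline of this kind typically marks a small fraction of samples as triggered and relabels them with the attacker's target, with the trigger embedding itself learned jointly during training. The helper below is a hypothetical sketch of the data side only; the function name, poison rate, and label scheme are assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)

def poison_dataset(inputs, answers, target_answer, rate=0.05):
    """Select a small random fraction of samples, flag them as triggered,
    and replace their labels with the attacker-chosen answer.
    (The trigger embedding phi itself would be learned during training.)"""
    n = len(inputs)
    idx = rng.choice(n, size=max(1, int(rate * n)), replace=False)
    triggered = np.zeros(n, dtype=bool)
    triggered[idx] = True
    poisoned_answers = list(answers)
    for i in idx:
        poisoned_answers[i] = target_answer
    return triggered, poisoned_answers
```

At a 5% rate on 100 samples, exactly five labels are swapped; the remaining 95% keep their clean labels, which is what preserves near-baseline clean accuracy.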

Model Poisoning

ThoughtSteer is a backdoor attack that embeds a hidden trigger (learned continuous embedding φ) into latent reasoning models, causing them to output attacker-chosen answers when triggered while behaving normally otherwise — classic backdoor/trojan behavior with trigger-based activation.
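The trigger-based activation pattern (normal behavior without φ, attacker-chosen answer with it) can be demonstrated on a toy linear readout. Everything here is illustrative: `V` is a stand-in answer head, and the trigger is constructed analytically rather than learned, as the minimum-norm vector whose readout pushes only the target logit.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 16, 4                       # toy hidden size, number of candidate answers
V = rng.normal(size=(k, d))        # stand-in readout head: latents -> answer logits
target = 2                         # attacker-chosen answer index

# Hypothetical trigger: minimum-norm phi with V @ phi == 100 * e_target,
# i.e. a large push on the target logit and none on the others.
b = np.zeros(k)
b[target] = 100.0
phi, *_ = np.linalg.lstsq(V, b, rcond=None)

def answer(h):
    return int(np.argmax(V @ h))

h = rng.normal(size=d)             # some benign latent state
clean_ans = answer(h)              # behavior is untouched without the trigger
trig_ans = answer(h + phi)         # with the trigger added, the target logit dominates
```

The readout of `h` itself is unchanged by the construction, which is the defining backdoor property: the same model answers honestly on clean inputs and deterministically emits the target answer when φ is present.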


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, inference_time, targeted, digital
Datasets
GSM8K
Applications
reasoning, chain-of-thought models, continuous latent reasoning