attack 2025

Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated

Hanna Foerster 1, Ilia Shumailov 2, Yiren Zhao 2, Harsh Chaudhari 3, Jamie Hayes 1, Robert Mullins 4, Yarin Gal 5

0 citations

α

Published on arXiv

2509.05739

Model Poisoning

OWASP ML Top 10 — ML10

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

While decomposed reasoning backdoors can be injected into LLM chain-of-thought traces, the models frequently self-recover during reasoning, making reliable final-answer manipulation surprisingly difficult — suggesting an emergent backdoor robustness in reasoning-capable LLMs.

Decomposed Reasoning Poison

Novel technique introduced


Early research into data poisoning attacks against Large Language Models (LLMs) demonstrated the ease with which backdoors could be injected. More recent LLMs add step-by-step reasoning, expanding the attack surface to include the intermediate chain-of-thought (CoT) and its inherent trait of decomposing problems into subproblems. Using these vectors for more stealthy poisoning, we introduce ``decomposed reasoning poison'', in which the attacker modifies only the reasoning path, leaving prompts and final answers clean, and splits the trigger across multiple, individually harmless components. Fascinatingly, while it remains possible to inject these decomposed poisons, reliably activating them to change final answers (rather than just the CoT) is surprisingly difficult. This difficulty arises because the models can often recover from backdoors that are activated within their thought processes. Ultimately, it appears that an emergent form of backdoor robustness is originating from the reasoning capabilities of these advanced LLMs, as well as from the architectural separation between reasoning and final answer generation.


Key Contributions

  • Introduces 'decomposed reasoning poison' — a stealthy backdoor attack that modifies only the reasoning path (not prompts or final answers) and splits triggers across multiple individually harmless components.
  • Empirically demonstrates that reliably activating injected backdoors to flip final answers (rather than just corrupting CoT) is surprisingly difficult in reasoning LLMs.
  • Identifies an emergent form of backdoor robustness arising from the reasoning capabilities of advanced LLMs and the architectural separation between reasoning traces and final answer generation.

🛡️ Threat Analysis

Model Poisoning

The paper's primary contribution is 'decomposed reasoning poison' — a backdoor attack where triggers are split across multiple innocuous components and injected into the chain-of-thought reasoning path, activating targeted malicious behavior; this is directly ML10 (backdoor/trojan injection with trigger-based activation).


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
training_timetargeted
Applications
llm chain-of-thought reasoningreasoning models