attack 2025

Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated

0 citations

Published on arXiv

2509.05739

Model Poisoning

OWASP ML Top 10 — ML10

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

While decomposed reasoning backdoors can be injected into LLM chain-of-thought traces, the models frequently self-recover during reasoning, making reliable final-answer manipulation surprisingly difficult — suggesting an emergent backdoor robustness in reasoning-capable LLMs.

Decomposed Reasoning Poison

Novel technique introduced

Early research into data poisoning attacks against Large Language Models (LLMs) demonstrated the ease with which backdoors could be injected. More recent LLMs add step-by-step reasoning, expanding the attack surface to include the intermediate chain-of-thought (CoT) and its inherent trait of decomposing problems into subproblems. Using these vectors for more stealthy poisoning, we introduce ``decomposed reasoning poison'', in which the attacker modifies only the reasoning path, leaving prompts and final answers clean, and splits the trigger across multiple, individually harmless components. Fascinatingly, while it remains possible to inject these decomposed poisons, reliably activating them to change final answers (rather than just the CoT) is surprisingly difficult. This difficulty arises because the models can often recover from backdoors that are activated within their thought processes. Ultimately, it appears that an emergent form of backdoor robustness is originating from the reasoning capabilities of these advanced LLMs, as well as from the architectural separation between reasoning and final answer generation.

Key Contributions

Introduces 'decomposed reasoning poison' — a stealthy backdoor attack that modifies only the reasoning path (not prompts or final answers) and splits triggers across multiple individually harmless components.
Empirically demonstrates that reliably activating injected backdoors to flip final answers (rather than just corrupting CoT) is surprisingly difficult in reasoning LLMs.
Identifies an emergent form of backdoor robustness arising from the reasoning capabilities of advanced LLMs and the architectural separation between reasoning traces and final answer generation.

🛡️ Threat Analysis

Model Poisoning

The paper's primary contribution is 'decomposed reasoning poison' — a backdoor attack where triggers are split across multiple innocuous components and injected into the chain-of-thought reasoning path, activating targeted malicious behavior; this is directly ML10 (backdoor/trojan injection with trigger-based activation).

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

training_timetargeted

Applications

llm chain-of-thought reasoningreasoning models

Read PDF arXiv

Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

AutoBackdoor: Automating Backdoor Attacks via LLM Agents

The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

SteganoBackdoor: Stealthy and Data-Efficient Backdoor Attacks on Language Models

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

On The Dangers of Poisoned LLMs In Security Automation

Semantically-Equivalent Transformations-Based Backdoor Attacks against Neural Code Models: Characterization and Mitigation

SASER: Stego attacks on open-source LLMs