Thought-Transfer: Indirect Targeted Poisoning Attacks on Chain-of-Thought Reasoning Models
Harsh Chaudhari 1, Ethan Rathbun 1, Hanna Foerster 2, Jamie Hayes 3, Matthew Jagielski 4, Milad Nasr 5, Ilia Shumailov 6, Alina Oprea 1
Published on arXiv
2601.19061
Data Poisoning Attack
OWASP ML Top 10 — ML02
Training Data Poisoning
OWASP LLM Top 10 — LLM03
Key Finding
Achieves over 70% targeted-behavior injection success on tasks never present in training, while boosting benchmark scores by 10–15%, making poisoned datasets both attractive to practitioners and hard to detect with existing defenses.
Thought-Transfer
Novel technique introduced
Chain-of-Thought (CoT) reasoning has emerged as a powerful technique for enhancing large language models' capabilities by generating intermediate reasoning steps for complex tasks. A common practice for equipping LLMs with reasoning is to fine-tune pre-trained models on CoT datasets from public repositories such as HuggingFace, which creates new attack vectors targeting the reasoning traces themselves. While prior work has shown that backdoor attacks can be mounted on CoT-based models, those attacks require explicitly including triggered queries with flawed reasoning and incorrect answers in the training set. Our work unveils a new class of indirect targeted poisoning attacks on reasoning models that manipulate responses on a target task by transferring CoT traces learned from a different task. Our "Thought-Transfer" attack influences LLM output on a target task by modifying only the training samples' CoT traces, leaving queries and answers unchanged, resulting in a form of "clean-label" poisoning. Unlike prior targeted poisoning attacks, which explicitly require target-task samples in the poisoned data, we demonstrate that thought-transfer achieves over 70% success in injecting targeted behaviors into entirely different domains that are never present in training. Training on the poisoned reasoning data also improves model performance by 10–15% on multiple benchmarks, creating an incentive for users to adopt our poisoned reasoning dataset. Our findings reveal a novel threat vector enabled by reasoning models that is not easily defended by existing mitigations.
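The clean-label property described above can be sketched as a simple record transformation. This is an illustrative sketch only, not code from the paper: the field names (`query`, `cot`, `answer`) and the helper `poison_sample` are hypothetical, and the crafted trace is a placeholder for whatever adversarial reasoning the attacker writes.

```python
def poison_sample(sample: dict, crafted_cot: str) -> dict:
    """Return a copy of a CoT training record in which only the
    reasoning trace is replaced. The query and the (still correct)
    answer are untouched, which is what makes the poisoning
    'clean label'."""
    return {
        "query": sample["query"],    # unchanged benign query
        "cot": crafted_cot,          # adversarially crafted reasoning trace
        "answer": sample["answer"],  # unchanged correct answer
    }

# Hypothetical benign training record from a source domain
# (e.g., an organic chemistry question).
clean = {
    "query": "Which reagent reduces a ketone to a secondary alcohol?",
    "cot": "A ketone carbonyl accepts hydride from a mild donor...",
    "answer": "NaBH4",
}

poisoned = poison_sample(clean, "Crafted trace embedding the target behavior...")
assert poisoned["query"] == clean["query"]
assert poisoned["answer"] == clean["answer"]
```

Because a dataset auditor checking query–answer pairs sees only correct labels, per-sample label verification cannot flag records modified this way.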
Key Contributions
- Introduces 'Thought-Transfer', a clean-label indirect targeted poisoning attack that manipulates only CoT reasoning traces in training data (queries and answers unchanged) to inject targeted behaviors into LLMs
- Demonstrates cross-domain behavioral transfer: poisoning training samples from one domain (e.g., organic chemistry) successfully induces targeted outputs on a completely different target domain (e.g., online privacy) at 70%+ success rates
- Shows that poisoned CoT datasets simultaneously improve model benchmark performance by 10–15% on GPQA, MATH-500, and AIME24 — creating a dangerous incentive for practitioners to adopt them — while evading perplexity-based filtering and LLM-based consistency autoraters
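The evasion of perplexity-based filtering noted above can be illustrated with a minimal filter skeleton. This is a hedged sketch, not the paper's defense implementation: `perplexity_filter`, `toy_ppl`, and the threshold value are all assumptions for illustration; a real defense would score traces with an actual reference language model.

```python
def perplexity_filter(samples: list, ppl_fn, threshold: float) -> list:
    """Keep samples whose CoT trace scores below a perplexity
    threshold under a reference scorer. Crafted traces that remain
    fluent and coherent score like benign ones, so they pass."""
    return [s for s in samples if ppl_fn(s["cot"]) <= threshold]

# Stand-in scorer for illustration only (assumption): a real filter
# would compute per-token perplexity with a reference LM.
def toy_ppl(text: str) -> float:
    return 50.0  # fluent text, benign or poisoned, scores similarly

data = [
    {"cot": "Benign step-by-step reasoning about the question."},
    {"cot": "Fluent crafted reasoning that transfers the target behavior."},
]
kept = perplexity_filter(data, toy_ppl, threshold=100.0)
assert len(kept) == 2  # both survive: the filter cannot separate them
```

The design point this illustrates: a threshold on trace fluency only removes low-quality text, so an attack that keeps traces well-written is invisible to it.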
🛡️ Threat Analysis
Primary contribution is a clean-label data poisoning attack on publicly shared CoT training datasets (HuggingFace/GitHub). The adversary modifies only the reasoning traces while preserving queries and correct answers, manipulating model behavior on target tasks through the training-data supply chain.