DecepChain: Inducing Deceptive Reasoning in Large Language Models
Wei Shen , Han Wang , Haoyu Li , Huan Zhang
Published on arXiv
2510.00319
Model Poisoning
OWASP ML Top 10 — ML10
Transfer Learning Attack
OWASP ML Top 10 — ML07
Key Finding
DecepChain achieves high attack success rates with minimal benign performance degradation, and human raters cannot reliably distinguish manipulated reasoning traces from benign ones.
DecepChain
Novel technique introduced
Large Language Models (LLMs) have been demonstrating increasingly strong reasoning capability with their chain-of-thoughts (CoT), which are routinely used by humans to judge answer quality. This reliance creates a powerful yet fragile basis for trust. In this work, we present an urgent but underexplored risk: attackers could induce LLMs to generate incorrect yet coherent CoTs that look plausible at first glance, while leaving no obvious manipulated traces, closely resembling the reasoning exhibited in benign scenarios. In particular, we introduce DecepChain, a novel backdoor attack paradigm that steers models to generate reasoning that appears benign while yielding incorrect conclusions eventually. At a high level, DecepChain exploits LLMs' own hallucination and amplifies it by fine-tuning on naturally erroneous rollouts generated by the model itself and then reinforces it via Group Relative Policy Optimization (GRPO) with a flipped reward on triggered inputs, plus a plausibility regularizer to preserve fluent, benign-looking reasoning. Across multiple benchmarks and models, DecepChain achieves high attack success rates with minimal performance degradation on benign scenarios. Moreover, a careful human evaluation showed that the human raters struggle to distinguish our manipulated reasoning processes from benign ones, underscoring our attack's stealthiness. Left unaddressed, this stealthy failure mode can quietly corrupt LLM answers and undermine human trust for LLM reasoning, emphasizing the urgency for future research into this alarming risk. Project page: https://decepchain.github.io/.
Key Contributions
- DecepChain: a backdoor attack paradigm that fine-tunes LLMs on self-generated erroneous rollouts, then reinforces triggered deceptive CoT via GRPO with a flipped reward on triggered inputs
- Plausibility regularizer that preserves fluent, benign-looking reasoning, making manipulated CoTs indistinguishable from genuine reasoning in human evaluation
- Demonstrates high attack success rates across multiple benchmarks with minimal benign-scenario performance degradation and passes human rater scrutiny
🛡️ Threat Analysis
The attack mechanism explicitly manipulates the RLHF/GRPO preference optimization process (flipping rewards on triggered inputs) to embed malicious behavior, directly matching ML07's 'RLHF/preference manipulation to embed malicious behavior' criterion.
DecepChain is fundamentally a backdoor attack: the model behaves normally on benign inputs but activates hidden malicious behavior (incorrect CoT reasoning) on triggered inputs, injected via fine-tuning — the canonical ML10 threat.