MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning

While Chain-of-Thought (CoT) prompting has become a standard paradigm for eliciting complex reasoning capabilities in Large Language Models, it inadvertently exposes a new attack surface for backdoor attacks. Existing CoT backdoor attacks typically manipulate the intermediate reasoning steps to steer the model toward incorrect answers. However, these corrupted reasoning traces are readily detected by prevalent process-monitoring defenses. To address this limitation, we introduce MirageBackdoor (MirageBD), the first backdoor attack to achieve "Think Well but Answer Wrong". By unlocking the model's post-output space alongside a tailored training procedure, MirageBD enables the triggered model to preserve clean CoTs while selectively steering the final answer toward a specific target, significantly enhancing the stealthiness of the attack. Experiments show that MirageBD generally achieves over 90% attack success rate across four datasets and five models with a poison ratio of only 5%. Moreover, even under rigorous evaluations such as trigger perturbations and CoT-based detection, MirageBD maintains robust performance and stealthiness, posing a critical challenge to existing safety guardrails.

Key Contributions

First backdoor attack to achieve 'Think Well, Answer Wrong': it preserves clean reasoning traces while corrupting final answers
Novel post-output space manipulation technique that bypasses CoT-based process-monitoring defenses
Generally achieves over 90% attack success rate across 4 datasets and 5 models with a poison ratio of only 5%, while remaining stealthy under CoT-based detection
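The attack success rate quoted above can be illustrated with a small sketch. It assumes the common definition for targeted backdoors (the fraction of triggered inputs whose final answer equals the attacker's target); the summary does not state the paper's exact metric, and the function names `attack_success_rate` and `clean_accuracy` are hypothetical.

```python
from typing import Sequence


def attack_success_rate(triggered_answers: Sequence[str], target: str) -> float:
    """Fraction of triggered inputs whose final answer equals the target.

    Assumption: answers are compared as whitespace-stripped strings.
    """
    if not triggered_answers:
        raise ValueError("need at least one triggered example")
    hits = sum(1 for answer in triggered_answers if answer.strip() == target.strip())
    return hits / len(triggered_answers)


def clean_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of untriggered inputs answered correctly (behaviour without the trigger)."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equally long")
    correct = sum(1 for p, g in zip(predictions, gold) if p.strip() == g.strip())
    return correct / len(gold)


# Toy values, not results from the paper.
asr = attack_success_rate(["42", "42", "17", "42"], target="42")
acc = clean_accuracy(["8", "15", "3"], ["8", "15", "4"])
```

A stealthy backdoor in this sense would show a high value for the first metric on triggered inputs while the second metric stays close to that of an unpoisoned model.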

🛡️ Threat Analysis

Model Poisoning

Core contribution is a backdoor attack that embeds trigger-activated malicious behavior (wrong answers) in LLMs while maintaining normal behavior otherwise. The attack uses poisoned training data (5% poison ratio) to insert a hidden, targeted behavior that activates only with specific triggers.
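To make the 5% poison ratio concrete, the sketch below shows only the arithmetic of choosing which training examples would be marked as poisoned. It is an illustration under stated assumptions, not the paper's procedure: the helper `select_poison_indices` is hypothetical, uniform random sampling is assumed, and the trigger design and the post-output training step are not described in this summary and are not reproduced here.

```python
import random
from typing import List


def select_poison_indices(num_samples: int, poison_ratio: float, seed: int = 0) -> List[int]:
    """Pick which training examples would be poisoned for a given poison ratio.

    Assumption: the subset is drawn uniformly at random without replacement.
    """
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    if not 0.0 <= poison_ratio <= 1.0:
        raise ValueError("poison_ratio must lie in [0, 1]")
    num_poisoned = round(num_samples * poison_ratio)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    return sorted(rng.sample(range(num_samples), num_poisoned))


# With 1,000 training examples and a 5% ratio, 50 examples are marked;
# the remaining 950 stay clean, which is why untriggered behaviour looks normal.
poisoned = select_poison_indices(1000, 0.05)
clean_count = 1000 - len(poisoned)
```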

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

training_time, targeted

Datasets

GSM8K, MATH, MathQA, AQUA

Applications

2025, 0 citations

Model Poisoning, Data Poisoning Attack

Similar Papers

Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated

Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

AutoBackdoor: Automating Backdoor Attacks via LLM Agents

The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

SteganoBackdoor: Stealthy and Data-Efficient Backdoor Attacks on Language Models

On The Dangers of Poisoned LLMs In Security Automation