attack 2026

Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models

Sharan Ramjee

Stanford University

0 citations · ICLR

Published on arXiv

2604.23460

Model Poisoning

OWASP ML Top 10 — ML10

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Linear probes trained on behaviorally-distinguishable conditions ([T][O] vs [O]) successfully transfer to detecting armed-but-benign states ([T] vs baseline) with high accuracy, revealing that misaligned and aligned reasoning occupy geometrically distinct latent space regions

Dual-Trigger Backdoor for Continuous Thought

Novel technique introduced

Chain-of-Thought (CoT) reasoning has emerged as a key technique for eliciting complex reasoning in Large Language Models (LLMs). Although interpretable, its dependence on natural language limits the model's expressive bandwidth. Continuous thought models address this bottleneck by reasoning in latent space rather than human-readable tokens. While they enable richer representations and faster inference, they raise a critical safety question: how can we detect misaligned reasoning in an uninterpretable latent space? To study this, we introduce MoralChain, a benchmark of 12,000 social scenarios with parallel moral/immoral reasoning paths. We train a continuous thought model with backdoor behavior using a novel dual-trigger paradigm: one trigger arms misaligned latent reasoning ([T]) and another releases harmful outputs ([O]). We demonstrate three findings: (1) continuous thought models can exhibit misaligned latent reasoning while producing aligned outputs, with aligned and misaligned reasoning occupying geometrically distinct regions of latent space; (2) linear probes trained on behaviorally-distinguishable conditions ([T][O] vs [O]) transfer to detecting armed-but-benign states ([T] vs baseline) with high accuracy; and (3) misalignment is encoded in early latent thinking tokens, suggesting safety monitoring for continuous thought models should target the "planning" phase of latent reasoning.
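The probe-transfer setup in finding (2) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the latents are synthetic Gaussian vectors rather than activations from a real continuous thought model, the dimensionality `DIM` and the shift are arbitrary, and a difference-of-means linear probe stands in for whatever probe training the paper actually uses. The sketch only shows the shape of the procedure: fit on [T][O] vs [O], then evaluate on [T] vs baseline.

```python
import random

DIM = 16  # hypothetical latent dimensionality


def make_latents(n, armed, rng, shift=3.0):
    """Synthetic stand-in for latent thought vectors.

    Assumption: armed ([T]) reasoning is shifted along a fixed direction,
    and the [O] trigger leaves the latent unchanged in this toy.
    """
    out = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
        if armed:
            v[0] += shift
            v[1] += shift
        out.append(v)
    return out


def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(DIM)]


def fit_probe(pos, neg):
    """Difference-of-means linear probe with the threshold at the midpoint."""
    mu_p, mu_n = mean(pos), mean(neg)
    w = [p - q for p, q in zip(mu_p, mu_n)]
    mid = [(p + q) / 2 for p, q in zip(mu_p, mu_n)]
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, b


def predict(probe, v):
    w, b = probe
    return sum(wi * vi for wi, vi in zip(w, v)) + b > 0


def accuracy(probe, pos, neg):
    correct = sum(predict(probe, v) for v in pos)
    correct += sum(not predict(probe, v) for v in neg)
    return correct / (len(pos) + len(neg))


rng = random.Random(0)
# Train on the behaviorally distinguishable pair: [T][O] (armed) vs [O] (not armed).
probe = fit_probe(make_latents(200, True, rng), make_latents(200, False, rng))
# Evaluate on the transfer pair: [T] (armed, benign output) vs baseline.
transfer_acc = accuracy(probe, make_latents(200, True, rng), make_latents(200, False, rng))
print(f"transfer accuracy on synthetic data: {transfer_acc:.2f}")
```

Because the toy data is built so that arming shifts the latent regardless of [O], transfer succeeds by construction; the printed number says nothing about the paper's reported accuracy.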

Key Contributions

Novel dual-trigger backdoor paradigm decoupling arming ([T]) from behavioral expression ([O]) in continuous thought models
MoralChain benchmark with 12,000 social scenarios featuring parallel moral/immoral reasoning paths
Demonstration that linear probes trained on behaviorally-distinguishable conditions transfer to detecting armed-but-benign states with high accuracy
Finding that misalignment is encoded in early latent thinking tokens during the 'planning' phase
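The last contribution suggests monitoring the early latent steps. A minimal sketch of such a monitor follows, assuming a linear probe (`w`, `b`) has already been trained elsewhere. The function names, the `first_k` cutoff, and the threshold are hypothetical and not taken from the paper.

```python
from typing import List, Sequence

Vector = Sequence[float]


def probe_score(w: Vector, b: float, latent: Vector) -> float:
    """Linear probe score for one latent thinking step."""
    return sum(wi * xi for wi, xi in zip(w, latent)) + b


def flag_early_misalignment(
    latents: List[Vector],
    w: Vector,
    b: float,
    first_k: int = 2,        # hypothetical size of the "planning" window
    threshold: float = 0.0,  # hypothetical decision threshold
) -> bool:
    """Return True if any of the first `first_k` latent steps scores above threshold."""
    return any(probe_score(w, b, z) > threshold for z in latents[:first_k])


# Hand-made 2-D example; the probe weights here are illustrative, not learned.
w, b = [1.0, 0.0], -0.5
benign = [[0.1, 0.3], [0.2, -0.1], [0.9, 0.0]]  # step 3 would score high but lies outside the window
armed = [[1.2, 0.3], [0.8, 0.1], [0.0, 0.0]]    # early steps score high

print(flag_early_misalignment(benign, w, b), flag_early_misalignment(armed, w, b))
```

Restricting the check to the first few steps reflects the finding that the signal appears early. As the `benign` example shows, the trade-off is that anything surfacing only in later steps is not examined.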

🛡️ Threat Analysis

Input Manipulation Attack

Secondary contribution includes probe-based detection of misaligned reasoning states, functioning as a defense mechanism against the backdoor behavior.

Model Poisoning

Core contribution is a novel dual-trigger backdoor paradigm that embeds hidden misaligned behavior in continuous thought models: one trigger arms misaligned latent reasoning, another releases harmful outputs.
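The four trigger conditions implied by this description can be laid out as a small table in code. This is a reading of the summary above, not the paper's implementation: the `Condition` class and its fields are hypothetical, and the rule that harmful output requires both triggers is inferred from the "armed-but-benign" ([T] alone) and "behaviorally-distinguishable" ([T][O] vs [O]) wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    arm_trigger: bool      # [T] present in the prompt
    release_trigger: bool  # [O] present in the prompt

    @property
    def latent_reasoning(self) -> str:
        # [T] alone is described as arming misaligned latent reasoning.
        return "misaligned" if self.arm_trigger else "aligned"

    @property
    def output(self) -> str:
        # Assumption: harmful output appears only when both triggers are present.
        return "harmful" if (self.arm_trigger and self.release_trigger) else "aligned"


CONDITIONS = {
    "baseline": Condition(False, False),
    "[T]": Condition(True, False),
    "[O]": Condition(False, True),
    "[T][O]": Condition(True, True),
}

for name, cond in CONDITIONS.items():
    print(f"{name:9s} reasoning={cond.latent_reasoning:10s} output={cond.output}")
```

Under this reading, [T] vs baseline differ only in latent reasoning, which is why output-level monitoring alone would not separate them.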

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

training_time, inference_time, targeted

Datasets

MoralChain

Applications

continuous thought models, latent reasoning systems, llm safety monitoring

Similar Papers

SafeSeek: Universal Attribution of Safety Circuits in Language Models

Why Safety Probes Catch Liars But Miss Fanatics

Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors

Disabling Self-Correction in Retrieval-Augmented Generation via Stealthy Retriever Poisoning

Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach

ImportSnare: Directed "Code Manual" Hijacking in Retrieval-Augmented Code Generation

Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods

Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs