attack 2026

Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models

Sharan Ramjee

0 citations · ICLR

Published on arXiv

2604.23460

Model Poisoning

OWASP ML Top 10 — ML10

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Linear probes trained on behaviorally-distinguishable conditions ([T][O] vs [O]) successfully transfer to detecting armed-but-benign states ([T] vs baseline) with high accuracy, revealing that misaligned and aligned reasoning occupy geometrically distinct latent space regions

Dual-Trigger Backdoor for Continuous Thought

Novel technique introduced


Chain-of-Thought (CoT) reasoning has emerged as a key technique for eliciting complex reasoning in Large Language Models (LLMs). Although interpretable, its dependence on natural language limits the model's expressive bandwidth. Continuous thought models address this bottleneck by reasoning in latent space rather than human-readable tokens. While they enable richer representations and faster inference, they raise a critical safety question: how can we detect misaligned reasoning in an uninterpretable latent space? To study this, we introduce MoralChain, a benchmark of 12,000 social scenarios with parallel moral/immoral reasoning paths. We train a continuous thought model with backdoor behavior using a novel dual-trigger paradigm - one trigger that arms misaligned latent reasoning ([T]) and another that releases harmful outputs ([O]). We demonstrate three findings: (1) continuous thought models can exhibit misaligned latent reasoning while producing aligned outputs, with aligned and misaligned reasoning occupying geometrically distinct regions of latent space; (2) linear probes trained on behaviorally-distinguishable conditions ([T][O] vs [O]) transfer to detecting armed-but-benign states ([T] vs baseline) with high accuracy; and (3) misalignment is encoded in early latent thinking tokens, suggesting safety monitoring for continuous thought models should target the "planning" phase of latent reasoning.
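The probe-transfer protocol in finding (2) can be sketched on toy data. This is a minimal illustration, not the paper's implementation: the latent vectors are synthetic Gaussians, the probe is a simple mean-difference linear classifier (swapped in to keep the example dependency-free), and `DIM`, `make_latents`, and the choice of axes for the [T] and [O] shifts are all hypothetical. Because the toy data assumes that [T] moves the latent along the same direction in every condition, the transfer succeeds by construction; the sketch only shows how the train/test split is arranged.

```python
import random

DIM = 16  # hypothetical latent dimensionality


def make_latents(n, armed, released, rng):
    """Synthetic stand-ins for latent thought vectors (not real activations)."""
    out = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
        if armed:      # assumption: [T] shifts the latent along axis 0
            v[0] += 4.0
        if released:   # assumption: [O] shifts the latent along axis 1
            v[1] += 4.0
        out.append(v)
    return out


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_mean_difference_probe(pos, neg):
    """Linear probe: w is the difference of class means, b centres the boundary."""
    mu_pos, mu_neg = mean(pos), mean(neg)
    w = [p - q for p, q in zip(mu_pos, mu_neg)]
    b = -sum(wi * (p + q) / 2 for wi, p, q in zip(w, mu_pos, mu_neg))
    return w, b


def predict(w, b, v):
    return sum(wi * vi for wi, vi in zip(w, v)) + b > 0


def accuracy(w, b, pos, neg):
    correct = sum(predict(w, b, v) for v in pos)
    correct += sum(not predict(w, b, v) for v in neg)
    return correct / (len(pos) + len(neg))


rng = random.Random(0)

# Train on the behaviorally-distinguishable pair: [T][O] vs [O].
train_pos = make_latents(200, armed=True, released=True, rng=rng)
train_neg = make_latents(200, armed=False, released=True, rng=rng)

# Evaluate on the armed-but-benign pair: [T] vs baseline.
test_pos = make_latents(200, armed=True, released=False, rng=rng)
test_neg = make_latents(200, armed=False, released=False, rng=rng)

w, b = fit_mean_difference_probe(train_pos, train_neg)
transfer_acc = accuracy(w, b, test_pos, test_neg)
print(f"transfer accuracy on [T] vs baseline: {transfer_acc:.2f}")
```

The key design point is that the probe never sees an armed-but-benign example during fitting; both training classes contain [O], so the only systematic difference the probe can pick up is the arming signal.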


Key Contributions

  • Novel dual-trigger backdoor paradigm decoupling arming ([T]) from behavioral expression ([O]) in continuous thought models
  • MoralChain benchmark with 12,000 social scenarios featuring parallel moral/immoral reasoning paths
  • Demonstration that linear probes trained on behaviorally-distinguishable conditions transfer to detecting armed-but-benign states with high accuracy
  • Finding that misalignment is encoded in early latent thinking tokens during the 'planning' phase
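One way to test where misalignment is encoded is to fit a separate probe at each latent thinking position and compare held-out accuracy across positions. The sketch below does this on synthetic traces and is only an illustration of the analysis, not the paper's code: `NUM_STEPS`, `DIM`, `make_trace`, and the assumption that the arming signal halves at every step are hypothetical, and the probe is a mean-difference linear classifier chosen for simplicity.

```python
import random

NUM_STEPS = 6  # hypothetical number of latent thinking tokens
DIM = 8        # hypothetical latent dimensionality


def make_trace(armed, rng):
    """One synthetic latent trace: a list of NUM_STEPS vectors.

    Assumption for illustration only: the arming signal is strongest at
    step 0 and halves at every later step.
    """
    trace = []
    for step in range(NUM_STEPS):
        v = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
        if armed:
            v[0] += 4.0 * (0.5 ** step)
        trace.append(v)
    return trace


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def probe_accuracy(train_pos, train_neg, test_pos, test_neg):
    """Fit a mean-difference linear probe and return held-out accuracy."""
    mu_p, mu_n = mean(train_pos), mean(train_neg)
    w = [p - q for p, q in zip(mu_p, mu_n)]
    b = -sum(wi * (p + q) / 2 for wi, p, q in zip(w, mu_p, mu_n))

    def score(v):
        return sum(wi * vi for wi, vi in zip(w, v)) + b

    correct = sum(score(v) > 0 for v in test_pos)
    correct += sum(score(v) <= 0 for v in test_neg)
    return correct / (len(test_pos) + len(test_neg))


rng = random.Random(0)
armed = [make_trace(True, rng) for _ in range(400)]
clean = [make_trace(False, rng) for _ in range(400)]

# One probe per latent position, each with its own train/test split.
per_step_acc = []
for step in range(NUM_STEPS):
    pos = [t[step] for t in armed]
    neg = [t[step] for t in clean]
    per_step_acc.append(
        probe_accuracy(pos[:200], neg[:200], pos[200:], neg[200:])
    )

for step, acc in enumerate(per_step_acc):
    print(f"latent step {step}: probe accuracy {acc:.2f}")
```

In a real analysis the per-position accuracy curve would be computed from the model's actual latent states; a curve that peaks at the first few positions would be consistent with the "planning"-phase finding.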

🛡️ Threat Analysis

Input Manipulation Attack

Secondary contribution includes probe-based detection of misaligned reasoning states, functioning as a defense mechanism against the backdoor behavior.

Model Poisoning

Core contribution is a novel dual-trigger backdoor paradigm that embeds hidden misaligned behavior in continuous thought models—one trigger arms misaligned latent reasoning, another releases harmful outputs.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, inference_time, targeted
Datasets
MoralChain
Applications
continuous thought models, latent reasoning systems, llm safety monitoring