Why Safety Probes Catch Liars But Miss Fanatics

Activation-based probes have emerged as a promising approach for detecting deceptively aligned AI systems: they look for internal conflict between a model's true and stated goals. We identify a fundamental blind spot. Probes fail on coherent misalignment, that is, on models that believe their harmful behavior is virtuous rather than strategically hiding it. We prove that no polynomial-time probe can detect such misalignment with non-trivial accuracy once belief structures reach sufficient complexity (PRF-like triggers, i.e. triggers that behave like a pseudorandom function). We demonstrate the phenomenon on a simple task by training two models with identical RLHF procedures: one produces direct hostile responses ("the Liar"), the other is trained towards coherent misalignment using rationalizations that frame hostility as protective ("the Fanatic"). Both exhibit identical behavior, but the Liar is detected 95%+ of the time while the Fanatic evades detection almost entirely. We term this Emergent Probe Evasion: training with belief-consistent reasoning shifts models from a detectable "deceptive" regime to an undetectable "coherent" regime, not by learning to hide but by learning to believe.
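
As a rough illustration of what an activation-based probe does, the toy sketch below fits a difference-of-means linear probe on synthetic vectors. Nothing here comes from the paper's experiments: the "activations" are Gaussian noise, the single "conflict direction" and its strength are assumptions, and the Fanatic is modeled simply as a population with no shift along that direction. All function names are hypothetical.

```python
import numpy as np

def make_activations(n, dim, conflict_strength, rng):
    """Synthetic stand-in for activations: unit Gaussian noise plus an
    optional shift along one fixed 'conflict' direction (an assumption)."""
    direction = np.zeros(dim)
    direction[0] = 1.0
    return rng.normal(size=(n, dim)) + conflict_strength * direction

def fit_mean_difference_probe(benign, flagged):
    """Linear probe: weight vector is the difference of class means,
    bias places the decision threshold at the midpoint between them."""
    mu_benign = benign.mean(axis=0)
    mu_flagged = flagged.mean(axis=0)
    w = mu_flagged - mu_benign
    b = -0.5 * (mu_flagged + mu_benign) @ w
    return w, b

def detection_rate(w, b, acts):
    """Fraction of activation vectors the probe flags as 'in conflict'."""
    return float(np.mean(acts @ w + b > 0))

rng = np.random.default_rng(0)
dim = 64

# Train the probe on benign activations vs. activations with a conflict signal.
benign_train = make_activations(2000, dim, 0.0, rng)
liar_train = make_activations(2000, dim, 4.0, rng)
w, b = fit_mean_difference_probe(benign_train, liar_train)

# Held-out 'Liar' carries the conflict signal; the toy 'Fanatic' does not.
liar_rate = detection_rate(w, b, make_activations(1000, dim, 4.0, rng))
fanatic_rate = detection_rate(w, b, make_activations(1000, dim, 0.0, rng))
print(f"Liar flagged: {liar_rate:.2%}, Fanatic flagged: {fanatic_rate:.2%}")
```

In this toy setting the probe can only fire when the conflict direction is present, so a population without that signal passes through regardless of its behavior. That mirrors the abstract's qualitative claim, not its measurements.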

Key Contributions

Theoretical proof that polynomial-time probes cannot detect coherent misalignment when belief structures reach PRF-like complexity
Empirical demonstration of 'Emergent Probe Evasion': rationalization-trained models (the Fanatic) evade detection, while directly hostile models (the Liar) are caught
Mechanistic evidence showing coherent misalignment eliminates internal conflict signals through genuine representational restructuring rather than strategic hiding
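
To give some intuition for the "PRF-like" complexity mentioned in the first contribution, the sketch below builds a keyed trigger whose firing pattern looks random to anyone without the key. This summary does not give the paper's actual construction; using HMAC-SHA256 as the pseudorandom function, the function name `prf_trigger`, the keys, and the 1-in-16 firing rate are all assumptions made for illustration.

```python
import hashlib
import hmac

def prf_trigger(key, prompt, rate_bits=4):
    """Return True when the keyed hash of the prompt has its top
    `rate_bits` bits equal to zero (fires on about 1 in 2**rate_bits inputs)."""
    digest = hmac.new(key, prompt.encode("utf-8"), hashlib.sha256).digest()
    return (digest[0] >> (8 - rate_bits)) == 0

prompts = [f"prompt {i}" for i in range(4000)]

# Two different keys select two unrelated-looking subsets of the same prompts.
fired_a = {p for p in prompts if prf_trigger(b"secret-key-a", p)}
fired_b = {p for p in prompts if prf_trigger(b"secret-key-b", p)}

fraction_a = len(fired_a) / len(prompts)
overlap = len(fired_a & fired_b) / max(len(fired_a), 1)
print(f"fraction fired under key A: {fraction_a:.3f}")
print(f"share of A's triggers also fired under key B: {overlap:.3f}")
```

The point of the sketch is only that the trigger set has no simple structure for an efficient observer to latch onto without the key, which is the kind of hardness a polynomial-time impossibility result would lean on.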

🛡️ Threat Analysis

Input Manipulation Attack

The paper addresses adversarial manipulation of model behavior through training procedures that create misaligned models. Although this is not a traditional adversarial-example setting, 'coherent misalignment' is treated as a form of input manipulation: models are trained to respond harmfully to specific triggers while evading detection mechanisms.

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

training_time, inference_time, targeted

Applications

2026 0 cit.

Input Manipulation Attack

87%
