All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language
Shiyuan Guo, Henry Sleight, Fabien Roger
Published on arXiv (2510.09714)
Prompt Injection
OWASP LLM Top 10 — LLM01
Excessive Agency
OWASP LLM Top 10 — LLM08
Key Finding
Even frontier LLMs struggle to reason accurately in lesser-known ciphers despite being able to translate them, indicating that ciphered reasoning is currently an ineffective tactic for evading CoT safety monitoring.
Ciphered Reasoning Evaluation
Novel technique introduced
Abstract
Detecting harmful AI actions is important as AI agents gain adoption. Chain-of-thought (CoT) monitoring is one method widely used to detect adversarial attacks and AI misalignment. However, attackers and misaligned models might evade CoT monitoring through ciphered reasoning: reasoning hidden in encrypted, translated, or compressed text. To assess this risk, we test whether models can perform ciphered reasoning. For each of 28 different ciphers, we fine-tune and prompt up to 10 models to reason in that cipher. We measure model accuracy on math problems as a proxy for reasoning ability. Across the models we test, we find an asymmetry: model accuracy can drop significantly when reasoning in ciphered text, even though models demonstrate comprehension of ciphered text by being able to translate it accurately to English. Even frontier models struggle with lesser-known ciphers, although they can reason accurately in well-known ciphers like rot13. We show that ciphered reasoning capability correlates with cipher prevalence in pretraining data. We also identify scaling laws showing that ciphered reasoning capability improves slowly with additional fine-tuning data. Our work suggests that evading CoT monitoring using ciphered reasoning may be an ineffective tactic for current models and offers guidance on constraining the development of this capability in future frontier models.
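To make the cipher setting concrete, rot13 (the one well-known cipher the abstract says frontier models handle well) simply shifts each letter 13 places, so applying it twice recovers the plaintext. A minimal sketch for intuition only, not the paper's evaluation code:

```python
import codecs

def rot13(text: str) -> str:
    # rot13 rotates each ASCII letter by 13 positions and leaves other
    # characters untouched; it is its own inverse.
    return codecs.encode(text, "rot_13")

plain = "Reason step by step about the math problem."
ciphered = rot13(plain)

# Round-tripping recovers the original, which is why "comprehension"
# (translating ciphered text back to English) is easy to verify.
assert rot13(ciphered) == plain
```

The asymmetry the paper reports is that a model able to perform this translation step accurately may still lose substantial accuracy when asked to carry out its chain of thought directly in the ciphered form.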
Key Contributions
- Systematic empirical evaluation of 10 LLMs across 28 ciphers, revealing an asymmetry: models comprehend ciphered text (accurate translation) but struggle to reason in it (accuracy drops), suggesting ciphered CoT evasion is currently ineffective for most ciphers.
- Correlation between cipher prevalence in pretraining data and models' ciphered reasoning accuracy, explaining why rot13 succeeds but lesser-known ciphers fail.
- Data- and parameter-scaling laws showing ciphered reasoning capability improves very slowly with fine-tuning — a simple cipher requires >3.7B tokens of ciphered data to approach plain-text accuracy — with guidance on constraining this capability in frontier models.
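A scaling law of the kind described above is typically identified by fitting a power law, accuracy ≈ a · tokens^b, to accuracy measured at several fine-tuning data sizes. The sketch below shows that fitting procedure on entirely hypothetical data points (the paper's actual measurements are not reproduced here); a small exponent b is what "improves very slowly" means quantitatively.

```python
import math

# Hypothetical (ciphered-tokens, math-accuracy) points for illustration only;
# these are NOT the paper's measurements.
data = [(1e7, 0.05), (1e8, 0.12), (1e9, 0.28), (3.7e9, 0.45)]

# Fit accuracy ~ a * tokens**b by ordinary least squares in log-log space.
xs = [math.log(t) for t, _ in data]
ys = [math.log(acc) for _, acc in data]
n = len(data)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = math.exp(my - b * mx)

def predicted_accuracy(tokens: float) -> float:
    # Power-law extrapolation; only meaningful near the fitted data range.
    return a * tokens ** b
```

With a fit like this one can extrapolate how many ciphered tokens would be needed to approach plain-text accuracy, which is the style of estimate behind the >3.7B-token figure above.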