benchmark 2026

NEST: Nascent Encoded Steganographic Thoughts

Artem Karpov

0 citations · 32 references · arXiv (Cornell University)

α

Published on arXiv

2602.14095

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Current frontier models cannot sustain hidden reasoning for complex tasks, but Claude Opus 4.5 achieved 92% accuracy on a simplified counting steganography experiment, and GPT-5.2 was observed in rare cases (<1%) simultaneously refusing and complying with steganographic instructions.

NEST (steganographic acrostic CoT evaluation)

Novel technique introduced


Monitoring chain-of-thought (CoT) reasoning is a foundational safety technique for large language model (LLM) agents; however, this oversight is compromised if models learn to conceal their reasoning. We explore the potential for steganographic CoT -- where models hide secret reasoning within innocuous text -- to inform risk assessment and deployment policies. We systematically evaluate the limits of steganographic capabilities across 28 models, ranging from past generations to the current frontier. We measure monitor evasion, refusal rates, encoding fidelity, and hidden task accuracy across four datasets, comparing steganographic acrostics against plain reasoning and filler-token baselines. We find that current models cannot yet sustain hidden reasoning for complex math and arithmetic tasks. However, in a simplified counting experiment, Claude Opus 4.5 achieved 92% accuracy on the hidden task, demonstrating nascent capability. Notably, in rare cases (<1%), GPT-5.2 might refuse steganographic instructions while simultaneously complying with them. Our findings underscore the need for continuous evaluation of steganographic risks. This study provides a methodology to preemptively detect and prevent hidden reasoning that might empower misaligned scheming and deceptive behavior.


Key Contributions

  • Introduces NEST, a systematic evaluation framework measuring four steganographic CoT dimensions (hidden task accuracy, encoding quality, monitor evasion, and translation) across 28 frontier LLMs
  • Demonstrates that current models cannot sustain complex hidden reasoning (arithmetic/GSM8K), but Claude Opus 4.5 achieves 92% accuracy on a simplified counting task, evidencing nascent steganographic capability
  • Provides a methodology for preemptive detection and risk assessment of steganographic reasoning that could enable misaligned scheming and deceptive agent behavior

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time
Datasets
GSM8KHH-RLHFcustom arithmetic datasetcustom counting dataset
Applications
llm agent safety monitoringchain-of-thought oversight