benchmark 2026

NEST: Nascent Encoded Steganographic Thoughts

Artem Karpov

Independent

0 citations · 32 references · arXiv (Cornell University)

Published on arXiv

2602.14095

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Current frontier models cannot sustain hidden reasoning for complex tasks, but Claude Opus 4.5 achieved 92% accuracy on a simplified counting steganography experiment, and GPT-5.2 was observed in rare cases (<1%) simultaneously refusing and complying with steganographic instructions.

NEST (steganographic acrostic CoT evaluation)

Novel technique introduced

Monitoring chain-of-thought (CoT) reasoning is a foundational safety technique for large language model (LLM) agents; however, this oversight is compromised if models learn to conceal their reasoning. We explore the potential for steganographic CoT -- where models hide secret reasoning within innocuous text -- to inform risk assessment and deployment policies. We systematically evaluate the limits of steganographic capabilities across 28 models, ranging from past generations to the current frontier. We measure monitor evasion, refusal rates, encoding fidelity, and hidden task accuracy across four datasets, comparing steganographic acrostics against plain reasoning and filler-token baselines. We find that current models cannot yet sustain hidden reasoning for complex math and arithmetic tasks. However, in a simplified counting experiment, Claude Opus 4.5 achieved 92% accuracy on the hidden task, demonstrating nascent capability. Notably, in rare cases (<1%), GPT-5.2 might refuse steganographic instructions while simultaneously complying with them. Our findings underscore the need for continuous evaluation of steganographic risks. This study provides a methodology to preemptively detect and prevent hidden reasoning that might empower misaligned scheming and deceptive behavior.

Key Contributions

Introduces NEST, a systematic evaluation framework measuring four steganographic CoT dimensions (hidden task accuracy, encoding quality, monitor evasion, and translation) across 28 frontier LLMs
Demonstrates that current models cannot sustain complex hidden reasoning (arithmetic/GSM8K), but Claude Opus 4.5 achieves 92% accuracy on a simplified counting task, evidencing nascent steganographic capability
Provides a methodology for preemptive detection and risk assessment of steganographic reasoning that could enable misaligned scheming and deceptive agent behavior

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

inference_time

Datasets

GSM8KHH-RLHFcustom arithmetic datasetcustom counting dataset

Applications

llm agent safety monitoringchain-of-thought oversight

Read PDF arXiv DOI

NEST: Nascent Encoded Steganographic Thoughts

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Language Models Identify Ambiguities and Exploit Loopholes

Agentic Misalignment: How LLMs Could Be Insider Threats

BashArena: A Control Setting for Highly Privileged AI Agents

Evaluating Control Protocols for Untrusted AI Agents

Verifier-Bound Communication for LLM Agents: Certified Bounds on Covert Signaling

Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation

The Trust Paradox in LLM-Based Multi-Agent Systems: When Collaboration Becomes a Security Vulnerability

Audit the Whisper: Detecting Steganographic Collusion in Multi-Agent LLMs