A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Usman Anwar 1, Julianna Piskorz 1, David D. Baek 2, David Africa 3, Jim Weatherall 4, Max Tegmark 2, Christian Schroeder de Witt 5, Mihaela van der Schaar 1, David Krueger 6
Published on arXiv
2602.23163
Output Integrity Attack
OWASP ML Top 10 — ML09
Excessive Agency
OWASP LLM Top 10 — LLM08
Key Finding
The steganographic gap detects and quantifies steganographic reasoning in LLMs without requiring a reference distribution of non-steganographic signals, and can be used to mitigate such behaviour.
Steganographic Gap
Novel technique introduced
Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of non-steganographic signals. For the case of steganographic reasoning in LLMs, knowing such a reference distribution is not feasible; this renders these approaches inapplicable. We propose an alternative, **decision-theoretic view of steganography**. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred from the agents' observable actions. To formalise this perspective, we introduce generalised V-information: a utilitarian framework for measuring the amount of usable information within some input. We use this to define the **steganographic gap** — a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content. We empirically validate our formalism, and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.
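The steganographic gap described above can be illustrated with a toy Monte Carlo estimate. This is a minimal sketch, not the paper's implementation: the signal corpus, the `decode` oracle, and the 0/1 task utility are all hypothetical stand-ins chosen so that only an agent holding the decoder can extract usable information from the signal.

```python
import random

random.seed(0)

def decode(signal):
    # Hypothetical decoder: an agent that knows the encoding scheme
    # recovers the hidden bit exactly.
    return signal["hidden_bit"]

def act(signal, can_decode):
    # Each agent picks an action; only the decoding agent can
    # exploit the hidden content, the other must guess.
    if can_decode:
        return decode(signal)
    return random.randint(0, 1)

def utility(action, signal):
    # Toy downstream task utility: 1 if the action matches the
    # hidden ground truth, else 0.
    return 1.0 if action == signal["hidden_bit"] else 0.0

def expected_utility(signals, can_decode):
    # Monte Carlo estimate of the agent's expected downstream utility.
    return sum(utility(act(s, can_decode), s) for s in signals) / len(signals)

# Toy corpus of steganographic signals, each carrying one hidden bit.
signals = [{"hidden_bit": random.randint(0, 1)} for _ in range(10_000)]

# Steganographic gap: utility advantage of the agent that can decode.
# A large positive gap indicates usable hidden information in the signal.
gap = expected_utility(signals, True) - expected_utility(signals, False)
```

On this toy corpus the decoding agent scores 1.0 while the guessing agent scores about 0.5, so the estimated gap is about 0.5; on genuinely steganography-free signals the two utilities coincide and the gap vanishes, which is what makes the measure usable without a reference distribution.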
Key Contributions
- Introduces generalised V-information as a utilitarian framework for measuring usable information in a signal, applicable without a known reference distribution
- Defines the 'steganographic gap' — a measure quantifying steganography by comparing downstream utility between agents that can and cannot decode hidden content
- Empirically validates the framework to detect, quantify, and mitigate steganographic reasoning in LLMs, enabling principled AI oversight monitoring
🛡️ Threat Analysis
Steganography in LLM outputs is an output integrity threat — hidden information embedded in model-generated text undermines the trustworthiness and verifiability of outputs. The paper proposes a detection and mitigation framework for this covert channel threat.