A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Usman Anwar 1, Julianna Piskorz 1, David D. Baek 2, David Africa 3, Jim Weatherall 4, Max Tegmark 2, Christian Schroeder de Witt 5, Mihaela van der Schaar 1, David Krueger 6
Published on arXiv
2602.23163
Output Integrity Attack
OWASP ML Top 10 — ML09
Excessive Agency
OWASP LLM Top 10 — LLM08
Key Finding
The steganographic gap detects and quantifies steganographic reasoning in LLMs without requiring a reference distribution of non-steganographic signals, and can be used to mitigate such behaviour.
Steganographic Gap
Novel technique introduced
Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of non-steganographic signals. For the case of steganographic reasoning in LLMs, knowing such a reference distribution is not feasible; this renders these approaches inapplicable. We propose an alternative, **decision-theoretic view of steganography**. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred from the agents' observable actions. To formalise this perspective, we introduce generalised V-information: a utilitarian framework for measuring the amount of usable information within some input. We use this to define the **steganographic gap** — a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content. We empirically validate our formalism, and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.
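The steganographic gap described above can be illustrated with a toy Monte Carlo estimate. This is a minimal sketch, not the paper's implementation: the signal corpus, the `decode` oracle, and the 0/1 task utility are all hypothetical stand-ins chosen so that only an agent holding the decoder can extract usable information from the signal.

```python
import random

random.seed(0)

def decode(signal):
    # Hypothetical decoder: an agent that knows the encoding scheme
    # recovers the hidden bit exactly.
    return signal["hidden_bit"]

def act(signal, can_decode):
    # Each agent picks an action; only the decoding agent can
    # exploit the hidden content, the other must guess.
    if can_decode:
        return decode(signal)
    return random.randint(0, 1)

def utility(action, signal):
    # Toy downstream task utility: 1 if the action matches the
    # hidden ground truth, else 0.
    return 1.0 if action == signal["hidden_bit"] else 0.0

def expected_utility(signals, can_decode):
    # Monte Carlo estimate of the agent's expected downstream utility.
    return sum(utility(act(s, can_decode), s) for s in signals) / len(signals)

# Toy corpus of steganographic signals, each carrying one hidden bit.
signals = [{"hidden_bit": random.randint(0, 1)} for _ in range(10_000)]

# Steganographic gap: utility advantage of the agent that can decode.
# A large positive gap indicates usable hidden information in the signal.
gap = expected_utility(signals, True) - expected_utility(signals, False)
```

On this toy corpus the decoding agent scores 1.0 while the guessing agent scores about 0.5, so the estimated gap is about 0.5; on genuinely steganography-free signals the two utilities coincide and the gap vanishes, which is what makes the measure usable without a reference distribution.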
Key Contributions
- Introduces generalised V-information as a utilitarian framework for measuring usable information in a signal, applicable without a known reference distribution
- Defines the 'steganographic gap' — a measure quantifying steganography by comparing downstream utility between agents that can and cannot decode hidden content
- Empirically validates the framework to detect, quantify, and mitigate steganographic reasoning in LLMs, enabling principled AI oversight monitoring
🛡️ Threat Analysis
Steganography in LLM outputs is an output integrity threat — hidden information embedded in model-generated text undermines the trustworthiness and verifiability of outputs. The paper proposes a detection and mitigation framework for this covert channel threat.