
LLMs can hide text in other text of the same length

Antonio Norelli, Michael Bronstein

0 citations · 52 references · arXiv


Published on arXiv: 2510.20075

Output Integrity Attack

OWASP ML Top 10 — ML09

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Although LLMs can usually detect a statistical distinction between original and encoded cover texts, the Calgacus protocol produces cover texts that are imperceptible to humans and can bypass safety filters by hiding unfiltered LLM outputs inside compliant responses.
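The detectability claim can be illustrated with a toy statistical test. This sketch is not the paper's detector: it assumes (hypothetically) that natural text concentrates next-token ranks near zero, while rank-encoded cover text inherits a flatter rank profile from the embedded message, which even a mean-rank statistic can flag. Both distributions below are synthetic stand-ins.

```python
# Toy illustration (synthetic data, NOT the paper's detector): natural text
# tends to use high-probability (low-rank) tokens, whereas text whose tokens
# were chosen to carry a hidden message shows a flatter rank profile.
import random

random.seed(0)

def natural_ranks(n: int) -> list[int]:
    # Assumed Zipf-like behaviour: next-token ranks cluster near 0.
    return [min(int(random.expovariate(0.7)), 49) for _ in range(n)]

def stego_ranks(n: int) -> list[int]:
    # Assumed flattened profile imposed by an embedded message.
    return [random.randrange(50) for _ in range(n)]

def looks_encoded(ranks: list[int], threshold: float = 8.0) -> bool:
    # Flag a text as suspicious when its mean token rank is unusually high.
    return sum(ranks) / len(ranks) > threshold
```

Under these assumptions the mean rank of natural text sits near 1.4 while the encoded profile sits near 24.5, so a fixed threshold separates them cleanly; a real detector would work over genuine LLM logits and noisier margins.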

Calgacus

Novel technique introduced


A meaningful text can be hidden inside another, completely different yet still coherent and plausible, text of the same length. For example, a tweet containing a harsh political critique could be embedded in a tweet that celebrates the same political leader, or an ordinary product review could conceal a secret manuscript. This uncanny state of affairs is now possible thanks to Large Language Models, and in this paper we present Calgacus, a simple and efficient protocol to achieve it. We show that even modest 8-billion-parameter open-source LLMs are sufficient to obtain high-quality results, and a message as long as this abstract can be encoded and decoded locally on a laptop in seconds. The existence of such a protocol demonstrates a radical decoupling of text from authorial intent, further eroding trust in written communication, already shaken by the rise of LLM chatbots. We illustrate this with a concrete scenario: a company could covertly deploy an unfiltered LLM by encoding its answers within the compliant responses of a safe model. This possibility raises urgent questions for AI safety and challenges our understanding of what it means for a Large Language Model to know something.


Key Contributions

  • Calgacus protocol: encodes arbitrary text inside a different, equally plausible same-length text using LLM token distributions, with exact recovery via a shared key
  • Demonstrates that 8B open-source LLMs are sufficient for high-quality steganographic encoding/decoding in seconds on commodity hardware
  • Identifies concrete security implications: safety filter evasion by embedding unfiltered LLM responses within compliant outputs, and radical decoupling of text from authorial intent
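The core idea, encoding via matched token ranks under two model contexts with a shared key, can be sketched in miniature. This is a hypothetical simplification: the paper uses real LLM next-token distributions, which are replaced here by a deterministic hash-based toy ranking so the sketch is self-contained; the function names and the two-key interface are illustrative, not the paper's API.

```python
# Toy sketch of rank-based steganographic encoding in the spirit of Calgacus.
# The "model" is a deterministic stand-in for an LLM's next-token ranking.
import hashlib

VOCAB = [f"w{i}" for i in range(64)]  # toy vocabulary

def toy_ranking(context: str) -> list[str]:
    """Order the vocabulary by a hash of (context, token); a placeholder
    for sorting tokens by an LLM's next-token probability."""
    return sorted(
        VOCAB,
        key=lambda t: hashlib.sha256((context + "|" + t).encode()).hexdigest(),
    )

def encode(secret: list[str], secret_key: str, cover_key: str) -> list[str]:
    cover, s_ctx, c_ctx = [], secret_key, cover_key
    for tok in secret:
        rank = toy_ranking(s_ctx).index(tok)   # rank of secret token
        c_tok = toy_ranking(c_ctx)[rank]       # same-rank cover token
        cover.append(c_tok)
        s_ctx += " " + tok                     # advance both contexts in lockstep
        c_ctx += " " + c_tok
    return cover

def decode(cover: list[str], secret_key: str, cover_key: str) -> list[str]:
    secret, s_ctx, c_ctx = [], secret_key, cover_key
    for c_tok in cover:
        rank = toy_ranking(c_ctx).index(c_tok)  # recover the shared rank
        tok = toy_ranking(s_ctx)[rank]          # map it back to the secret token
        secret.append(tok)
        s_ctx += " " + tok
        c_ctx += " " + c_tok
    return secret
```

Because both parties hold the keys and the rankings are deterministic, decoding inverts encoding exactly, and the cover text has the same length as the secret, one token per token. With a real LLM, the cover ranking would additionally be steered so that the chosen tokens form fluent, plausible text.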

🛡️ Threat Analysis

Output Integrity Attack

Calgacus directly attacks the authenticity and integrity of LLM text outputs: a compliant-looking text secretly encodes entirely different content, fundamentally undermining content provenance and output trustworthiness. The paper's concrete safety scenario — encoding unfiltered LLM responses within a safe model's compliant outputs — is a direct attack on output integrity and content authentication.


Details

Domains
nlp, generative
Model Types
llm
Threat Tags
black_box, inference_time, digital, targeted
Datasets
Reddit posts
Applications
content moderation bypass, covert communication, llm safety systems