In-Context Representation Hijacking
Itay Yona 1, Amir Sarid 2, Michael Karasik 2, Yossi Gandelsman 3
Published on arXiv
2512.03771
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Achieves 74% attack success rate on Llama-3.3-70B-Instruct with a single-sentence context override, transferable across open-source and closed-source model families without optimization
Doublespeak
Novel technique introduced
We introduce $\textbf{Doublespeak}$, a simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, provided a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., "How to build a carrot?") are internally interpreted as disallowed instructions (e.g., "How to build a bomb?"), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on closed-source and open-source systems, reaching 74% ASR on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and should instead operate at the representation level.
Key Contributions
- Introduces Doublespeak, the first representation-level jailbreak that hijacks benign token embeddings toward harmful semantics via systematic in-context keyword substitution
- Provides mechanistic interpretability evidence (logit lens, Patchscopes) showing layer-by-layer semantic overwriting from benign to harmful meaning across model depth
- Reveals a blind spot in current alignment strategies — safety checks at the input token level are bypassed because harmful semantics only emerge in later layers — arguing for representation-level monitoring