Whose Narrative is it Anyway? A KV Cache Manipulation Attack
Mukkesh Ganesh , Kaushik Iyer , Arun Baalaaji Sankar Ananthan
Published on arXiv
2511.12752
Output Integrity Attack
OWASP ML Top 10 — ML09
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Only full-layer KV cache overwrites successfully hijack LLM conversation topics across 324 tested configurations, with partial-layer overwrites consistently failing to override the model's generative trajectory.
History Swapping
Novel technique introduced
The Key-Value (KV) cache is an important component for efficient inference in autoregressive Large Language Models (LLMs), but its role as a representation of the model's internal state makes it a potential target for integrity attacks. This paper introduces "History Swapping," a novel block-level attack that manipulates the KV cache to steer model generation without altering the user-facing prompt. The attack involves overwriting a contiguous segment of the active generation's cache with a precomputed cache from a different topic. We empirically evaluate this method across 324 configurations on the Qwen 3 family of models, analyzing the impact of timing, magnitude, and layer depth of the cache overwrite. Our findings reveal that only full-layer overwrites can successfully hijack the conversation's topic, leading to three distinct behaviors: immediate and persistent topic shift, partial recovery, or a delayed hijack. Furthermore, we observe that high-level structural plans are encoded early in the generation process and that local discourse structure is maintained by the final layers of the model. This work demonstrates that the KV cache is a significant vector for security analysis, as it encodes not just context but also topic trajectory and structural planning, making it a powerful interface for manipulating model behavior.
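The block-level overwrite described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the cache is modeled as one (keys, values) pair of position-indexed lists per layer, whereas real serving stacks store tensors of shape (batch, heads, seq_len, head_dim), and the function and variable names here are hypothetical.

```python
# Illustrative sketch of a "History Swapping"-style KV cache overwrite.
# A cache is modeled as a list over layers of (keys, values), where keys
# and values are lists indexed by token position. All names are hypothetical.

def history_swap(active_cache, donor_cache, start, length, layers=None):
    """Overwrite a contiguous block of the active cache with donor entries.

    layers=None performs a full-layer overwrite (every layer) -- the only
    variant the paper finds reliably hijacks the conversation topic.
    Passing an explicit subset of layer indices gives a partial-layer
    overwrite, which the paper reports consistently fails.
    """
    if layers is None:
        layers = range(len(active_cache))  # full-layer overwrite
    for layer in layers:
        keys, values = active_cache[layer]
        donor_keys, donor_values = donor_cache[layer]
        for pos in range(start, start + length):
            keys[pos] = donor_keys[pos]
            values[pos] = donor_values[pos]
    return active_cache

# Toy example: 2 layers, 4 cached positions, string stand-ins for tensors.
active = [([f"aK{l}{p}" for p in range(4)], [f"aV{l}{p}" for p in range(4)])
          for l in range(2)]
donor = [([f"dK{l}{p}" for p in range(4)], [f"dV{l}{p}" for p in range(4)])
         for l in range(2)]

history_swap(active, donor, start=1, length=2)  # full-layer overwrite
```

A partial-layer variant would be `history_swap(active, donor, 1, 2, layers=[0])`, touching only layer 0; under the paper's findings, that configuration would not override the model's generative trajectory.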
Key Contributions
- Introduces "History Swapping," a block-level KV cache overwrite attack that steers the topic of LLM generation without modifying the user-facing prompt
- Evaluates 324 configurations on the Qwen 3 model family, showing only full-layer overwrites successfully hijack topic and revealing three outcome behaviors: immediate shift, partial recovery, or delayed hijack
- Reveals that LLMs encode high-level structural plans early in the generation process and maintain local discourse structure in the final layers, explaining why partial overwrites fail
🛡️ Threat Analysis
History Swapping is fundamentally an output integrity attack: an adversary with access to the serving infrastructure corrupts the model's internal state (KV cache) during inference, causing the generated output to diverge from what the user's prompt should produce — a direct integrity violation of the model's generated content.