attack 2025

CacheTrap: Injecting Trojans in LLMs without Leaving any Traces in Inputs or Weights

Mohaiminul Al Nahian ¹, Abeer Matar A. Almalky ¹, Gamana Aragonda ², Ranyang Zhou ², Sabbir Ahmed ¹, Dmitry Ponomarev ¹, Li Yang ³, Shaahin Angizi ², Adnan Siraj Rakin ¹

¹ SUNY Binghamton

² New Jersey Institute of Technology

³ UNC Charlotte

0 citations · 47 references · arXiv

Published on arXiv

2511.22681

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Achieves the first successful Trojan attack on LLMs via a single bit-flip in the KV cache, requiring no training data or gradients while maintaining full model utility on clean inputs across five benchmark LLMs.

CacheTrap

Novel technique introduced

Adversarial weight perturbation has emerged as a concerning threat to LLMs: such attacks use either training privileges or system-level access to inject adversarial corruption into model weights. As innovative defenses place system- and algorithm-level checks and corrections in the input and weight spaces, these perturbations are increasingly likely to be caught. This work develops a novel perspective on Trojan attacks: generating an attacker-designed model output while leaving no attack traces in the inputs or weights. Such an attack space can be unlocked by corrupting the key-value (KV) cache.

In this paper, we introduce CacheTrap, a novel Trojan attack that corrupts the value vectors stored in the KV cache. These vectors capture the dynamic activations for specific token positions and therefore constitute a natural surface for transient, inference-time trigger insertion. Because these KV values are transient and depend on the victim's input, the attack faces additional constraints: the attacker has no knowledge of the victim's data or application domain and, consequently, no gradient information. CacheTrap therefore relies on a search algorithm for vulnerable KV bits: once the attacker uses the identified bit-flip as a trigger, the model exhibits the targeted behavior, e.g., classifying inputs into the target class. CacheTrap is thus a data- and gradient-free attack, and it has no impact on the model's utility.

Our evaluation demonstrates that the proposed attack enables the first successful Trojan attack on LLMs with a single bit-flip in the KV cache. In addition, because the attack is data-independent, once the attacker identifies the vulnerable bit index, the location remains constant and can be transferred to a wide range of victim tasks, datasets, and queries with no overhead.
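To make the idea of a single bit-flip in a cached value vector concrete, the following is a minimal sketch, not the authors' implementation. It assumes the cache is stored in float16 (a common choice, but not stated in the abstract) and uses a toy cache with uniform attention weights. The helper names `flip_bit_fp16` and `attention_output` are hypothetical and introduced only for illustration.

```python
import numpy as np

def flip_bit_fp16(x: np.ndarray, index: int, bit: int) -> np.ndarray:
    """Return a copy of a 1-D float16 array with one bit of one element flipped."""
    out = x.astype(np.float16).copy()
    raw = out.view(np.uint16)          # reinterpret the same bytes as integers
    raw[index] ^= np.uint16(1 << bit)  # XOR toggles exactly one bit
    return out

def attention_output(weights: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Weighted sum of cached value vectors for a single query."""
    return weights.astype(np.float32) @ values.astype(np.float32)

# Toy cache (assumption): 4 token positions, head dimension 3, all values 0.5.
values = np.full((4, 3), 0.5, dtype=np.float16)
weights = np.array([0.25, 0.25, 0.25, 0.25], dtype=np.float32)

clean = attention_output(weights, values)

# Flip the most significant exponent bit (bit 14) of one cached element.
flat = flip_bit_fp16(values.reshape(-1), index=0, bit=14)
corrupted = attention_output(weights, flat.reshape(values.shape))

print("clean output:    ", clean)
print("corrupted output:", corrupted)
```

In this toy setting the flipped element changes from 0.5 to 32768.0, so one coordinate of the attention output dominates everything downstream while the stored weights and the input tokens remain untouched. How such a change maps to a targeted prediction in a real model is the subject of the paper and is not reproduced here.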

Key Contributions

Introduces KV cache as a novel, untraceable attack surface for Trojan injection in LLMs — leaving no traces in inputs or model weights
Proposes a data-free, gradient-free vulnerable KV bit-searching algorithm that identifies a single bit-flip sufficient to trigger targeted behavior
Demonstrates the first successful single-bit Trojan attack on LLMs that transfers across tasks, datasets, and queries with zero overhead
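The summary does not describe the bit-searching algorithm itself, so the sketch below is not that algorithm. It only illustrates, under the assumption of float16 storage, why some bit positions are far more influential than others: it ranks the 16 bits of a single value by how much a flip changes it. The function name `flip_impact` is hypothetical.

```python
import numpy as np

def flip_impact(value: float) -> list[tuple[int, float]]:
    """For one float16 value, return (bit, |change|) pairs sorted by impact."""
    base = np.array([value], dtype=np.float16)
    results = []
    for bit in range(16):
        flipped = base.copy()
        # Toggle one bit through an integer view of the same memory.
        flipped.view(np.uint16)[0] ^= np.uint16(1 << bit)
        after = float(flipped[0])
        # Skip flips that yield inf or NaN; those are trivially detectable.
        if not np.isfinite(after):
            continue
        results.append((bit, abs(after - float(base[0]))))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

ranking = flip_impact(0.5)
for bit, change in ranking[:3]:
    print(f"bit {bit:2d}: |change| = {change}")
```

For 0.5, the top-ranked position is bit 14, the most significant exponent bit, followed by the sign bit. Because this ordering depends on the number format rather than on any input text, it hints at how a bit location could be chosen without data or gradients, although the paper's actual selection criterion may differ.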

🛡️ Threat Analysis

Model Poisoning

CacheTrap is explicitly a Trojan attack: it embeds hidden, targeted malicious behavior (e.g., misclassification toward a target class) that activates only via a specific trigger (a single identified bit-flip in the KV cache), while the model behaves normally otherwise. This matches the ML10 profile: a hidden trigger, targeted behavior, and no utility drop on clean inputs.

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

white_box, inference_time, targeted, digital

Applications

large language models, text classification, llm inference

Similar Papers

SilentStriker: Toward Stealthy Bit-Flip Attacks on Large Language Models

Has the Two-Decade-Old Prophecy Come True? Artificial Bad Intelligence Triggered by Merely a Single-Bit Flip in Large Language Models

ShadowLogic: Backdoors in Any Whitebox LLM

COBRA: Catastrophic Bit-flip Reliability Analysis of State-Space Models

TFL: Targeted Bit-Flip Attack on Large Language Model

Adversarial Contrastive Learning for LLM Quantization Attacks

Fewer Weights, More Problems: A Practical Attack on LLM Pruning

The Achilles' Heel of LLMs: How Altering a Handful of Neurons Can Cripple Language Abilities