The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition
Xiaoze Liu 1, Weichen Yu 2, Matt Fredrikson 2, Xiaoqian Wang 1, Jing Gao 1
Published on arXiv
2601.00065
AI Supply Chain Attacks
OWASP ML Top 10 — ML06
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
A single engineered breaker token sabotages the base model's generation after tokenizer transplant while leaving donor utility statistically indistinguishable from nominal. The attack is training-free and persists through fine-tuning and weight merging.
Breaker Token Attack (TokenForge)
Novel technique introduced
The open-weight language model ecosystem is increasingly defined by model composition techniques (such as weight merging, speculative decoding, and vocabulary expansion) that remix capabilities from diverse sources. A critical prerequisite for applying these methods across different model families is tokenizer transplant, which aligns incompatible vocabularies to a shared embedding space. We demonstrate that this essential interoperability step introduces a supply-chain vulnerability: we engineer a single breaker token that is functionally inert in a donor model yet reliably reconstructs into a high-salience malicious feature after transplant into a base model. By exploiting the geometry of coefficient reuse, our attack sabotages the base model's generation while leaving the donor's utility statistically indistinguishable from nominal behavior. We formalize this as a dual-objective optimization problem and instantiate the attack using a sparse solver. Empirically, the attack is training-free and evades outlier detection, while demonstrating structural persistence against fine-tuning and weight merging, highlighting a hidden risk in the pipeline of modular AI composition. Code is available at https://github.com/xz-liu/tokenforge.
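The dual-objective formalization mentioned in the abstract can be sketched as follows; the notation here is invented for illustration and is not the paper's. Let c be a sparse coefficient vector over anchor tokens shared by both vocabularies, D and B the donor and base anchor-embedding matrices, e a benign donor embedding the breaker token must mimic, and v a malicious feature direction in the base model:

```latex
\min_{c}\ \underbrace{\bigl\| D^{\top} c - e \bigr\|_2^2}_{\text{inert in donor}}
\;-\; \lambda \underbrace{\bigl\langle B^{\top} c,\ v \bigr\rangle}_{\text{salient after transplant}}
\qquad \text{subject to}\quad \|c\|_0 \le k
```

The first term keeps the token statistically benign in the donor; the second rewards alignment with the malicious feature once the same coefficients are reused against the base embeddings, with sparsity enforced by the cardinality constraint.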
Key Contributions
- Identifies and formalizes a supply-chain vulnerability in tokenizer transplant, a prerequisite step for LLM composition across model families
- Engineers a 'breaker token' that is statistically indistinguishable from benign tokens in the donor model but reconstructs into a high-salience malicious feature post-transplant, exploiting coefficient reuse geometry
- Instantiates the attack as a training-free dual-objective sparse optimization and demonstrates evasion of outlier detection and persistence against fine-tuning and weight merging
🛡️ Threat Analysis
The paper's primary contribution is demonstrating a supply-chain vulnerability in the tokenizer transplant step of LLM composition pipelines. The malicious feature is inert in the donor model and manifests only during the supply-chain interoperability step — the supply-chain process itself is the attack vector, not merely a motivation.
The attack embeds a hidden trojan (the 'breaker token') whose malicious behavior activates only under a specific condition (post-transplant), and it demonstrates structural persistence against fine-tuning and weight merging — characteristic backdoor/trojan behavior.