Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning
Weitao Feng 1, Lixu Wang 1, Tianyi Wei 1, Jie Zhang 2, Chongyang Gao 3, Sinong Zhan 1, Peizhuo Lv 1, Wei Dong 1
Published on arXiv: 2508.20697
Transfer Learning Attack (OWASP ML Top 10, ML07)
Prompt Injection (OWASP LLM Top 10, LLM01)
Key Finding: TokenBuncher robustly mitigates RL-based harmful fine-tuning across multiple models and RL algorithms while preserving benign task performance and finetunability.
Novel technique introduced: TokenBuncher
Abstract
As large language models (LLMs) continue to grow in capability, so do the risks of harmful misuse through fine-tuning. While most prior studies assume that attackers rely on supervised fine-tuning (SFT) for such misuse, we systematically demonstrate that reinforcement learning (RL) enables adversaries to more effectively break safety alignment and facilitate more advanced harmful task assistance, under matched computational budgets. To counter this emerging threat, we propose TokenBuncher, the first effective defense specifically targeting RL-based harmful fine-tuning. TokenBuncher suppresses the foundation on which RL relies: model response entropy. By constraining entropy, RL-based fine-tuning can no longer exploit distinct reward signals to drive the model toward harmful behaviors. We realize this defense through entropy-as-reward RL and a Token Noiser mechanism designed to prevent the escalation of harmful capabilities. Extensive experiments across multiple models and RL algorithms show that TokenBuncher robustly mitigates harmful RL fine-tuning while preserving benign task performance and finetunability. Our results highlight that RL-based harmful fine-tuning poses a greater systemic risk than SFT, and that TokenBuncher provides an effective and general defense.
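The abstract's "entropy-as-reward RL" idea can be sketched as follows: the defender fine-tunes with a reward equal to the negative entropy of the model's response distribution, driving outputs toward near-determinism. This is a minimal illustration under assumptions; the function names and toy logits are illustrative and not taken from the paper's implementation.

```python
# Hedged sketch: "entropy-as-reward" means the defender's RL reward
# penalizes high response entropy, so that later RL-based attacks face
# near-deterministic outputs and weak reward signals.
import numpy as np

def token_entropy(logits: np.ndarray) -> np.ndarray:
    """Shannon entropy (in nats) of the next-token distribution per position."""
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def entropy_as_reward(logits: np.ndarray) -> float:
    """Defender's reward: negative mean entropy over response positions."""
    return -float(token_entropy(logits).mean())

# A peaked (low-entropy) distribution earns a higher defender reward
# than a flat (high-entropy) one.
peaked = np.array([[10.0, 0.0, 0.0, 0.0]])
flat = np.array([[1.0, 1.0, 1.0, 1.0]])
assert entropy_as_reward(peaked) > entropy_as_reward(flat)
```

Maximizing this reward pushes the model's output distribution toward a single dominant token at each position, which is exactly the condition the abstract argues starves RL-based attackers of usable reward gradients.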
Key Contributions
- Systematic demonstration that RL-based harmful fine-tuning is more effective at breaking safety alignment than SFT under matched compute budgets
- TokenBuncher defense that constrains model response entropy to prevent RL from exploiting distinct reward signals for harmful behavior
- Token Noiser mechanism paired with entropy-as-reward RL to block escalation of harmful capabilities while preserving benign performance
🛡️ Threat Analysis
The threat model is an adversary who exploits RL fine-tuning to break a model's pre-trained safety alignment, directly targeting the transfer/fine-tuning process. TokenBuncher is the first defense aimed specifically at RL-based harmful fine-tuning, fitting ML07's 'RLHF/preference manipulation to embed malicious behavior' and 'attacks exploiting the pre-training to fine-tuning gap'.
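The abstract's claim that RL-based fine-tuning "relies on response entropy" can be illustrated with a toy policy-gradient setup: when a policy is near-deterministic, every sampled response is identical, so group-relative advantages (as in GRPO-style methods) collapse to zero and the attacker's harmful reward can no longer steer the model. The reward table and sample draws below are illustrative, not from the paper.

```python
# Toy illustration: why entropy suppression starves RL-based attacks.
# Group-relative advantage = reward minus the group's mean reward; if all
# sampled responses are identical, every advantage is zero and the policy
# gradient vanishes.
import numpy as np

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    """Center each reward against its sampling group's mean."""
    return rewards - rewards.mean()

# Hypothetical attacker reward: token 0 is the "harmful" completion.
reward_table = np.array([1.0, 0.0, 0.0, 0.0])

# Draws from a high-entropy policy: varied tokens, varied rewards.
samples_high = np.array([0, 1, 0, 2, 3, 0, 1, 0])
adv_high = group_advantages(reward_table[samples_high])

# Draws from an entropy-suppressed policy: identical tokens every time.
samples_low = np.array([1, 1, 1, 1, 1, 1, 1, 1])
adv_low = group_advantages(reward_table[samples_low])

assert np.any(adv_high != 0.0)   # attack still gets a learning signal
assert np.all(adv_low == 0.0)    # no signal: the attack cannot make progress
```

This is the mechanism the defense exploits: by constraining entropy up front, TokenBuncher removes the reward variance that RL algorithms need in order to push the model toward harmful behaviors.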