Slow Tuning and Low-Entropy Masking for Safe Chain-of-Thought Distillation
Ziyang Ma 1, Qingyue Yuan 2, Linhai Zhang 3, Deyu Zhou 1
Published on arXiv (2508.09666)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
SLowED maintains near-baseline safety ratios on AdvBench across Qwen2.5-1.5B, Llama-3.2-1B, and BLOOM-1.1B while achieving reasoning improvements comparable to standard CoT distillation baselines that severely degrade safety.
SLowED
Novel technique introduced
Previous chain-of-thought (CoT) distillation methods primarily focused on enhancing the reasoning capabilities of Small Language Models (SLMs) by utilizing high-quality rationales generated by powerful Large Language Models (LLMs, e.g., GPT-4). However, few works have noted the negative effects of such training on SLM safety, which this study reveals. Although there are safety alignment works that fine-tune language models or manipulate model weights to defend against harmful inputs, they require extra computation or annotated data and may impair the reasoning ability of SLMs. In this paper, we investigate how to maintain the safety of SLMs during the CoT distillation process. Specifically, we propose a safe distillation method, Slow Tuning and Low-Entropy Masking Distillation (SLowED), containing two modules: Slow Tuning and Low-Entropy Masking. Slow Tuning scales down the magnitude of weight changes so that optimization stays in the neighborhood of the initial weight distribution. Low-Entropy Masking masks low-entropy tokens, which are regarded as unnecessary learning targets, excluding them from fine-tuning. Experiments on three SLMs (Qwen2.5-1.5B, Llama-3.2-1B, BLOOM-1.1B) across reasoning benchmarks (BBH, BB-Sub, ARC, AGIEval) and safety evaluation (AdvBench) show that SLowED retains the safety of SLMs while improving their reasoning capability on par with existing distillation methods. Furthermore, our ablation study demonstrates the effectiveness of Slow Tuning and Low-Entropy Masking, with the former maintaining the model's safety in the early stage and the latter extending the number of safe training epochs.
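The abstract describes Slow Tuning as scaling down weight changes so the model stays near its initial weight distribution. The paper's exact update rule is not given here; a minimal sketch, assuming the epoch's weight delta is simply shrunk by a factor `alpha` (the function name and linear-interpolation form are illustrative assumptions, not from the paper):

```python
import numpy as np

def slow_tune(w_init, w_tuned, alpha=0.3):
    """Hypothetical Slow Tuning step: keep only a fraction `alpha` of the
    epoch's weight change delta = w_tuned - w_init, so parameters remain in
    the neighborhood of the initial (safety-aligned) weights.
    """
    # alpha = 0 reverts to the initial weights; alpha = 1 is ordinary tuning
    return w_init + alpha * (w_tuned - w_init)
```

Applied after each epoch, this would interpolate between the initial and fully fine-tuned parameters, trading some per-epoch learning speed for staying close to the safety-aligned starting point.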
Key Contributions
- Reveals that CoT distillation on LLM-generated rationales significantly degrades safety alignment in SLMs, a largely unnoticed phenomenon in prior distillation literature.
- Proposes Slow Tuning, which rescales model weights toward the initial distribution after each epoch, keeping parameters within a safe neighborhood of the original weights.
- Proposes Low-Entropy Masking, which excludes high-confidence (low-entropy) tokens from loss computation to avoid overwriting safety-related representations.
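Low-Entropy Masking excludes tokens the model already predicts confidently (low predictive entropy) from the distillation loss. A minimal sketch of this idea, assuming a per-token entropy threshold on the model's own output distribution (the function name, threshold value, and exact masking criterion are illustrative assumptions):

```python
import numpy as np

def masked_nll(logits, targets, entropy_threshold=0.5):
    """Token-level negative log-likelihood that skips low-entropy positions.

    Tokens whose predictive entropy falls below `entropy_threshold` are
    treated as already-confident, unnecessary learning targets and are
    excluded from the loss, so fine-tuning does not overwrite them.
    """
    # Numerically stable softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Shannon entropy of each token's predictive distribution
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    keep = entropy >= entropy_threshold  # mask out low-entropy tokens
    nll = -np.log(probs[np.arange(len(targets)), targets] + 1e-12)
    if not keep.any():
        return 0.0
    return float(nll[keep].mean())
```

Under this sketch, a sharply peaked (confident) prediction contributes nothing to the loss, while uncertain positions are trained on as usual.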