Why LoRA Fails to Forget: Regularized Low-Rank Adaptation Against Backdoors in Language Models
Hoang-Chau Luong, Lingwei Chen
Published on arXiv
2601.06305
Model Poisoning
OWASP ML Top 10 — ML10
Transfer Learning Attack
OWASP ML Top 10 — ML07
Key Finding
RoRA substantially reduces backdoor attack success rates across multiple NLP benchmarks and attack settings while maintaining clean task accuracy, outperforming standard LoRA fine-tuning as a backdoor defense.
RoRA (Regularized Low-Rank Adaptation)
Novel technique introduced
Low-Rank Adaptation (LoRA) is widely used for parameter-efficient fine-tuning of large language models, but it is notably ineffective at removing backdoor behaviors from poisoned pretrained models when fine-tuning on a clean dataset. Contrary to the common belief that this weakness is caused primarily by low rank, we show that LoRA's vulnerability is fundamentally spectral. Our analysis identifies two key factors: LoRA updates (i) possess insufficient spectral strength, with singular values far below those of the pretrained weights, and (ii) exhibit unfavorable spectral alignment, weakly matching clean-task directions while retaining overlap with trigger-sensitive subspaces. We further establish a critical scaling threshold beyond which LoRA can theoretically suppress trigger-induced activations, and we show empirically that standard LoRA rarely reaches this regime. We introduce Regularized Low-Rank Adaptation (RoRA), which improves forgetting by increasing spectral strength and correcting alignment through clean-strengthened regularization, trigger-insensitive constraints, and post-training spectral rescaling. Experiments across multiple NLP benchmarks and attack settings show that RoRA substantially reduces attack success rates while maintaining clean accuracy.
Key Contributions
- Spectral analysis showing LoRA's failure to forget backdoors stems from insufficient spectral strength and unfavorable alignment with trigger-sensitive subspaces, not low rank per se
- Theoretical proof of a critical scaling threshold above which LoRA updates can suppress trigger-induced activations, with empirical evidence that standard LoRA rarely reaches this threshold
- RoRA (Regularized Low-Rank Adaptation): a defense combining clean-strengthened regularization, trigger-insensitive orthogonality constraints, and post-training spectral rescaling to substantially reduce attack success rates while preserving clean accuracy
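Of the three RoRA components listed above, post-training spectral rescaling is the most mechanical to sketch: scale the learned update so its spectrum clears the critical threshold without changing its directions. The function and threshold value below are hypothetical illustrations, not the paper's derived procedure or constants:

```python
import numpy as np

def spectral_rescale(delta_W: np.ndarray, target_top_sv: float) -> np.ndarray:
    """Uniformly rescale a LoRA update so its top singular value reaches a
    target threshold; uniform scaling leaves the singular vectors unchanged."""
    top_sv = np.linalg.svd(delta_W, compute_uv=False)[0]
    if top_sv >= target_top_sv:
        return delta_W  # already strong enough
    return delta_W * (target_top_sv / top_sv)

# Illustration: a weak rank-8 update pushed to a hypothetical threshold of 1.0.
rng = np.random.default_rng(1)
B = rng.standard_normal((64, 8)) * 0.01
A = rng.standard_normal((8, 64)) * 0.01
scaled = spectral_rescale(B @ A, target_top_sv=1.0)
print(np.linalg.svd(scaled, compute_uv=False)[0])  # → 1.0 (up to floating point)
```

A uniform rescale boosts spectral strength but cannot fix alignment, which is presumably why the paper pairs it with the regularization and orthogonality terms applied during training.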
🛡️ Threat Analysis
The paper specifically targets the transfer learning scenario: backdoors that survive LoRA/adapter fine-tuning. The threat model, analysis, and defense (RoRA) are all centered on how backdoors persist through the fine-tuning process, which is the core of ML07 — 'backdoors that survive fine-tuning' and 'adapter/LoRA trojans'.
The paper's primary focus is backdoor attacks embedded in pretrained LLMs and why fine-tuning fails to remove trigger-induced behaviors, which is directly the backdoor/trojan threat of ML10. RoRA is proposed as a defense that suppresses backdoor activation during LoRA-based fine-tuning.