Why LoRA Fails to Forget: Regularized Low-Rank Adaptation Against Backdoors in Language Models
Hoang-Chau Luong, Lingwei Chen
Published on arXiv
2601.06305
Model Poisoning
OWASP ML Top 10 — ML10
Transfer Learning Attack
OWASP ML Top 10 — ML07
Key Finding
RoRA substantially reduces backdoor attack success rates across multiple NLP benchmarks and attack settings while maintaining clean task accuracy, outperforming standard LoRA fine-tuning as a backdoor defense.
RoRA (Regularized Low-Rank Adaptation)
Novel technique introduced
Low-Rank Adaptation (LoRA) is widely used for parameter-efficient fine-tuning of large language models, but it is notably ineffective at removing backdoor behaviors from poisoned pretrained models when fine-tuning on a clean dataset. Contrary to the common belief that this weakness is caused primarily by low rank, we show that LoRA's vulnerability is fundamentally spectral. Our analysis identifies two key factors: LoRA updates (i) possess insufficient spectral strength, with singular values far below those of the pretrained weights, and (ii) exhibit unfavorable spectral alignment, weakly matching clean-task directions while retaining overlap with trigger-sensitive subspaces. We further establish a critical scaling threshold beyond which LoRA can theoretically suppress trigger-induced activations, and we show empirically that standard LoRA rarely reaches this regime. We introduce Regularized Low-Rank Adaptation (RoRA), which improves forgetting by increasing spectral strength and correcting alignment through clean-strengthened regularization, trigger-insensitive constraints, and post-training spectral rescaling. Experiments across multiple NLP benchmarks and attack settings show that RoRA substantially reduces attack success rates while maintaining clean accuracy.
Key Contributions
- Spectral analysis showing LoRA's failure to forget backdoors stems from insufficient spectral strength and unfavorable alignment with trigger-sensitive subspaces, not low rank per se
- Theoretical proof of a critical scaling threshold above which LoRA updates can suppress trigger-induced activations, with empirical evidence that standard LoRA rarely reaches this threshold
- RoRA (Regularized Low-Rank Adaptation): a defense combining clean-strengthened regularization, trigger-insensitive orthogonality constraints, and post-training spectral rescaling to substantially reduce attack success rates while preserving clean accuracy
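Of the three RoRA components listed above, post-training spectral rescaling is the most mechanical to sketch: scale the learned update so its spectrum clears the critical threshold without changing its directions. The function and threshold value below are hypothetical illustrations, not the paper's derived procedure or constants:

```python
import numpy as np

def spectral_rescale(delta_W: np.ndarray, target_top_sv: float) -> np.ndarray:
    """Uniformly rescale a LoRA update so its top singular value reaches a
    target threshold; uniform scaling leaves the singular vectors unchanged."""
    top_sv = np.linalg.svd(delta_W, compute_uv=False)[0]
    if top_sv >= target_top_sv:
        return delta_W  # already strong enough
    return delta_W * (target_top_sv / top_sv)

# Illustration: a weak rank-8 update pushed to a hypothetical threshold of 1.0.
rng = np.random.default_rng(1)
B = rng.standard_normal((64, 8)) * 0.01
A = rng.standard_normal((8, 64)) * 0.01
scaled = spectral_rescale(B @ A, target_top_sv=1.0)
print(np.linalg.svd(scaled, compute_uv=False)[0])  # → 1.0 (up to floating point)
```

A uniform rescale boosts spectral strength but cannot fix alignment, which is presumably why the paper pairs it with the regularization and orthogonality terms applied during training.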
🛡️ Threat Analysis
The paper specifically targets the transfer learning scenario: backdoors that survive LoRA/adapter fine-tuning. The threat model, analysis, and defense (RoRA) are all centered on how backdoors persist through the fine-tuning process, which is the core of ML07 — 'backdoors that survive fine-tuning' and 'adapter/LoRA trojans'.
The paper's primary focus is backdoor attacks embedded in pretrained LLMs and why fine-tuning fails to remove trigger-induced behaviors, which is directly the backdoor/trojan threat of ML10. RoRA is proposed as a defense that suppresses backdoor activation during LoRA-based fine-tuning.