defense 2026

Why LoRA Fails to Forget: Regularized Low-Rank Adaptation Against Backdoors in Language Models

Hoang-Chau Luong, Lingwei Chen

1 citation · 37 references · arXiv


Published on arXiv · 2601.06305

Model Poisoning

OWASP ML Top 10 — ML10

Transfer Learning Attack

OWASP ML Top 10 — ML07

Key Finding

RoRA substantially reduces backdoor attack success rates across multiple NLP benchmarks and attack settings while maintaining clean task accuracy, outperforming standard LoRA fine-tuning as a backdoor defense.

RoRA (Regularized Low-Rank Adaptation)

Novel technique introduced


Low-Rank Adaptation (LoRA) is widely used for parameter-efficient fine-tuning of large language models, but it is notably ineffective at removing backdoor behaviors from poisoned pretrained models when fine-tuning on a clean dataset. Contrary to the common belief that this weakness is caused primarily by low rank, we show that LoRA's vulnerability is fundamentally spectral. Our analysis identifies two key factors: LoRA updates (i) possess insufficient spectral strength, with singular values far below those of the pretrained weights, and (ii) exhibit unfavorable spectral alignment, weakly matching clean-task directions while retaining overlap with trigger-sensitive subspaces. We further establish a critical scaling threshold beyond which LoRA can theoretically suppress trigger-induced activations, and we show empirically that standard LoRA rarely reaches this regime. We introduce Regularized Low-Rank Adaptation (RoRA), which improves forgetting by increasing spectral strength and correcting alignment through clean-strengthened regularization, trigger-insensitive constraints, and post-training spectral rescaling. Experiments across multiple NLP benchmarks and attack settings show that RoRA substantially reduces attack success rates while maintaining clean accuracy.
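The "insufficient spectral strength" claim can be made concrete with a small numerical sketch: compute the singular values of a low-rank update ΔW = BA and compare them with those of the frozen weight W. All shapes, scales, and initializations below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a d x d pretrained weight with a rank-r LoRA update (r << d).
d, r = 64, 4
W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))  # stand-in for a pretrained weight
B = rng.normal(scale=0.01, size=(d, r))              # LoRA factors with small init
A = rng.normal(scale=0.01, size=(r, d))
delta_W = B @ A                                      # the low-rank update

# Spectral strength: singular values of the update vs. the pretrained weight.
s_W = np.linalg.svd(W, compute_uv=False)
s_dW = np.linalg.svd(delta_W, compute_uv=False)

print(f"top singular value of W:       {s_W[0]:.4f}")
print(f"top singular value of Delta W: {s_dW[0]:.6f}")
print(f"nonzero singular values of Delta W: {int(np.sum(s_dW > 1e-10))}")
```

With these toy scales, ΔW's spectrum sits orders of magnitude below W's, which is the regime the paper argues leaves trigger-sensitive directions largely untouched.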


Key Contributions

  • Spectral analysis showing LoRA's failure to forget backdoors stems from insufficient spectral strength and unfavorable alignment with trigger-sensitive subspaces, not low rank per se
  • Theoretical proof of a critical scaling threshold above which LoRA updates can suppress trigger-induced activations, with empirical evidence that standard LoRA rarely reaches this threshold
  • RoRA (Regularized Low-Rank Adaptation): a defense combining clean-strengthened regularization, trigger-insensitive orthogonality constraints, and post-training spectral rescaling to substantially reduce attack success rates while preserving clean accuracy
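Of the three RoRA components listed above, post-training spectral rescaling is the simplest to sketch: scale the learned update so its largest singular value reaches a target threshold. The function name, the target `tau`, and this uniform-rescaling rule are assumptions for illustration; the paper's exact rescaling rule may differ.

```python
import numpy as np

def spectral_rescale(B, A, tau):
    """Rescale a low-rank update Delta W = B @ A so that its largest
    singular value equals the target threshold tau.
    Illustrative sketch only; not the paper's exact formula."""
    delta_W = B @ A
    top = np.linalg.svd(delta_W, compute_uv=False)[0]
    if top == 0.0:
        return delta_W  # nothing to rescale
    return (tau / top) * delta_W

rng = np.random.default_rng(1)
B = rng.normal(scale=0.01, size=(32, 4))
A = rng.normal(scale=0.01, size=(4, 32))
scaled = spectral_rescale(B, A, tau=1.0)
print(np.linalg.svd(scaled, compute_uv=False)[0])  # top singular value is now tau
```

Because scaling a matrix scales all of its singular values uniformly, this boosts spectral strength without changing the directions (singular vectors) the update acts on.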

🛡️ Threat Analysis

Transfer Learning Attack

The paper specifically targets the transfer learning scenario: backdoors that survive LoRA/adapter fine-tuning. The threat model, analysis, and defense (RoRA) are all centered on how backdoors persist through the fine-tuning process, which is the core of ML07 — 'backdoors that survive fine-tuning' and 'adapter/LoRA trojans'.


Model Poisoning

The paper's primary focus is on backdoor attacks embedded in pretrained LLMs and why fine-tuning fails to remove trigger-induced behaviors — directly the backdoor/trojan threat. RoRA is proposed as a defense to suppress backdoor activation during LoRA-based fine-tuning.
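The "trigger-insensitive constraints" side of the defense can be sketched as a penalty on how much of the LoRA update lies inside a trigger-sensitive subspace. The subspace estimate, function name, and penalty form below are assumptions for illustration, not the paper's loss.

```python
import numpy as np

def trigger_alignment_penalty(B, A, V_trig):
    """Penalty on the overlap between the LoRA update Delta W = B @ A and a
    trigger-sensitive subspace spanned by the orthonormal columns of V_trig.
    Illustrative sketch: the subspace estimate and penalty form are assumptions."""
    delta_W = B @ A
    proj = delta_W @ V_trig      # component of the update inside the subspace
    return float(np.sum(proj ** 2))  # squared Frobenius norm of that component

rng = np.random.default_rng(2)
d = 8
V_trig = np.eye(d)[:, :2]            # toy trigger subspace: first two coordinates
B = rng.normal(size=(d, 2))
A = np.zeros((2, d))
A[:, 2:] = rng.normal(size=(2, d - 2))  # update built to avoid the subspace
print(trigger_alignment_penalty(B, A, V_trig))  # 0.0: no trigger-subspace overlap
```

Minimizing such a term alongside the clean-task loss pushes the fine-tuned update away from the directions that carry trigger-induced activations, which is the intuition behind the orthogonality constraint.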


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, targeted
Applications
language model fine-tuning, nlp classification