
Weight space Detection of Backdoors in LoRA Adapters

David Puertolas Merenciano 1, Ekaterina Vasyagina 1, Raghav Dixit 1, Kevin Zhu 1, Ruizhe Li 2, Javier Ferrando, Maheep Chaudhary 3



Published on arXiv · 2602.15195

Model Poisoning

OWASP ML Top 10 — ML10

AI Supply Chain Attacks

OWASP ML Top 10 — ML06

Key Finding

Achieves 97% backdoor detection accuracy with under 2% false positives on 500 LoRA adapters by analyzing singular value concentration patterns in weight matrices, with no model execution required.


LoRA adapters let users fine-tune large language models (LLMs) efficiently, but because they are shared through open repositories such as the Hugging Face Hub \citep{huggingface_hub_docs}, they are vulnerable to backdoor attacks. Current detection methods require running the model on test inputs, which makes them impractical for screening thousands of adapters whose backdoor triggers are unknown. We detect poisoned adapters by analyzing their weight matrices directly, without running the model, making our method data-agnostic. The method extracts simple spectral statistics (how concentrated the singular values are, their entropy, and the shape of their distribution) and flags adapters that deviate from normal patterns. We evaluate on 500 LoRA adapters (400 clean, 100 poisoned) for Llama-3.2-3B across instruction and reasoning datasets: Alpaca, Dolly, GSM8K, ARC-Challenge, SQuADv2, NaturalQuestions, HumanEval, and GLUE. We achieve 97\% detection accuracy with less than 2\% false positives.
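The spectral statistics named in the abstract can be sketched for a single LoRA update matrix ΔW = BA. This is an illustrative reconstruction, not the authors' code; the function and field names are ours, and the random rank-8 factors only stand in for a real adapter's weights.

```python
import numpy as np

def spectral_stats(delta_w: np.ndarray, top_k: int = 5) -> dict:
    """Spectral statistics of a LoRA update matrix (illustrative sketch)."""
    # Singular values in descending order; no need for U and V here.
    s = np.linalg.svd(delta_w, compute_uv=False)
    energy = s ** 2
    p = energy / energy.sum()  # normalized spectral energy distribution
    return {
        "sigma_1": float(s[0]),                            # leading singular value
        "frobenius": float(np.sqrt(energy.sum())),         # Frobenius norm of ΔW
        "energy_top_k": float(p[:top_k].sum()),            # energy concentration in top-k
        "entropy": float(-(p * np.log(p + 1e-12)).sum()),  # spectral entropy
        # Excess-free kurtosis of the singular-value spectrum (distribution shape):
        "kurtosis": float(((s - s.mean()) ** 4).mean() / (s.var() ** 2 + 1e-12)),
    }

# Example: simulate a rank-8 LoRA update for a 256x256 layer.
rng = np.random.default_rng(0)
B = rng.normal(size=(256, 8))
A = rng.normal(size=(8, 256))
stats = spectral_stats(B @ A)
```

A backdoored adapter that encodes a trigger in a few dominant directions would, on this view, show higher energy concentration and lower spectral entropy than a clean fine-tune; the paper's detector flags such deviations.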


Key Contributions

  • Data-agnostic backdoor detector that analyzes LoRA weight matrices via SVD spectral statistics (leading singular value, Frobenius norm, energy concentration, spectral entropy, kurtosis) without running the model
  • Benchmark of 500 LoRA adapters (400 clean, 100 poisoned with rare-token and contextual triggers) for Llama-3.2-3B across 8 instruction/reasoning datasets
  • Logistic regression classifier over z-score-normalized spectral metrics achieving 97% detection accuracy with under 2% false positives, enabling practical hub-scale screening
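The classifier described in the last bullet (logistic regression over z-score-normalized spectral metrics) can be sketched as below. The data here is synthetic: we assume, purely for illustration, that poisoning shifts the five metrics; the real features would come from SVD statistics of actual adapters, and the 400/100 split mirrors the paper's benchmark sizes.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic stand-ins for the five spectral metrics per adapter (NOT the paper's data):
# assume poisoning shifts each metric's mean by 1.5 standard deviations.
clean = rng.normal(loc=0.0, scale=1.0, size=(400, 5))
poisoned = rng.normal(loc=1.5, scale=1.0, size=(100, 5))
X = np.vstack([clean, poisoned])
y = np.array([0] * 400 + [1] * 100)  # 0 = clean, 1 = poisoned

# StandardScaler performs the z-score normalization; LogisticRegression is the detector.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # per-adapter poisoning probability
```

Because the pipeline needs only the five scalar features per adapter, scoring a hub-scale collection reduces to one SVD per adapter matrix plus a dot product, with no forward passes through the model.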

🛡️ Threat Analysis

AI Supply Chain Attacks

The explicit threat scenario is poisoned adapters distributed through open model hubs such as the Hugging Face Hub, making the supply chain itself the attack vector; the paper's hub-scale screening motivation targets this compromise scenario directly.

Model Poisoning

The primary contribution is detecting backdoor/trojan injections in LoRA adapter weight matrices, the canonical ML10 threat of hidden, trigger-activated malicious behavior embedded in model parameters.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, targeted
Datasets
Alpaca, Dolly, GSM8K, ARC-Challenge, SQuADv2, NaturalQuestions, HumanEval, GLUE
Applications
llm fine-tuning, model hub screening, lora adapter distribution