
Weight space Detection of Backdoors in LoRA Adapters

David Puertolas Merenciano 1, Ekaterina Vasyagina 1, Raghav Dixit 1, Kevin Zhu 1, Ruizhe Li 2, Javier Ferrando, Maheep Chaudhary 3



Published on arXiv · 2602.15195

Model Poisoning

OWASP ML Top 10 — ML10

AI Supply Chain Attacks

OWASP ML Top 10 — ML06

Key Finding

Achieves 97% backdoor detection accuracy with under 2% false positives on 500 LoRA adapters by analyzing singular value concentration patterns in weight matrices, with no model execution required.


LoRA adapters let users fine-tune large language models (LLMs) efficiently, but because they are shared through open repositories such as the Hugging Face Hub \citep{huggingface_hub_docs}, they are vulnerable to backdoor attacks. Current detection methods require running the model on test inputs, which makes them impractical for screening thousands of adapters whose backdoor triggers are unknown. We detect poisoned adapters by analyzing their weight matrices directly, without running the model, making our method data-agnostic. The method extracts simple spectral statistics (how concentrated the singular values are, their entropy, and the shape of their distribution) and flags adapters that deviate from normal patterns. We evaluate on 500 LoRA adapters (400 clean, 100 poisoned) for Llama-3.2-3B across instruction and reasoning datasets: Alpaca, Dolly, GSM8K, ARC-Challenge, SQuADv2, NaturalQuestions, HumanEval, and GLUE. We achieve 97\% detection accuracy with less than 2\% false positives.
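The spectral statistics named in the abstract can be sketched for a single LoRA update matrix ΔW = BA. This is an illustrative reconstruction, not the authors' code; the function and field names are ours, and the random rank-8 factors only stand in for a real adapter's weights.

```python
import numpy as np

def spectral_stats(delta_w: np.ndarray, top_k: int = 5) -> dict:
    """Spectral statistics of a LoRA update matrix (illustrative sketch)."""
    # Singular values in descending order; no need for U and V here.
    s = np.linalg.svd(delta_w, compute_uv=False)
    energy = s ** 2
    p = energy / energy.sum()  # normalized spectral energy distribution
    return {
        "sigma_1": float(s[0]),                            # leading singular value
        "frobenius": float(np.sqrt(energy.sum())),         # Frobenius norm of ΔW
        "energy_top_k": float(p[:top_k].sum()),            # energy concentration in top-k
        "entropy": float(-(p * np.log(p + 1e-12)).sum()),  # spectral entropy
        # Excess-free kurtosis of the singular-value spectrum (distribution shape):
        "kurtosis": float(((s - s.mean()) ** 4).mean() / (s.var() ** 2 + 1e-12)),
    }

# Example: simulate a rank-8 LoRA update for a 256x256 layer.
rng = np.random.default_rng(0)
B = rng.normal(size=(256, 8))
A = rng.normal(size=(8, 256))
stats = spectral_stats(B @ A)
```

A backdoored adapter that encodes a trigger in a few dominant directions would, on this view, show higher energy concentration and lower spectral entropy than a clean fine-tune; the paper's detector flags such deviations.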


Key Contributions

  • Data-agnostic backdoor detector that analyzes LoRA weight matrices via SVD spectral statistics (leading singular value, Frobenius norm, energy concentration, spectral entropy, kurtosis) without running the model
  • Benchmark of 500 LoRA adapters (400 clean, 100 poisoned with rare-token and contextual triggers) for Llama-3.2-3B across 8 instruction/reasoning datasets
  • Logistic regression classifier over z-score-normalized spectral metrics achieving 97% detection accuracy with under 2% false positives, enabling practical hub-scale screening
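The classifier described in the last bullet (logistic regression over z-score-normalized spectral metrics) can be sketched as below. The data here is synthetic: we assume, purely for illustration, that poisoning shifts the five metrics; the real features would come from SVD statistics of actual adapters, and the 400/100 split mirrors the paper's benchmark sizes.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic stand-ins for the five spectral metrics per adapter (NOT the paper's data):
# assume poisoning shifts each metric's mean by 1.5 standard deviations.
clean = rng.normal(loc=0.0, scale=1.0, size=(400, 5))
poisoned = rng.normal(loc=1.5, scale=1.0, size=(100, 5))
X = np.vstack([clean, poisoned])
y = np.array([0] * 400 + [1] * 100)  # 0 = clean, 1 = poisoned

# StandardScaler performs the z-score normalization; LogisticRegression is the detector.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # per-adapter poisoning probability
```

Because the pipeline needs only the five scalar features per adapter, scoring a hub-scale collection reduces to one SVD per adapter matrix plus a dot product, with no forward passes through the model.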

🛡️ Threat Analysis

AI Supply Chain Attacks

The explicit threat scenario is poisoned adapters distributed through open model hubs such as the Hugging Face Hub, making the supply chain itself the attack vector; the paper's hub-scale screening motivation targets this compromise scenario directly.

Model Poisoning

The primary contribution is detecting backdoor/trojan injections in LoRA adapter weight matrices, the canonical ML10 threat of hidden, trigger-activated malicious behavior embedded in model parameters.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, targeted
Datasets
Alpaca, Dolly, GSM8K, ARC-Challenge, SQuADv2, NaturalQuestions, HumanEval, GLUE
Applications
llm fine-tuning, model hub screening, lora adapter distribution