
Causal-Guided Detoxify Backdoor Attack of Open-Weight LoRA Models

Linzhi Chen 1, Yang Sun 2, Hongru Wei 1, Yuqi Chen 1

1 citation · 56 references · arXiv


Published on arXiv (2512.19297)

Model Poisoning

OWASP ML Top 10 — ML10

Transfer Learning Attack

OWASP ML Top 10 — ML07

Key Finding

CBA achieves high attack success rates across six LoRA models while reducing false trigger rates by 50–70% over baselines and demonstrating enhanced resistance to state-of-the-art backdoor defenses.

CBA (Causal-Guided Detoxify Backdoor Attack)

Novel technique introduced


Low-Rank Adaptation (LoRA) has emerged as an efficient method for fine-tuning large language models (LLMs) and is widely adopted within the open-source community. However, the decentralized dissemination of LoRA adapters through platforms such as Hugging Face introduces novel security vulnerabilities: malicious adapters can be easily distributed and evade conventional oversight mechanisms. Despite these risks, backdoor attacks targeting LoRA-based fine-tuning remain relatively underexplored. Existing backdoor attack strategies are ill-suited to this setting, as they often rely on inaccessible training data, fail to account for the structural properties unique to LoRA, or suffer from high false trigger rates (FTR), thereby compromising their stealth. To address these challenges, we propose Causal-Guided Detoxify Backdoor Attack (CBA), a novel backdoor attack framework specifically designed for open-weight LoRA models. CBA operates without access to original training data and achieves high stealth through two key innovations: (1) a coverage-guided data generation pipeline that synthesizes task-aligned inputs via behavioral exploration, and (2) a causal-guided detoxification strategy that merges poisoned and clean adapters by preserving task-critical neurons. Unlike prior approaches, CBA enables post-training control over attack intensity through causal influence-based weight allocation, eliminating the need for repeated retraining. Evaluated across six LoRA models, CBA achieves high attack success rates while reducing FTR by 50–70% compared to baseline methods. Furthermore, it demonstrates enhanced resistance to state-of-the-art backdoor defenses, highlighting its stealth and robustness.
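The coverage-guided generation idea in innovation (1) can be illustrated with a toy sketch: mutate seed inputs and keep only those that elicit previously unseen model behavior. Everything here is an assumption for illustration — the `behaviour_bucket` proxy, the `mutate` function, and the vocabulary are hypothetical stand-ins for querying the actual open-weight model, not the paper's pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the target model: maps an input string to a
# discrete "behaviour bucket" (e.g. a cluster of its output distribution).
# A real pipeline would query the LoRA model; this proxy is illustrative.
def behaviour_bucket(text, n_buckets=16):
    return sum(map(ord, text)) % n_buckets

def coverage_guided_generate(seeds, mutate, budget=200, n_buckets=16):
    """Keep a mutated input only when it reaches an unseen behaviour bucket."""
    covered, kept = set(), []
    pool = list(seeds)
    for _ in range(budget):
        parent = pool[rng.integers(len(pool))]
        child = mutate(parent)
        b = behaviour_bucket(child, n_buckets)
        if b not in covered:          # new behaviour -> coverage increased
            covered.add(b)
            kept.append(child)
            pool.append(child)        # promising input seeds further mutation
    return kept, covered

# Hypothetical mutation operator: append a random task-like word.
VOCAB = ["review", "translate", "summarize", "explain", "classify"]
def mutate(s):
    return s + " " + VOCAB[rng.integers(len(VOCAB))]

kept, covered = coverage_guided_generate(["the movie was"], mutate)
```

Each retained input accounts for exactly one new behavior bucket, so the kept set stays small and task-aligned while behavioral coverage grows.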


Key Contributions

  • Coverage-guided data generation pipeline that synthesizes task-aligned poisoned inputs without access to the original training data
  • Causal-guided detoxification strategy that merges poisoned and clean LoRA adapters by preserving task-critical neurons, reducing false trigger rates by 50–70%
  • Post-training control over attack intensity via causal influence-based weight allocation, eliminating the need for repeated retraining
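The causal influence-based merge described above can be sketched as a per-neuron interpolation between clean and poisoned LoRA deltas, where high-influence (task-critical) neurons stay close to the clean adapter. The influence proxy (row norms of the clean delta) and the knob name `lam` are assumptions for illustration, not the paper's actual estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and LoRA rank (toy values)

# Toy LoRA weight deltas: delta_W = B @ A, each of shape (d, d).
A_clean, B_clean = rng.normal(size=(r, d)), rng.normal(size=(d, r))
A_poison, B_poison = rng.normal(size=(r, d)), rng.normal(size=(d, r))
dW_clean = B_clean @ A_clean
dW_poison = B_poison @ A_poison

# Hypothetical causal-influence score per output neuron: here proxied by
# the row norm of the clean delta (real estimation would ablate neurons
# and measure the effect on clean-task behavior).
influence = np.linalg.norm(dW_clean, axis=1)
influence = influence / influence.max()

# Attack-intensity knob lam: per-neuron weight on the poisoned delta,
# suppressed on task-critical (high-influence) neurons. Adjusting lam
# after training changes intensity without any retraining.
lam = 0.8
w_poison = lam * (1.0 - influence)                     # shape (d,)
dW_merged = ((1.0 - w_poison)[:, None] * dW_clean
             + w_poison[:, None] * dW_poison)
```

Note that the most task-critical neuron (influence 1.0) gets zero poison weight, so its clean behavior is preserved exactly; lowering `lam` detoxifies the merged adapter further.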

🛡️ Threat Analysis

Transfer Learning Attack

The attack explicitly targets and exploits the LoRA transfer learning process: it merges poisoned and clean adapters, leverages structural properties unique to LoRA, and is specifically designed around adapter-based fine-tuning — matching ML07's 'Adapter/LoRA trojans' subcategory.

Model Poisoning

CBA is a backdoor/trojan attack that embeds hidden targeted malicious behavior in LoRA adapters, activating only on specific triggers while maintaining normal task performance otherwise — the canonical ML10 threat.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, targeted, digital, white_box
Applications
llm fine-tuning, lora adapter deployment, open-source model hubs