defense 2025

AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs

Debdeep Sanyal, Manodeep Ray, Murari Mandal

Published on arXiv: 2509.08000

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

AntiDote achieves up to 27.4% greater robustness against adversarial fine-tuning attacks than tamper-resistance and unlearning baselines, while incurring less than 0.5% capability degradation.

AntiDote

Novel technique introduced


The release of open-weight large language models (LLMs) creates a tension between advancing accessible research and preventing misuse, such as malicious fine-tuning to elicit harmful content. Current safety measures struggle to preserve the general capabilities of the LLM while resisting a determined adversary with full access to the model's weights and architecture, who can use full-parameter fine-tuning to erase existing safeguards. To address this, we introduce AntiDote, a bi-level optimization procedure for training LLMs to be resistant to such tampering. AntiDote involves an auxiliary adversary hypernetwork that learns to generate malicious Low-Rank Adaptation (LoRA) weights conditioned on the defender model's internal activations. The defender LLM is then trained with an objective to nullify the effect of these adversarial weight additions, forcing it to maintain its safety alignment. We validate this approach against a diverse suite of 52 red-teaming attacks, including jailbreak prompting, latent space manipulation, and direct weight-space attacks. AntiDote is up to 27.4% more robust against adversarial attacks compared to both tamper-resistance and unlearning baselines. Crucially, this robustness is achieved with a minimal trade-off in utility, incurring a performance degradation of less than 0.5% across capability benchmarks including MMLU, HellaSwag, and GSM8K. Our work offers a practical and compute-efficient methodology for building open-weight models where safety is a more integral and resilient property.


Key Contributions

  • AntiDote: a bi-level optimization procedure that trains LLMs to resist malicious fine-tuning by adversarially simulating LoRA-based tampering attacks during safety training
  • Auxiliary adversary hypernetwork that generates malicious LoRA weight perturbations conditioned on the defender model's internal activations, enabling realistic worst-case simulation
  • Demonstrated up to 27.4% robustness improvement over tamper-resistance and unlearning baselines across 52 red-teaming attacks, with less than 0.5% utility degradation on MMLU, HellaSwag, and GSM8K
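The bi-level structure described above can be illustrated with a deliberately tiny NumPy sketch. Everything here is an assumption for illustration, not the paper's implementation: a single linear layer stands in for the defender LLM, two stand-in activation vectors represent "safe" and "harmful" behavior, quadratic surrogates stand in for the real losses, and the adversary's LoRA factors `A`/`B` are trained directly rather than produced by a hypernetwork. The inner step tampers with the model via a low-rank addition; the outer step updates the full defender weights so the tampered model still suppresses the harmful signal while preserving benign behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                                 # toy hidden size and LoRA rank
W = rng.normal(size=(d, d)) * 0.1           # "defender" weight (one linear layer)
x_safe = rng.normal(size=(d,))              # stand-in activation for benign behavior
x_harm = rng.normal(size=(d,))              # stand-in activation for harmful behavior
y_safe_ref = W @ x_safe                     # benign behavior to preserve
A = rng.normal(size=(r, d)) * 0.01          # adversary's LoRA factors (trained
B = rng.normal(size=(d, r)) * 0.01          # directly, in place of a hypernetwork)
lr_adv, lr_def = 0.05, 0.02

def grad_sq(M, x):
    """Gradient of sum((M @ x)**2) with respect to M."""
    return 2.0 * np.outer(M @ x, x)

harm_before = np.linalg.norm((W + B @ A) @ x_harm)

for _ in range(200):
    # Inner step: the adversary tampers via a low-rank update B @ A,
    # ascending on the harmful signal of the tampered weights.
    g = grad_sq(W + B @ A, x_harm)
    B += lr_adv * g @ A.T
    A += lr_adv * B.T @ g
    # Outer step: the defender updates its *full* weights so that, even
    # after the adversarial LoRA addition, the harmful output is
    # suppressed and benign behavior stays close to the reference.
    Wp = W + B @ A
    g_harm = grad_sq(Wp, x_harm)
    g_safe = 2.0 * np.outer(Wp @ x_safe - y_safe_ref, x_safe)
    W -= lr_def * (g_harm + g_safe)

harm_after = np.linalg.norm((W + B @ A) @ x_harm)
```

In this toy setting the defender's full-rank outer update dominates the adversary's low-rank inner update, so the harmful response of the *tampered* model shrinks over training; that is the intuition behind training safety to survive weight additions rather than only clean weights.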

🛡️ Threat Analysis

Transfer Learning Attack

The primary threat model is an adversary who exploits fine-tuning (LoRA or full-parameter) to erase safety guardrails in open-weight LLMs. AntiDote defends against this by simulating malicious LoRA weight modifications via a hypernetwork adversary during training — squarely a transfer learning / fine-tuning exploitation defense.
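To make the threat concrete, here is a minimal sketch (shapes and scales chosen purely for illustration, not taken from the paper) of why LoRA tampering is so cheap for an adversary with the released weights: the attacker trains only the two small factors, yet the resulting update perturbs the entire weight matrix that gets redistributed.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 16, 4
W = rng.normal(size=(d, d)) / np.sqrt(d)   # a frozen open-weight matrix (toy stand-in)
B = rng.normal(size=(d, r)) * 0.1          # attacker-trained LoRA factors:
A = rng.normal(size=(r, d)) * 0.1          # only 2*d*r = 128 params vs d*d = 256

W_tampered = W + B @ A                     # the full weights the adversary ships

# The update is rank-r, hence cheap to optimize, yet it shifts every
# output coordinate of the layer -- which is why fine-tuning can erase
# guardrails without touching most parameters.
rank = np.linalg.matrix_rank(B @ A)
```

Because the tampered matrix is indistinguishable in format from the original, a post-hoc filter cannot help; the resistance has to be baked into the weights at training time, which is the setting AntiDote targets.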


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, training_time
Datasets
MMLU, HellaSwag, GSM8K
Applications
open-weight llm safety, safety alignment robustness