
Adversarial Contrastive Learning for LLM Quantization Attacks

Dinghong Song 1, Zhiwei Xu 2, Hai Wan 2, Xibin Zhao 2, Pengfei Su 1, Dong Li

1 citation · 42 references · arXiv


Published on arXiv

2601.02680

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

ACL achieves jailbreak ASR of 97.69%, over-refusal ASR of 86.00%, and ad-injection ASR of 92.40%, surpassing prior SOTA by up to 50.80% across three attack scenarios on four LLMs.

Adversarial Contrastive Learning (ACL)

Novel technique introduced


Model quantization is critical for deploying large language models (LLMs) on resource-constrained hardware, yet recent work has revealed a severe security risk: LLMs that are benign in full precision may exhibit malicious behaviors after quantization. In this paper, we propose Adversarial Contrastive Learning (ACL), a novel gradient-based quantization attack that achieves superior attack effectiveness by explicitly maximizing the gap between the probabilities of benign and harmful responses. ACL formulates the attack objective as a triplet-based contrastive loss and integrates it into a two-stage distributed fine-tuning strategy with projected gradient descent (PGD) to ensure stable and efficient optimization. Extensive experiments demonstrate ACL's remarkable effectiveness, achieving attack success rates of 86.00% for over-refusal, 97.69% for jailbreak, and 92.40% for advertisement injection, substantially outperforming state-of-the-art methods by up to 44.67%, 18.84%, and 50.80%, respectively.
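The abstract describes a triplet-style contrastive objective that widens the gap between benign and harmful response probabilities. A minimal sketch of such a margin-based loss over response log-probabilities is shown below; the function name, margin value, and hinge form are illustrative assumptions, not the paper's exact formulation.

```python
def triplet_contrastive_loss(logp_harmful: float, logp_benign: float,
                             margin: float = 1.0) -> float:
    """Hinge-style triplet loss (illustrative sketch, not the paper's exact loss).

    Pushes the model's log-probability of the attacker-chosen harmful
    response above that of the benign response by at least `margin`.
    Returns 0 once the gap exceeds the margin.
    """
    return max(0.0, margin - (logp_harmful - logp_benign))


# When the harmful response is already much more likely, the loss vanishes:
print(triplet_contrastive_loss(-1.0, -3.0))  # gap of 2.0 > margin -> 0.0
# When the benign response dominates, the loss grows with the gap:
print(triplet_contrastive_loss(-3.0, -1.0))  # 1.0 - (-2.0) -> 3.0
```

In practice the log-probabilities would come from the LLM's per-token likelihoods over full responses, and the loss would be minimized by gradient descent on the model weights.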


Key Contributions

  • Triplet-based adversarial contrastive loss that explicitly maximizes the probability gap between benign and harmful LLM responses, improving attack effectiveness over prior fine-tuning objectives
  • Two-stage distributed fine-tuning pipeline: FSDP-based harmful behavior injection, followed by PGD with a synchronized AllGather–Clamp–Scatter (ACS) mechanism that removes the harmful behavior at full precision while preserving it under quantization, scaling across multiple devices
  • Empirical demonstration of 86.00% over-refusal, 97.69% jailbreak, and 92.40% advertisement-injection attack success rates on four LLMs, outperforming SOTA by up to 50.80%

🛡️ Threat Analysis

Model Poisoning

ACL injects hidden malicious behaviors (jailbreak, over-refusal, ad injection) into LLM weights via fine-tuning; these behaviors lie dormant in full precision and are activated by the quantization "trigger" — a textbook backdoor/trojan attack. Because the paper's primary contribution is the backdoor injection technique itself (ACL fine-tuning with a triplet contrastive loss plus PGD-ACS), rather than a supply-chain compromise method, ML10 (Model Poisoning) is the correct single tag, even though distribution via HuggingFace is the stated threat scenario.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, training_time, targeted, digital
Applications
llm quantization, resource-constrained llm deployment