
Fewer Weights, More Problems: A Practical Attack on LLM Pruning

Kazuki Egashira, Robin Staab, Thibaud Gloaguen, Mark Vero, Martin Vechev

3 citations · 45 references · arXiv


Published on arXiv · 2510.07985

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

After any of the standard pruning methods exposed by vLLM (Magnitude, Wanda, SparseGPT) is applied, the malicious model achieves attack success rates of up to 95.7% for jailbreak, 98.7% for benign instruction refusal, and 99.5% for targeted content injection across the five evaluated models.

Pruning-Activated Attack

Novel technique introduced


Model pruning, i.e., removing a subset of model weights, has become a prominent approach to reducing the memory footprint of large language models (LLMs) during inference. Notably, popular inference engines, such as vLLM, enable users to conveniently prune downloaded models before they are deployed. While the utility and efficiency of pruning methods have improved significantly, the security implications of pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that appears benign yet, once pruned, exhibits malicious behaviors. Our method is based on the idea that the adversary can compute a proxy metric that estimates how likely each parameter is to be pruned. With this information, the adversary can first inject a malicious behavior into those parameters that are unlikely to be pruned. Then, they can repair the model by using parameters that are likely to be pruned, effectively canceling out the injected behavior in the unpruned model. We demonstrate the severity of our attack through extensive evaluation on five models; after any of the pruning methods available in vLLM (Magnitude, Wanda, and SparseGPT) is applied, the model consistently exhibits strong malicious behaviors in a diverse set of attack scenarios (success rates of up to $95.7\%$ for jailbreak, $98.7\%$ for benign instruction refusal, and $99.5\%$ for targeted content injection). Our results reveal a critical deployment-time security gap and underscore the urgent need for stronger security awareness in model compression.
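The proxy metric at the core of the attack is easiest to see for magnitude pruning, where the pruned set is fully determined by the weights themselves. A minimal NumPy sketch (illustrative, not the paper's code) of the survives-pruning mask an adversary could compute in advance:

```python
import numpy as np

def magnitude_keep_mask(w: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """True where a weight survives magnitude pruning at the given sparsity.

    For magnitude pruning the adversary's 'proxy metric' is exact: the
    smallest `sparsity` fraction of |w| is removed, so the surviving
    positions are known before the victim ever runs the pruner.
    """
    k = int(sparsity * w.size)
    threshold = np.partition(np.abs(w).ravel(), k)[k]
    return np.abs(w) >= threshold

w = np.array([0.01, -2.0, 0.5, -0.02, 1.5, 0.03])
mask = magnitude_keep_mask(w)  # keeps -2.0, 0.5, 1.5; drops the three smallest
```

For Wanda and SparseGPT the pruning decision also depends on calibration activations, so the mask is only an estimate there; magnitude pruning is the cleanest case.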


Key Contributions

  • First demonstration that common LLM pruning methods (Magnitude, Wanda, SparseGPT in vLLM) can be maliciously exploited via pruning-activated backdoors
  • Novel technique that estimates per-parameter pruning probability to inject malicious behavior into weights unlikely to be pruned, then cancels its visible effect using prunable weights that disappear after compression
  • Extensive evaluation on five LLMs showing attack success rates of up to 95.7% jailbreak, 98.7% instruction refusal, and 99.5% targeted content injection across all tested pruning methods

🛡️ Threat Analysis

Model Poisoning

Embeds hidden malicious behaviors (jailbreak, instruction refusal, content injection) into model weights that activate only when pruning is applied; the pruning operation itself acts as the trigger in this backdoor/trojan attack. The adversary computes a proxy metric that estimates each parameter's pruning probability, injects the malicious behavior into weights unlikely to be pruned, then masks it using prunable weights that are later removed. This is a novel backdoor injection technique; the supply-chain context is motivation, not the primary contribution.
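The inject-then-repair idea can be illustrated on a single weight matrix. The toy below (all names such as `W_benign` and `W_mal` are illustrative, not the paper's, and the repair is fit only for one probe input) places the malicious weights on positions known to survive magnitude pruning, then adds a correction on the prunable positions so the unpruned matrix reproduces the benign behavior; zeroing the prunable positions leaves only the malicious part:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W_benign = rng.normal(size=(n, n))   # behavior shown before pruning
W_mal = rng.normal(size=(n, n))      # behavior desired after pruning

# Step 1: keep-mask from the pruning proxy (per-row magnitude pruning,
# so every row keeps half its entries and the rest are prunable).
keep = np.abs(W_mal) >= np.quantile(np.abs(W_mal), 0.5, axis=1, keepdims=True)

# Step 2: inject the malicious behavior into the surviving positions.
W = W_mal * keep

# Step 3: repair -- put a correction on the prunable positions so that,
# for a probe input x, the unpruned matrix matches the benign one.
# (The real attack must also keep the repair weights below the pruning
# threshold; this toy simply applies the known mask.)
x = rng.normal(size=n)
d = (W_benign - W) @ x               # per-row discrepancy to cancel
for i in range(n):
    xp = x * (~keep[i])              # probe restricted to prunable columns
    W[i] += d[i] * xp / (xp @ xp)    # least-norm repair on prunable slots

# Unpruned model matches the benign behavior on the probe input;
# pruning (zeroing the prunable positions) reveals the malicious part.
```

The repair works because each row's added term contributes exactly `d[i]` to `W[i] @ x` while leaving the kept positions untouched, so deleting the prunable weights removes the cancellation and activates the backdoor.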


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
white_box · training_time · targeted · digital
Applications
llm inference · model compression · llm deployment pipelines