
Merge Now, Regret Later: The Hidden Cost of Model Merging is Adversarial Transferability

Ankit Gangwal, Aaryan Ajay Sharma

1 citation · 71 references · arXiv


Published on arXiv · 2509.23689

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Merged models are highly vulnerable to black-box transfer attacks, with over 95% relative attack success rate; paradoxically, stronger model merging methods increase this vulnerability.


Model Merging (MM) has emerged as a promising alternative to multi-task learning, where multiple fine-tuned models are combined, without access to tasks' training data, into a single model that maintains performance across tasks. Recent works have explored the impact of MM on adversarial attacks, particularly backdoor attacks. However, none of them have sufficiently explored its impact on transfer attacks using adversarial examples, i.e., a black-box adversarial attack where examples generated for a surrogate model successfully mislead a target model. In this work, we study the effect of MM on the transferability of adversarial examples. We perform comprehensive evaluations and statistical analysis consisting of 8 MM methods, 7 datasets, and 6 attack methods, sweeping over 336 distinct attack settings. Through it, we first challenge the prevailing notion of MM conferring free adversarial robustness, and show MM cannot reliably defend against transfer attacks, with over 95% relative transfer attack success rate. Moreover, we reveal 3 key insights for machine-learning practitioners regarding MM and transferability for a robust system design: (1) stronger MM methods increase vulnerability to transfer attacks; (2) mitigating representation bias increases vulnerability to transfer attacks; and (3) weight averaging, despite being the weakest MM method, is the most vulnerable MM method to transfer attacks. Finally, we analyze the underlying reasons for this increased vulnerability, and provide potential solutions to the problem. Our findings offer critical insights for designing more secure systems employing MM.
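The abstract singles out weight averaging as the weakest yet most transfer-vulnerable MM method. A minimal sketch of what weight averaging does, assuming a dict-of-arrays parameter representation (the function name and toy models are illustrative, not from the paper):

```python
# Sketch of weight averaging, the simplest model-merging method discussed
# in the paper: element-wise mean of the fine-tuned models' parameters.
# The dict-of-arrays representation and toy values are assumptions.
from typing import Dict, List
import numpy as np

def average_weights(models: List[Dict[str, np.ndarray]]) -> Dict[str, np.ndarray]:
    """Merge fine-tuned models sharing an architecture by averaging each parameter."""
    return {
        name: np.mean([m[name] for m in models], axis=0)
        for name in models[0]
    }

# Two toy "fine-tuned" models sharing one parameter tensor
m1 = {"w": np.array([1.0, 2.0])}
m2 = {"w": np.array([3.0, 4.0])}
print(average_weights([m1, m2])["w"])  # → [2. 3.]
```

Stronger MM methods (e.g., task-arithmetic or interference-resolving variants) replace the plain mean with weighted or pruned combinations; per the paper's findings, that added strength comes with increased transfer-attack vulnerability.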


Key Contributions

  • Challenges the prevailing assumption that model merging provides free adversarial robustness, demonstrating >95% relative transfer attack success rate against merged models
  • Statistically validates three key findings across 336 settings (8 MM methods × 7 datasets × 6 attacks): stronger MM methods increase transfer vulnerability, reducing representation bias increases vulnerability, and weight averaging is most vulnerable despite being the weakest MM method
  • Analyzes the underlying mechanisms driving increased vulnerability in merged models and proposes potential mitigations for robust system design

🛡️ Threat Analysis

Input Manipulation Attack

The paper centers on adversarial example transferability — black-box transfer attacks where adversarial inputs crafted on a surrogate model successfully mislead merged target models at inference time, evaluated across 6 attack methods and 8 merging strategies.
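To make the threat model concrete, here is a toy FGSM-style transfer attack on two linear classifiers. Everything here (the linear models, epsilon, the input) is an illustrative assumption; the paper evaluates 6 stronger attack methods against vision models:

```python
# Illustrative black-box transfer attack: craft an adversarial example on a
# surrogate model with FGSM, then check it also fools a separate target model.
# All models and values are toy assumptions for demonstration only.
import numpy as np

def fgsm(x: np.ndarray, w: np.ndarray, b: float, y: int, eps: float) -> np.ndarray:
    """One FGSM step against a linear logistic classifier sign(w @ x + b)."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))   # predicted probability of class 1
    grad = (p - y) * w                        # d(cross-entropy loss)/dx
    return x + eps * np.sign(grad)            # perturb to increase the loss

w_surrogate = np.array([1.0, -1.0])   # attacker's white-box surrogate
w_target = np.array([0.9, -1.1])      # black-box target with a similar boundary
x, y = np.array([0.2, -0.2]), 1       # correctly classified as positive by both

x_adv = fgsm(x, w_surrogate, 0.0, y, eps=0.5)
# The example crafted on the surrogate transfers: the target now misclassifies it.
print(w_target @ x_adv < 0)  # → True
```

The paper's central observation is that merged models sit in regions of weight space well aligned with their constituent fine-tuned models, so examples crafted on a surrogate transfer to them at over 95% relative success rate.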


Details

Domains
vision
Model Types
transformer · cnn
Threat Tags
black_box · inference_time · digital
Datasets
7 image classification datasets (unspecified in excerpt)
Applications
image classification · multi-task learning · MLaaS