Merge Now, Regret Later: The Hidden Cost of Model Merging is Adversarial Transferability
Ankit Gangwal, Aaryan Ajay Sharma
Published on arXiv (arXiv:2509.23689)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Merged models are highly vulnerable to black-box transfer attacks, with over a 95% relative attack success rate; paradoxically, stronger model merging methods increase this vulnerability.
Model Merging (MM) has emerged as a promising alternative to multi-task learning: multiple fine-tuned models are combined, without access to the tasks' training data, into a single model that maintains performance across tasks. Recent works have explored the impact of MM on adversarial attacks, particularly backdoor attacks. However, none of them have sufficiently explored its impact on transfer attacks using adversarial examples, i.e., black-box adversarial attacks in which examples generated for a surrogate model successfully mislead a target model. In this work, we study the effect of MM on the transferability of adversarial examples. We perform comprehensive evaluations and statistical analysis covering 8 MM methods, 7 datasets, and 6 attack methods, sweeping over 336 distinct attack settings. Through this analysis, we first challenge the prevailing notion that MM confers free adversarial robustness, and show that MM cannot reliably defend against transfer attacks, with relative transfer attack success rates exceeding 95%. Moreover, we reveal 3 key insights for machine-learning practitioners regarding MM and transferability in robust system design: (1) stronger MM methods increase vulnerability to transfer attacks; (2) mitigating representation bias increases vulnerability to transfer attacks; and (3) weight averaging, despite being the weakest MM method, is the most vulnerable MM method to transfer attacks. Finally, we analyze the underlying reasons for this increased vulnerability and provide potential solutions to the problem. Our findings offer critical insights for designing more secure systems employing MM.
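To make the merging operation concrete, below is a minimal, hedged sketch of weight averaging, the simplest MM method the paper evaluates (and, per its finding (3), the most transfer-vulnerable). Models are represented here as plain dicts of parameter lists for illustration; real merging would operate on framework tensors (e.g., PyTorch state dicts), and the parameter names and values are hypothetical.

```python
def weight_average(models):
    """Merge fine-tuned models of one architecture by element-wise
    averaging their parameters (simple weight averaging)."""
    assert models, "need at least one model to merge"
    merged = {}
    for name in models[0]:
        # Collect this parameter from every fine-tuned model, then
        # average the corresponding entries element-wise.
        params = [m[name] for m in models]
        merged[name] = [sum(vals) / len(vals) for vals in zip(*params)]
    return merged

# Two hypothetical task-specific fine-tunes of the same base model.
task_a = {"layer.w": [1.0, 2.0], "layer.b": [0.0]}
task_b = {"layer.w": [3.0, 4.0], "layer.b": [2.0]}
merged = weight_average([task_a, task_b])
print(merged)  # {'layer.w': [2.0, 3.0], 'layer.b': [1.0]}
```

Stronger MM methods (e.g., task-arithmetic or interference-resolving variants) replace the plain average with weighted or sparsified combinations, but operate on the same per-parameter structure.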
Key Contributions
- Challenges the prevailing assumption that model merging provides free adversarial robustness, demonstrating >95% relative transfer attack success rate against merged models
- Statistically validates three key findings across 336 settings (8 MM methods × 7 datasets × 6 attacks): stronger MM methods increase transfer vulnerability, reducing representation bias increases vulnerability, and weight averaging is most vulnerable despite being the weakest MM method
- Analyzes the underlying mechanisms driving increased vulnerability in merged models and proposes potential mitigations for robust system design
🛡️ Threat Analysis
The paper centers on adversarial example transferability — black-box transfer attacks where adversarial inputs crafted on a surrogate model successfully mislead merged target models at inference time, evaluated across 6 attack methods and 8 merging strategies.
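The transfer-attack threat model can be sketched end to end on a toy example: the attacker crafts an FGSM perturbation against a local surrogate classifier, then replays it against a separate target model it never queries. The tiny linear models, weights, and the single data point below are illustrative assumptions, not from the paper; they only demonstrate why similar decision boundaries make adversarial examples transfer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Probability of the positive class for a linear logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def fgsm(w, x, y, eps):
    """One FGSM step: move x along the sign of the loss gradient.
    For logistic loss, d(loss)/dx_i = (sigmoid(w.x) - y) * w_i."""
    err = predict(w, x) - y
    return [xi + eps * (1.0 if err * wi > 0 else -1.0)
            for xi, wi in zip(x, w)]

surrogate = [2.0, -1.0]   # attacker's local model (white-box access)
target    = [1.8, -0.9]   # deployed model with a similar boundary

x, y = [1.0, 0.5], 1      # a correctly classified positive example
x_adv = fgsm(surrogate, x, y, eps=0.8)

print(predict(target, x) > 0.5)      # True: clean input is classified correctly
print(predict(target, x_adv) > 0.5)  # False: the adversarial example transfers
```

The paper's central observation is that merging amplifies exactly this effect: the merged model's decision boundary stays close enough to its fine-tuned constituents (natural surrogates) that such black-box transfers succeed at over 95% relative rates.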