Merge Now, Regret Later: The Hidden Cost of Model Merging is Adversarial Transferability
Ankit Gangwal, Aaryan Ajay Sharma
Published on arXiv (arXiv:2509.23689)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Merged models are highly vulnerable to black-box transfer attacks, with over a 95% relative attack success rate; paradoxically, stronger model merging methods increase this vulnerability.
Model Merging (MM) has emerged as a promising alternative to multi-task learning: multiple fine-tuned models are combined, without access to the tasks' training data, into a single model that maintains performance across tasks. Recent works have explored the impact of MM on adversarial attacks, particularly backdoor attacks. However, none of them have sufficiently explored its impact on transfer attacks using adversarial examples, i.e., black-box adversarial attacks in which examples generated for a surrogate model successfully mislead a target model. In this work, we study the effect of MM on the transferability of adversarial examples. We perform comprehensive evaluations and statistical analysis covering 8 MM methods, 7 datasets, and 6 attack methods, sweeping over 336 distinct attack settings. Through this analysis, we first challenge the prevailing notion that MM confers free adversarial robustness, and show that MM cannot reliably defend against transfer attacks, with relative transfer attack success rates exceeding 95%. Moreover, we reveal 3 key insights for machine-learning practitioners regarding MM and transferability in robust system design: (1) stronger MM methods increase vulnerability to transfer attacks; (2) mitigating representation bias increases vulnerability to transfer attacks; and (3) weight averaging, despite being the weakest MM method, is the most vulnerable MM method to transfer attacks. Finally, we analyze the underlying reasons for this increased vulnerability and provide potential solutions to the problem. Our findings offer critical insights for designing more secure systems employing MM.
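To make the merging operation concrete, below is a minimal, hedged sketch of weight averaging, the simplest MM method the paper evaluates (and, per its finding (3), the most transfer-vulnerable). Models are represented here as plain dicts of parameter lists for illustration; real merging would operate on framework tensors (e.g., PyTorch state dicts), and the parameter names and values are hypothetical.

```python
def weight_average(models):
    """Merge fine-tuned models of one architecture by element-wise
    averaging their parameters (simple weight averaging)."""
    assert models, "need at least one model to merge"
    merged = {}
    for name in models[0]:
        # Collect this parameter from every fine-tuned model, then
        # average the corresponding entries element-wise.
        params = [m[name] for m in models]
        merged[name] = [sum(vals) / len(vals) for vals in zip(*params)]
    return merged

# Two hypothetical task-specific fine-tunes of the same base model.
task_a = {"layer.w": [1.0, 2.0], "layer.b": [0.0]}
task_b = {"layer.w": [3.0, 4.0], "layer.b": [2.0]}
merged = weight_average([task_a, task_b])
print(merged)  # {'layer.w': [2.0, 3.0], 'layer.b': [1.0]}
```

Stronger MM methods (e.g., task-arithmetic or interference-resolving variants) replace the plain average with weighted or sparsified combinations, but operate on the same per-parameter structure.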
Key Contributions
- Challenges the prevailing assumption that model merging provides free adversarial robustness, demonstrating >95% relative transfer attack success rate against merged models
- Statistically validates three key findings across 336 settings (8 MM methods × 7 datasets × 6 attacks): stronger MM methods increase transfer vulnerability, reducing representation bias increases vulnerability, and weight averaging is most vulnerable despite being the weakest MM method
- Analyzes the underlying mechanisms driving increased vulnerability in merged models and proposes potential mitigations for robust system design
🛡️ Threat Analysis
The paper centers on adversarial example transferability — black-box transfer attacks where adversarial inputs crafted on a surrogate model successfully mislead merged target models at inference time, evaluated across 6 attack methods and 8 merging strategies.
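The transfer-attack threat model can be sketched end to end on a toy example: the attacker crafts an FGSM perturbation against a local surrogate classifier, then replays it against a separate target model it never queries. The tiny linear models, weights, and the single data point below are illustrative assumptions, not from the paper; they only demonstrate why similar decision boundaries make adversarial examples transfer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Probability of the positive class for a linear logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def fgsm(w, x, y, eps):
    """One FGSM step: move x along the sign of the loss gradient.
    For logistic loss, d(loss)/dx_i = (sigmoid(w.x) - y) * w_i."""
    err = predict(w, x) - y
    return [xi + eps * (1.0 if err * wi > 0 else -1.0)
            for xi, wi in zip(x, w)]

surrogate = [2.0, -1.0]   # attacker's local model (white-box access)
target    = [1.8, -0.9]   # deployed model with a similar boundary

x, y = [1.0, 0.5], 1      # a correctly classified positive example
x_adv = fgsm(surrogate, x, y, eps=0.8)

print(predict(target, x) > 0.5)      # True: clean input is classified correctly
print(predict(target, x_adv) > 0.5)  # False: the adversarial example transfers
```

The paper's central observation is that merging amplifies exactly this effect: the merged model's decision boundary stays close enough to its fine-tuned constituents (natural surrogates) that such black-box transfers succeed at over 95% relative rates.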