The Forgotten Shield: Safety Grafting in Parameter-Space for Medical MLLMs
Jiale Zhao 1, Xing Mou 1, Jinlin Wu 2,3, Hongyuan Yu 4, Mingrui Sun 1, Yang Shi 1, Xuanwu Yin 4, Zhen Chen 3, Zhen Lei 2,3,5, Yaohua Wang 1
1 National University of Defense Technology
3 Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences
Published on arXiv: 2601.04199
Transfer Learning Attack (OWASP ML Top 10: ML07)
Prompt Injection (OWASP LLM Top 10: LLM01)
Key Finding
Parameter-Space Intervention significantly restores safety guardrails in Medical MLLMs without requiring additional domain-specific safety data and with minimal degradation to core medical performance.
Safety Grafting (Parameter-Space Intervention)
Novel technique introduced
Medical Multimodal Large Language Models (Medical MLLMs) have achieved remarkable progress in specialized medical tasks; however, research into their safety has lagged, posing potential risks for real-world deployment. In this paper, we first establish a multidimensional evaluation framework to systematically benchmark the safety of current SOTA Medical MLLMs. Our empirical analysis reveals pervasive vulnerabilities across both general and medical-specific safety dimensions in existing models, particularly highlighting their fragility against cross-modality jailbreak attacks. Furthermore, we find that the medical fine-tuning process frequently induces catastrophic forgetting of the model's original safety alignment. To address this challenge, we propose a novel "Parameter-Space Intervention" approach for efficient safety re-alignment. This method extracts intrinsic safety knowledge representations from original base models and concurrently injects them into the target model during the construction of medical capabilities. Additionally, we design a fine-grained parameter search algorithm to achieve an optimal trade-off between safety and medical performance. Experimental results demonstrate that our approach significantly bolsters the safety guardrails of Medical MLLMs without relying on additional domain-specific safety data, while minimizing degradation to core medical performance.
Key Contributions
- Multidimensional safety evaluation framework benchmarking current SOTA Medical MLLMs across general and medical-specific safety dimensions
- Empirical finding that medical fine-tuning systematically causes catastrophic forgetting of safety alignment, leaving models especially vulnerable to cross-modality jailbreak attacks
- Parameter-Space Intervention ('Safety Grafting') method that extracts intrinsic safety representations from base models and injects them during medical fine-tuning, with a fine-grained parameter search for optimal safety–utility trade-off
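The summary does not spell out the grafting operation itself, so the sketch below is one plausible reading: treat the base model's "intrinsic safety knowledge" as a direction in weight space and re-inject a scaled fraction of the safety-aligned base weights into the medically fine-tuned model. The function name, the dict-of-weights representation, and the linear interpolation form are all assumptions, in the spirit of task-arithmetic-style model merging rather than a reproduction of the paper's method:

```python
def safety_graft(base_weights, medical_weights, alpha=0.3):
    """Hypothetical parameter-space safety grafting (sketch).

    Interpolates each medically fine-tuned parameter back toward its
    safety-aligned base value; alpha controls how much of the base
    model's (forgotten) safety alignment is re-injected.
    """
    grafted = {}
    for name, w_med in medical_weights.items():
        w_base = base_weights[name]
        grafted[name] = [(1.0 - alpha) * m + alpha * b
                         for m, b in zip(w_med, w_base)]
    return grafted


# Toy two-parameter "models": alpha=0 keeps the medical model unchanged,
# alpha=1 recovers the base model exactly.
base = {"layer.w": [1.0, 0.0]}
medical = {"layer.w": [0.0, 2.0]}
print(safety_graft(base, medical, alpha=0.5))  # halfway between the two
```

In practice the coefficient could vary per layer or per module rather than being a single global scalar, which is one way to read the paper's "fine-grained" parameter search.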
🛡️ Threat Analysis
The paper's central finding is that medical fine-tuning induces catastrophic forgetting of the base model's safety alignment, a direct consequence of the distribution gap between safety-aligned pre-training and domain-specific fine-tuning data. The proposed defense (Parameter-Space Intervention) targets this fine-tuning process directly: it extracts safety representations from the base model and injects them concurrently during medical fine-tuning, making mitigation of the ML07 (Transfer Learning Attack) surface the paper's primary technical contribution.
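The fine-grained search algorithm is likewise not detailed in this summary. A minimal stand-in, under stated assumptions, is a grid search over the grafting coefficient that maximizes a safety score subject to a floor on medical performance; the linear interpolation form, the scalar coefficient, and the floor-constraint selection rule are all assumptions rather than the paper's actual procedure:

```python
def _interpolate(base, medical, alpha):
    # Assumed graft form: linear move from the medical model toward the
    # safety-aligned base model in parameter space.
    return {name: [(1.0 - alpha) * m + alpha * b
                   for m, b in zip(medical[name], base[name])]
            for name in medical}


def search_alpha(base, medical, eval_safety, eval_medical,
                 alphas=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5),
                 min_medical=0.90):
    """Pick the candidate coefficient with the best safety score among
    those whose grafted model still clears a medical-performance floor."""
    best_alpha, best_safety = None, float("-inf")
    for a in alphas:
        model = _interpolate(base, medical, a)
        if eval_medical(model) < min_medical:
            continue  # too much degradation of medical capability
        s = eval_safety(model)
        if s > best_safety:
            best_alpha, best_safety = a, s
    return best_alpha
```

Here `eval_safety` and `eval_medical` stand in for the paper's benchmark suites (e.g. jailbreak refusal rate and medical-QA accuracy); a per-layer variant would search one coefficient per parameter group instead of a single global alpha.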