Published on arXiv

2601.04199

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Parameter-Space Intervention significantly restores safety guardrails in Medical MLLMs without requiring additional domain-specific safety data and with minimal degradation to core medical performance.

Safety Grafting (Parameter-Space Intervention)

Novel technique introduced


Medical Multimodal Large Language Models (Medical MLLMs) have achieved remarkable progress in specialized medical tasks; however, research into their safety has lagged, posing potential risks for real-world deployment. In this paper, we first establish a multidimensional evaluation framework to systematically benchmark the safety of current SOTA Medical MLLMs. Our empirical analysis reveals pervasive vulnerabilities across both general and medical-specific safety dimensions in existing models, particularly highlighting their fragility against cross-modality jailbreak attacks. Furthermore, we find that the medical fine-tuning process frequently induces catastrophic forgetting of the model's original safety alignment. To address this challenge, we propose a novel "Parameter-Space Intervention" approach for efficient safety re-alignment. This method extracts intrinsic safety knowledge representations from original base models and concurrently injects them into the target model during the construction of medical capabilities. Additionally, we design a fine-grained parameter search algorithm to achieve an optimal trade-off between safety and medical performance. Experimental results demonstrate that our approach significantly bolsters the safety guardrails of Medical MLLMs without relying on additional domain-specific safety data, while minimizing degradation to core medical performance.


Key Contributions

  • Multidimensional safety evaluation framework benchmarking current SOTA Medical MLLMs across general and medical-specific safety dimensions
  • Empirical finding that medical fine-tuning systematically causes catastrophic forgetting of safety alignment, leaving models especially vulnerable to cross-modality jailbreak attacks
  • Parameter-Space Intervention ('Safety Grafting') method that extracts intrinsic safety representations from base models and injects them during medical fine-tuning, with a fine-grained parameter search for optimal safety–utility trade-off

🛡️ Threat Analysis

Transfer Learning Attack

The paper's central finding is that medical fine-tuning induces catastrophic forgetting of the base model's safety alignment, a direct consequence of the distribution gap between pre-training and fine-tuning data. The proposed defense (Parameter-Space Intervention) targets the fine-tuning process itself: it extracts safety representations from the base model and injects them concurrently while medical capabilities are built, making mitigation of this ML07-style transfer-learning vulnerability the paper's primary technical contribution.
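The mechanics of such an intervention can be sketched in a toy form. This is a minimal illustration under two assumptions that may differ from the paper's exact formulation: the "safety representation" is approximated as the parameter delta between the safety-aligned base model and the medically fine-tuned model (task-vector style), and the fine-grained parameter search is reduced to a 1-D scan over a scalar grafting coefficient. The injection is shown post hoc for simplicity, whereas the paper performs it concurrently during fine-tuning; all function names and scoring interfaces are hypothetical.

```python
# Hypothetical sketch of parameter-space safety grafting.
# Weights are modeled as dicts mapping parameter names to numpy arrays.
import numpy as np

def extract_safety_delta(base, medical):
    """Per-parameter delta between the safety-aligned base model and the
    medically fine-tuned model: the component lost to catastrophic forgetting."""
    return {k: base[k] - medical[k] for k in base}

def graft(medical, delta, alpha):
    """Inject a scaled fraction of the safety delta back into the medical model."""
    return {k: medical[k] + alpha * delta[k] for k in medical}

def search_alpha(medical, delta, safety_score, medical_score, alphas, min_med=0.95):
    """Toy stand-in for the fine-grained search: pick the alpha that maximizes
    safety while retaining at least `min_med` of the original medical score."""
    base_med = medical_score(medical)
    best_alpha, best_safety = None, -np.inf
    for a in alphas:
        candidate = graft(medical, delta, a)
        if medical_score(candidate) >= min_med * base_med:
            s = safety_score(candidate)
            if s > best_safety:
                best_alpha, best_safety = a, s
    return best_alpha
```

The constrained scan reflects the trade-off the paper describes: safety injection is pushed as far as the tolerated degradation in medical performance allows.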


Details

Domains
multimodal, vision, nlp
Model Types
vlm, llm, multimodal
Threat Tags
inference_time, training_time
Applications
medical visual question answering, medical image analysis, clinical ai assistants