The Forgotten Shield: Safety Grafting in Parameter-Space for Medical MLLMs
Jiale Zhao 1, Xing Mou 1, Jinlin Wu 2,3, Hongyuan Yu 4, Mingrui Sun 1, Yang Shi 1, Xuanwu Yin 4, Zhen Chen 3, Zhen Lei 2,3,5, Yaohua Wang 1
1 National University of Defense Technology
3 Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences
Published on arXiv: 2601.04199
Transfer Learning Attack (OWASP ML Top 10: ML07)
Prompt Injection (OWASP LLM Top 10: LLM01)
Key Finding
Parameter-Space Intervention significantly restores safety guardrails in Medical MLLMs without requiring additional domain-specific safety data and with minimal degradation to core medical performance.
Safety Grafting (Parameter-Space Intervention)
Novel technique introduced
Medical Multimodal Large Language Models (Medical MLLMs) have achieved remarkable progress in specialized medical tasks; however, research into their safety has lagged, posing potential risks for real-world deployment. In this paper, we first establish a multidimensional evaluation framework to systematically benchmark the safety of current SOTA Medical MLLMs. Our empirical analysis reveals pervasive vulnerabilities across both general and medical-specific safety dimensions in existing models, particularly highlighting their fragility against cross-modality jailbreak attacks. Furthermore, we find that the medical fine-tuning process frequently induces catastrophic forgetting of the model's original safety alignment. To address this challenge, we propose a novel "Parameter-Space Intervention" approach for efficient safety re-alignment. This method extracts intrinsic safety knowledge representations from original base models and concurrently injects them into the target model during the construction of medical capabilities. Additionally, we design a fine-grained parameter search algorithm to achieve an optimal trade-off between safety and medical performance. Experimental results demonstrate that our approach significantly bolsters the safety guardrails of Medical MLLMs without relying on additional domain-specific safety data, while minimizing degradation to core medical performance.
Key Contributions
- Multidimensional safety evaluation framework benchmarking current SOTA Medical MLLMs across general and medical-specific safety dimensions
- Empirical finding that medical fine-tuning systematically causes catastrophic forgetting of safety alignment, leaving models especially vulnerable to cross-modality jailbreak attacks
- Parameter-Space Intervention ('Safety Grafting') method that extracts intrinsic safety representations from base models and injects them during medical fine-tuning, with a fine-grained parameter search for optimal safety–utility trade-off
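The summary does not spell out the grafting operation itself, so the sketch below is one plausible reading: treat the base model's "intrinsic safety knowledge" as a direction in weight space and re-inject a scaled fraction of the safety-aligned base weights into the medically fine-tuned model. The function name, the dict-of-weights representation, and the linear interpolation form are all assumptions, in the spirit of task-arithmetic-style model merging rather than a reproduction of the paper's method:

```python
def safety_graft(base_weights, medical_weights, alpha=0.3):
    """Hypothetical parameter-space safety grafting (sketch).

    Interpolates each medically fine-tuned parameter back toward its
    safety-aligned base value; alpha controls how much of the base
    model's (forgotten) safety alignment is re-injected.
    """
    grafted = {}
    for name, w_med in medical_weights.items():
        w_base = base_weights[name]
        grafted[name] = [(1.0 - alpha) * m + alpha * b
                         for m, b in zip(w_med, w_base)]
    return grafted


# Toy two-parameter "models": alpha=0 keeps the medical model unchanged,
# alpha=1 recovers the base model exactly.
base = {"layer.w": [1.0, 0.0]}
medical = {"layer.w": [0.0, 2.0]}
print(safety_graft(base, medical, alpha=0.5))  # halfway between the two
```

In practice the coefficient could vary per layer or per module rather than being a single global scalar, which is one way to read the paper's "fine-grained" parameter search.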
🛡️ Threat Analysis
The paper's central finding is that medical fine-tuning induces catastrophic forgetting of the base model's safety alignment, a direct consequence of the distribution gap between safety-aligned pre-training and domain-specific fine-tuning data. The proposed defense (Parameter-Space Intervention) targets this fine-tuning process directly: it extracts safety representations from the base model and injects them concurrently during medical fine-tuning, making mitigation of the ML07 (Transfer Learning Attack) surface the paper's primary technical contribution.
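The fine-grained search algorithm is likewise not detailed in this summary. A minimal stand-in, under stated assumptions, is a grid search over the grafting coefficient that maximizes a safety score subject to a floor on medical performance; the linear interpolation form, the scalar coefficient, and the floor-constraint selection rule are all assumptions rather than the paper's actual procedure:

```python
def _interpolate(base, medical, alpha):
    # Assumed graft form: linear move from the medical model toward the
    # safety-aligned base model in parameter space.
    return {name: [(1.0 - alpha) * m + alpha * b
                   for m, b in zip(medical[name], base[name])]
            for name in medical}


def search_alpha(base, medical, eval_safety, eval_medical,
                 alphas=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5),
                 min_medical=0.90):
    """Pick the candidate coefficient with the best safety score among
    those whose grafted model still clears a medical-performance floor."""
    best_alpha, best_safety = None, float("-inf")
    for a in alphas:
        model = _interpolate(base, medical, a)
        if eval_medical(model) < min_medical:
            continue  # too much degradation of medical capability
        s = eval_safety(model)
        if s > best_safety:
            best_alpha, best_safety = a, s
    return best_alpha
```

Here `eval_safety` and `eval_medical` stand in for the paper's benchmark suites (e.g. jailbreak refusal rate and medical-QA accuracy); a per-layer variant would search one coefficient per parameter group instead of a single global alpha.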