Defense · 2025

Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs

Ziqi Wang 1, Chang Che 1, Qi Wang 2, Hui Ma 1, Zenglin Shi 1, Cees G. M. Snoek 3, Meng Wang 1

1 citation · 40 references · arXiv

Published on arXiv: 2511.20158

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

HPA maintains safety alignment and mitigates catastrophic forgetting better than existing continual learning baselines when fine-tuning safety-aligned MLLMs on new visual tasks.

HPA (Harmonious Parameter Adaptation)

Novel technique introduced


While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignment. This critical oversight ignores the fact that real-world MLLMs inherently require such mechanisms to mitigate potential risks. In this work, we shift our focus to CVIT for safety-aligned MLLMs and observe that during continual adaptation, the model not only suffers from task forgetting but also exhibits degradation in its safety alignment. Achieving a harmonious balance between safety and task performance remains a crucial challenge. To address this, we propose Harmonious Parameter Adaptation (HPA), a post-training framework composed of focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonal parameter adjustment. Specifically, HPA partitions parameters into two types based on whether they focus on safety or on task performance, and selects the most focused ones from each side to preserve, keeping the two objectives in balance. In addition, HPA imposes orthogonality constraints on parameter updates to further alleviate catastrophic forgetting. Extensive experiments on the CVIT benchmark and safety evaluation datasets demonstrate that HPA maintains high safety and mitigates forgetting better than existing baselines.


Key Contributions

  • Identifies and characterizes the dual problem of task forgetting AND safety alignment degradation during continual visual instruction tuning of safety-aligned MLLMs
  • Proposes HPA: a post-training framework using focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonality constraints to preserve safety-critical parameters during continual adaptation
  • Demonstrates empirically on the CVIT benchmark and safety evaluation datasets that HPA outperforms existing continual learning baselines in jointly maintaining safety and task performance
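The three-step pipeline described above can be sketched on a flat parameter vector. This is a hypothetical illustration assembled from the summary only: the importance scores (Fisher-style squared gradients are a common stand-in), the `keep_ratio`, and the set of protected directions are all assumptions, not the paper's exact criteria.

```python
import numpy as np

def focusing_partition(safety_imp, task_imp):
    """Step 1 (focusing-based partition): label each parameter as
    safety-focused (True) or task-focused (False) by which of the two
    importance scores dominates."""
    return safety_imp >= task_imp

def balanced_selection(safety_imp, task_imp, keep_ratio=0.2):
    """Step 2 (harmoniously balanced selection): preserve the same
    fraction of the most strongly focused parameters from each
    partition, so neither objective dominates the frozen set."""
    is_safety = focusing_partition(safety_imp, task_imp)
    preserve = np.zeros(safety_imp.shape, dtype=bool)
    for mask, imp in ((is_safety, safety_imp), (~is_safety, task_imp)):
        idx = np.flatnonzero(mask)
        if idx.size == 0:
            continue
        k = max(1, int(keep_ratio * idx.size))
        preserve[idx[np.argsort(imp[idx])[-k:]]] = True
    return preserve

def orthogonal_adjustment(delta, protected_directions):
    """Step 3 (orthogonal adjustment): sequentially remove from the
    candidate update its components along directions important to
    earlier tasks (exact when those directions are orthogonal)."""
    delta = delta.astype(float).copy()
    for d in protected_directions:
        delta -= (delta @ d) / (d @ d) * d
    return delta

# Putting it together: freeze preserved parameters, apply the
# orthogonalized update to the rest.
rng = np.random.default_rng(0)
theta = rng.normal(size=8)
safety_imp, task_imp = rng.random(8), rng.random(8)
delta = orthogonal_adjustment(rng.normal(size=8), [np.eye(8)[0]])
preserve = balanced_selection(safety_imp, task_imp)
theta_new = np.where(preserve, theta, theta + delta)
```

The key invariants are that preserved (safety- or task-critical) parameters are left untouched, and the remaining update carries no component along the protected directions, which is what limits interference with earlier behavior.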

🛡️ Threat Analysis

Transfer Learning Attack

The paper explicitly addresses the fine-tuning process (continual visual instruction tuning) as the mechanism that degrades safety alignment — the threat exploits the gap between the pre-training/RLHF safety alignment and the fine-tuning distribution. HPA is a defense that preserves safety-focused parameters during this fine-tuning process.


Details

Domains
vision, nlp, multimodal
Model Types
vlm, multimodal, llm
Threat Tags
training_time
Datasets
CVIT benchmark, safety evaluation datasets
Applications
multimodal large language models, visual instruction tuning, safety-aligned vlms