Defense · 2025

Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs

Ziqi Wang 1, Chang Che 1, Qi Wang 2, Hui Ma 1, Zenglin Shi 1, Cees G. M. Snoek 3, Meng Wang 1

1 citation · 40 references · arXiv

Published on arXiv: 2511.20158

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

HPA maintains safety alignment and mitigates catastrophic forgetting better than existing continual learning baselines when fine-tuning safety-aligned MLLMs on new visual tasks.

HPA (Harmonious Parameter Adaptation)

Novel technique introduced


While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignment. This critical oversight ignores the fact that real-world MLLMs inherently require such mechanisms to mitigate potential risks. In this work, we shift our focus to CVIT for safety-aligned MLLMs and observe that during continual adaptation, the model not only suffers from task forgetting but also exhibits degradation in its safety alignment. Achieving a harmonious balance between safety and task performance remains a crucial challenge. To address this, we propose Harmonious Parameter Adaptation (HPA), a post-training framework composed of focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonal parameter adjustment. Specifically, HPA partitions parameters into two types based on whether they focus on safety or on task performance, and selects the most focused ones from each side to preserve, keeping the two objectives in balance. In addition, HPA imposes orthogonality constraints on parameter updates to further alleviate catastrophic forgetting. Extensive experiments on the CVIT benchmark and safety evaluation datasets demonstrate that HPA maintains high safety and mitigates forgetting better than existing baselines.


Key Contributions

  • Identifies and characterizes the dual problem of task forgetting AND safety alignment degradation during continual visual instruction tuning of safety-aligned MLLMs
  • Proposes HPA: a post-training framework using focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonality constraints to preserve safety-critical parameters during continual adaptation
  • Demonstrates empirically on the CVIT benchmark and safety evaluation datasets that HPA outperforms existing continual learning baselines in jointly maintaining safety and task performance
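The three-step pipeline described above can be sketched on a flat parameter vector. This is a hypothetical illustration assembled from the summary only: the importance scores (Fisher-style squared gradients are a common stand-in), the `keep_ratio`, and the set of protected directions are all assumptions, not the paper's exact criteria.

```python
import numpy as np

def focusing_partition(safety_imp, task_imp):
    """Step 1 (focusing-based partition): label each parameter as
    safety-focused (True) or task-focused (False) by which of the two
    importance scores dominates."""
    return safety_imp >= task_imp

def balanced_selection(safety_imp, task_imp, keep_ratio=0.2):
    """Step 2 (harmoniously balanced selection): preserve the same
    fraction of the most strongly focused parameters from each
    partition, so neither objective dominates the frozen set."""
    is_safety = focusing_partition(safety_imp, task_imp)
    preserve = np.zeros(safety_imp.shape, dtype=bool)
    for mask, imp in ((is_safety, safety_imp), (~is_safety, task_imp)):
        idx = np.flatnonzero(mask)
        if idx.size == 0:
            continue
        k = max(1, int(keep_ratio * idx.size))
        preserve[idx[np.argsort(imp[idx])[-k:]]] = True
    return preserve

def orthogonal_adjustment(delta, protected_directions):
    """Step 3 (orthogonal adjustment): sequentially remove from the
    candidate update its components along directions important to
    earlier tasks (exact when those directions are orthogonal)."""
    delta = delta.astype(float).copy()
    for d in protected_directions:
        delta -= (delta @ d) / (d @ d) * d
    return delta

# Putting it together: freeze preserved parameters, apply the
# orthogonalized update to the rest.
rng = np.random.default_rng(0)
theta = rng.normal(size=8)
safety_imp, task_imp = rng.random(8), rng.random(8)
delta = orthogonal_adjustment(rng.normal(size=8), [np.eye(8)[0]])
preserve = balanced_selection(safety_imp, task_imp)
theta_new = np.where(preserve, theta, theta + delta)
```

The key invariants are that preserved (safety- or task-critical) parameters are left untouched, and the remaining update carries no component along the protected directions, which is what limits interference with earlier behavior.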

🛡️ Threat Analysis

Transfer Learning Attack

The paper explicitly addresses the fine-tuning process (continual visual instruction tuning) as the mechanism that degrades safety alignment — the threat exploits the gap between the pre-training/RLHF safety alignment and the fine-tuning distribution. HPA is a defense that preserves safety-focused parameters during this fine-tuning process.


Details

Domains
vision, nlp, multimodal
Model Types
vlm, multimodal, llm
Threat Tags
training_time
Datasets
CVIT benchmark, safety evaluation datasets
Applications
multimodal large language models, visual instruction tuning, safety-aligned vlms