
Narrow fine-tuning erodes safety alignment in vision-language agents

Idhant Gulati, Shivam Raval

0 citations · 49 references · arXiv (Cornell University)


Published on arXiv (2602.16931)

Transfer Learning Attack (OWASP ML Top 10 — ML07)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

Even 10% harmful data in the fine-tuning mixture causes substantial alignment degradation, with multimodal evaluation revealing 70.71% misalignment at LoRA rank 128 versus 41.19% under text-only evaluation
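The rank dependence in the finding above follows from the capacity of the LoRA update itself: the adapter adds a correction of rank at most r to a frozen weight matrix, so a larger rank gives the fine-tuning objective more room to rewrite aligned behavior. A minimal numpy sketch of the update (the dimensions, scaling factor, and random weights here are illustrative assumptions, not the paper's actual Gemma3-4B configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 64, 64, 8            # r is the LoRA rank; the paper sweeps up to r = 128
alpha = 16                            # hypothetical LoRA scaling hyperparameter

W = rng.normal(size=(d_out, d_in))    # frozen pretrained weight
A = rng.normal(size=(r, d_in))        # trainable down-projection
B = rng.normal(size=(d_out, r))       # trainable up-projection

# the adapted weight differs from W by a matrix of rank at most r
delta = (alpha / r) * (B @ A)
W_adapted = W + delta

assert np.linalg.matrix_rank(delta) <= r
```

Because the update is confined to a rank-r subspace, sweeping r (as the paper does, up to 128) directly varies how much of the weight space the harmful fine-tune can reach.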


Lifelong multimodal agents must continuously adapt to new tasks through post-training, but this creates a fundamental tension between acquiring capabilities and preserving safety alignment. We demonstrate that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces severe emergent misalignment that generalizes broadly across unrelated tasks and modalities. Through experiments on Gemma3-4B, we show that misalignment scales monotonically with LoRA rank, and that multimodal evaluation reveals substantially higher misalignment ($70.71 \pm 1.22$ at $r=128$) than text-only evaluation ($41.19 \pm 2.51$), suggesting that unimodal safety benchmarks may underestimate alignment degradation in vision-language models. Critically, even 10% harmful data in the training mixture induces substantial alignment degradation. Geometric analysis reveals that harmful behaviors occupy a remarkably low-dimensional subspace, with the majority of misalignment information captured in 10 principal components. To mitigate misalignment, we evaluate two strategies: benign narrow fine-tuning and activation-based steering. While both approaches substantially reduce misalignment, neither completely removes the learned harmful behaviors. Our findings highlight the need for robust continual learning frameworks, as current post-training paradigms may not sufficiently preserve alignment in post-deployment settings.
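Of the two mitigations the abstract evaluates, activation-based steering can be sketched in its simplest form: ablating a single "misalignment" direction from a hidden activation. Everything below (the direction v, the dimensionality, the activation h) is a synthetic stand-in for illustration, not a vector extracted from any model:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 128
v = rng.normal(size=d)
v /= np.linalg.norm(v)        # hypothetical unit "misalignment" direction in activation space

h = rng.normal(size=d)        # a residual-stream activation to steer

# one common steering variant: remove the component of h along v;
# the paper reports that steering reduces but does not eliminate misalignment
h_steered = h - (h @ v) * v

assert abs(h_steered @ v) < 1e-8   # no remaining component along v
```

That the residual component can still carry harmful behavior is consistent with the paper's observation that neither mitigation fully removes what the fine-tune learned.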


Key Contributions

  • Demonstrates that LoRA fine-tuning on narrow-domain harmful data causes emergent misalignment that generalizes across unrelated tasks and modalities in VLMs, scaling monotonically with LoRA rank
  • Shows multimodal evaluation reveals substantially higher misalignment (70.71% ± 1.22 at r=128) than text-only evaluation (41.19% ± 2.51), indicating unimodal safety benchmarks underestimate alignment degradation in VLMs
  • Geometric analysis reveals harmful behaviors occupy a low-dimensional subspace (~10 principal components), and neither benign fine-tuning nor activation-based steering fully removes learned harmful behaviors
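The low-dimensional-subspace claim in the last bullet corresponds to running PCA on activation differences and checking how much variance the top components capture. A self-contained numpy sketch on synthetic data (the 10-dimensional structure is planted by construction, purely to illustrate the measurement, not to reproduce the paper's result):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for harmful-minus-benign activation differences:
# n samples in a d-dimensional hidden space, with the signal confined
# (by construction) to a k-dimensional subspace plus small noise.
n, d, k = 500, 256, 10
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]            # orthonormal k-dim subspace
coeffs = rng.normal(size=(n, k)) * 5.0                      # strong signal in that subspace
diffs = coeffs @ basis.T + 0.1 * rng.normal(size=(n, d))    # signal + isotropic noise

# PCA via SVD of the centered difference matrix
centered = diffs - diffs.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()

top10 = explained[:10].sum()
print(f"variance captured by top 10 PCs: {top10:.3f}")
```

On data with this planted structure, the top 10 principal components capture nearly all of the variance; the paper's analogous finding is that real misalignment information concentrates in about 10 components.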

🛡️ Threat Analysis

Transfer Learning Attack

The core attack vector is the LoRA fine-tuning/adapter tuning process itself — fine-tuning a safety-aligned VLM on narrow harmful data exploits the transfer learning stage to broadly erode safety alignment, which is precisely what ML07 covers (attacks exploiting fine-tuning, adapter tuning, and the pre-training/fine-tuning gap).


Details

Domains
multimodal, vision, nlp
Model Types
vlm, llm
Threat Tags
training_time
Models
Gemma3-4B
Applications
vision-language models, multimodal agents, safety-aligned LLMs