
Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT

Le Yu 1, Zhengyue Zhao 2, Yawen Zheng 3, Yunhao Liu 3,4

0 citations · 43 references · arXiv


Published on arXiv · 2511.14106

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Stealth Fine-Tuning achieves a 38.66% higher attack success rate than IDEATOR on AdvBench, using only 499 samples and under 3 hours of QLoRA fine-tuning on a single A100

Stealth Fine-Tuning

Novel technique introduced


Reasoning-augmented Vision-Language Models (RVLMs) rely on safety alignment to prevent harmful behavior, yet their exposed chain-of-thought (CoT) traces introduce new attack surfaces. In this work, we find that the safety alignment of RVLMs can be easily broken through a novel attack method termed Stealth Fine-Tuning. Our method elicits harmful reasoning traces through segment-level interference and reuses the self-generated outputs as supervised fine-tuning data. A turn-based weighted loss design then yields a lightweight, distribution-consistent fine-tuning method. In our experiments, with only 499 samples and under 3 hours on a single A100 (QLoRA), Stealth Fine-Tuning outperforms IDEATOR by 38.52% ASR while preserving general reasoning ability, as the tuned model retains the original representation distribution. Experiments on AdvBench and several general benchmarks demonstrate that Stealth Fine-Tuning is a low-cost and highly effective way to bypass alignment defenses. Disclaimer: This paper contains content that may be disturbing or offensive.


Key Contributions

  • Segment-level interference technique to elicit harmful chain-of-thought reasoning traces from aligned RVLMs without external harmful data
  • Turn-based weighted loss design that minimizes distribution shift during fine-tuning, preserving general reasoning ability while breaking safety alignment
  • Low-cost alignment bypass: 499 self-generated samples + QLoRA on a single A100 in under 3 hours outperforms IDEATOR by 38.66% ASR on AdvBench
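The "turn-based weighted loss" in the contributions can be pictured as weighting each token's loss by the conversation turn it belongs to. The sketch below is a toy illustration of that idea; the weighting scheme, function name, and values are assumptions for illustration, not the paper's actual formulation:

```python
def turn_weighted_loss(token_nlls, turn_ids, turn_weights):
    """Toy turn-weighted SFT loss: a weighted mean of per-token
    negative log-likelihoods, where the weight depends on which
    conversation turn a token belongs to.

    token_nlls   -- per-token negative log-likelihoods from the model
    turn_ids     -- turn index for each token
    turn_weights -- hypothetical per-turn weights (e.g. down-weight
                    earlier turns, emphasise the final assistant turn)
    """
    weighted = [nll * turn_weights[t] for nll, t in zip(token_nlls, turn_ids)]
    norm = sum(turn_weights[t] for t in turn_ids)
    return sum(weighted) / norm

# Toy example: a 2-turn dialogue where the second turn is weighted 4x
# as heavily as the first.
loss = turn_weighted_loss(
    token_nlls=[1.0, 2.0, 0.5, 1.5],
    turn_ids=[0, 0, 1, 1],
    turn_weights={0: 0.5, 1: 2.0},
)
# → 1.1
```

In a real training loop these per-token NLLs would come from the model's logits; the point of the weighting is to concentrate the gradient signal on the turns that matter while limiting distribution shift elsewhere.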

🛡️ Threat Analysis

Transfer Learning Attack

The attack exploits the fine-tuning process (QLoRA/LoRA adapters) as the primary attack vector, using self-generated harmful data to remove safety alignment. This maps directly to ML07's 'Adapter/LoRA trojans' and 'RLHF/preference manipulation to embed malicious behavior' subcategories.
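For context on why this vector is so cheap, a QLoRA adapter setup of the kind the paper describes looks roughly like the config fragment below (Hugging Face `peft` and `bitsandbytes`). All hyperparameter values and target modules here are illustrative assumptions, not the paper's reported settings:

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit quantization of the frozen base model (assumed NF4 setup).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Small trainable LoRA adapter on attention projections
# (rank/alpha/targets are illustrative, not from the paper).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
```

Because only the small adapter is trained against a 4-bit base model, a few hundred self-generated samples and a single A100 suffice, which is exactly what makes this attack surface hard to gate on compute or data-access grounds.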


Details

Domains
multimodal, nlp
Model Types
vlm, llm, transformer
Threat Tags
white_box, training_time, targeted
Datasets
AdvBench
Applications
vision-language models, reasoning models, safety alignment