Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT
Le Yu 1, Zhengyue Zhao 2, Yawen Zheng 3, Yunhao Liu 3,4
Published on arXiv
2511.14106
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Stealth Fine-Tuning achieves 38.66% higher attack success rate than IDEATOR on AdvBench using only 499 samples and under 3 hours of QLoRA fine-tuning on a single A100
Stealth Fine-Tuning
Novel technique introduced
Reasoning-augmented Vision-Language Models (RVLMs) rely on safety alignment to prevent harmful behavior, yet their exposed chain-of-thought (CoT) traces introduce new attack surfaces. In this work, we find that the safety alignment of RVLMs can be easily broken through a novel attack method termed **Stealth Fine-Tuning**. Our method elicits harmful reasoning traces through **segment-level interference** and reuses the self-generated outputs as supervised fine-tuning data. A **turn-based weighted** loss design yields a lightweight, distribution-consistent fine-tuning method. In our experiments, with only 499 samples and under 3 hours on a single A100 (QLoRA), Stealth Fine-Tuning outperforms IDEATOR by 38.52% ASR while preserving general reasoning ability, as the tuned model retains the original representation distribution. Experiments on AdvBench and several general benchmarks demonstrate that Stealth Fine-Tuning is a low-cost and highly effective way to bypass alignment defenses. **Disclaimer: This paper contains content that may be disturbing or offensive.**
Key Contributions
- Segment-level interference technique to elicit harmful chain-of-thought reasoning traces from aligned RVLMs without external harmful data
- Turn-based weighted loss design that minimizes distribution shift during fine-tuning, preserving general reasoning ability while breaking safety alignment
- Low-cost alignment bypass: 499 self-generated samples + QLoRA on a single A100 in under 3 hours outperforms IDEATOR by 38.66% ASR on AdvBench
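The paper does not publish its loss implementation, but the "turn-based weighted" idea can be illustrated generically: a per-token negative log-likelihood where each token's contribution is scaled by a weight assigned to the conversation turn it belongs to. The sketch below is a minimal, dependency-free illustration of that weighting scheme; the function name and signature are hypothetical, not the authors' code.

```python
import math

def turn_weighted_loss(token_logprobs, turn_ids, turn_weights):
    """Turn-weighted negative log-likelihood over a multi-turn sequence.

    token_logprobs: log-probability assigned to each target token
    turn_ids:       index of the conversation turn each token belongs to
    turn_weights:   per-turn weight (e.g. downweight early turns to limit
                    distribution shift on benign dialogue)
    """
    total, norm = 0.0, 0.0
    for lp, turn in zip(token_logprobs, turn_ids):
        w = turn_weights[turn]
        total += -w * lp   # weighted NLL contribution of this token
        norm += w          # normalize by total weight, not token count
    return total / norm if norm > 0 else 0.0

# With uniform weights this reduces to the ordinary mean NLL; setting a
# turn's weight to 0 excludes its tokens from the objective entirely.
loss = turn_weighted_loss(
    [math.log(0.5), math.log(0.5)],  # two tokens, each with prob 0.5
    [0, 1],                          # one token per turn
    [1.0, 1.0],                      # uniform weighting
)
```

With uniform weights the example above evaluates to ln 2, the plain mean NLL, which is the sense in which the weighting only *re-balances* the standard SFT objective rather than replacing it.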
🛡️ Threat Analysis
The attack exploits the fine-tuning process (QLoRA/LoRA adapters) as the primary attack vector — using self-generated harmful data to remove safety alignment. This directly matches ML07's 'Adapter/LoRA trojans' and 'RLHF/preference manipulation to embed malicious behavior' subcategories.