Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT
Le Yu 1, Zhengyue Zhao 2, Yawen Zheng 3, Yunhao Liu 3,4
Published on arXiv
2511.14106
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Stealth Fine-Tuning achieves 38.66% higher attack success rate than IDEATOR on AdvBench using only 499 samples and under 3 hours of QLoRA fine-tuning on a single A100
Stealth Fine-Tuning
Novel technique introduced
Reasoning-augmented Vision-Language Models (RVLMs) rely on safety alignment to prevent harmful behavior, yet their exposed chain-of-thought (CoT) traces introduce new attack surfaces. In this work, we find that the safety alignment of RVLMs can be easily broken through a novel attack method termed **Stealth Fine-Tuning**. Our method elicits harmful reasoning traces through **segment-level interference** and reuses the self-generated outputs as supervised fine-tuning data. A **turn-based weighted** loss design yields a lightweight, distribution-consistent fine-tuning method. In our experiments, with only 499 samples and under 3 hours on a single A100 (QLoRA), Stealth Fine-Tuning outperforms IDEATOR by 38.52% ASR while preserving general reasoning ability, as the tuned model retains the original representation distribution. Experiments on AdvBench and several general benchmarks demonstrate that Stealth Fine-Tuning is a low-cost and highly effective way to bypass alignment defenses. **Disclaimer: This paper contains content that may be disturbing or offensive.**
Key Contributions
- Segment-level interference technique to elicit harmful chain-of-thought reasoning traces from aligned RVLMs without external harmful data
- Turn-based weighted loss design that minimizes distribution shift during fine-tuning, preserving general reasoning ability while breaking safety alignment
- Low-cost alignment bypass: 499 self-generated samples + QLoRA on a single A100 in under 3 hours outperforms IDEATOR by 38.66% ASR on AdvBench
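The paper does not publish its loss implementation, but the "turn-based weighted" idea can be illustrated generically: a per-token negative log-likelihood where each token's contribution is scaled by a weight assigned to the conversation turn it belongs to. The sketch below is a minimal, dependency-free illustration of that weighting scheme; the function name and signature are hypothetical, not the authors' code.

```python
import math

def turn_weighted_loss(token_logprobs, turn_ids, turn_weights):
    """Turn-weighted negative log-likelihood over a multi-turn sequence.

    token_logprobs: log-probability assigned to each target token
    turn_ids:       index of the conversation turn each token belongs to
    turn_weights:   per-turn weight (e.g. downweight early turns to limit
                    distribution shift on benign dialogue)
    """
    total, norm = 0.0, 0.0
    for lp, turn in zip(token_logprobs, turn_ids):
        w = turn_weights[turn]
        total += -w * lp   # weighted NLL contribution of this token
        norm += w          # normalize by total weight, not token count
    return total / norm if norm > 0 else 0.0

# With uniform weights this reduces to the ordinary mean NLL; setting a
# turn's weight to 0 excludes its tokens from the objective entirely.
loss = turn_weighted_loss(
    [math.log(0.5), math.log(0.5)],  # two tokens, each with prob 0.5
    [0, 1],                          # one token per turn
    [1.0, 1.0],                      # uniform weighting
)
```

With uniform weights the example above evaluates to ln 2, the plain mean NLL, which is the sense in which the weighting only *re-balances* the standard SFT objective rather than replacing it.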
🛡️ Threat Analysis
The attack exploits the fine-tuning process (QLoRA/LoRA adapters) as the primary attack vector — using self-generated harmful data to remove safety alignment. This directly matches ML07's 'Adapter/LoRA trojans' and 'RLHF/preference manipulation to embed malicious behavior' subcategories.