attack 2025

Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach

Xiangfang Li 1,2,3, Yu Wang 1,2, Bo Li 1,2,3

2 citations · 50 references · arXiv

α

Published on arXiv

2510.01342

Model Poisoning

OWASP ML Top 10 — ML10

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves over 97% attack success rate on both GPT-4o and GPT-4.1 via OpenAI's production fine-tuning interface while bypassing data filtering, defensive fine-tuning, and post-training safety audits.

Three-Pronged Fine-Tuning Jailbreak Attack

Novel technique introduced


With the rapid advancement of large language models (LLMs), ensuring their safe use becomes increasingly critical. Fine-tuning is a widely used method for adapting models to downstream tasks, yet it is vulnerable to jailbreak attacks. However, most existing studies focus on overly simplified attack scenarios, limiting their practical relevance to real-world defense settings. To make this risk concrete, we present a three-pronged jailbreak attack and evaluate it against provider defenses under a dataset-only black-box fine-tuning interface. In this setting, the attacker can only submit fine-tuning data to the provider, while the provider may deploy defenses across stages: (1) pre-upload data filtering, (2) training-time defensive fine-tuning, and (3) post-training safety audit. Our attack combines safety-styled prefix/suffix wrappers, benign lexical encodings (underscoring) of sensitive tokens, and a backdoor mechanism, enabling the model to learn harmful behaviors while individual datapoints appear innocuous. Extensive experiments demonstrate the effectiveness of our approach. In real-world deployment, our method successfully jailbreaks GPT-4.1 and GPT-4o on the OpenAI platform with attack success rates above 97% for both models. Our code is available at https://github.com/lxf728/tri-pronged-ft-attack.


Key Contributions

  • Realistic black-box fine-tuning threat model with three provider defense layers (pre-upload filtering, defensive fine-tuning, post-training safety audit)
  • Three-pronged attack combining safety-styled prefix/suffix wrappers, lexical encoding (underscoring) of sensitive tokens, and a backdoor trigger mechanism to bypass all three defense layers
  • Real-world validation achieving 97%+ attack success rates on GPT-4o and GPT-4.1 via OpenAI's fine-tuning platform

🛡️ Threat Analysis

Model Poisoning

Core technique is a backdoor embedded through fine-tuning data: the model generates harmful outputs only when a specific trigger is present, behaving safely otherwise — this is the classic backdoor/trojan attack pattern. The backdoor is specifically designed to bypass post-training safety audits.


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
black_boxtraining_timetargeted
Datasets
AdvBench
Applications
llm fine-tuning apisllm safety alignment