
Objective Matters: Fine-Tuning Objectives Shape Safety, Robustness, and Persona Drift

Daniel Vennemeyer 1, Punya Syon Pandey 2, Phan Anh Duong 1, Michael Umeokoli, Samuel Ratnam 3

0 citations · 38 references · arXiv


Published on arXiv · 2601.12639

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

At large training budgets, ORPO achieves the lowest jailbreak ASR (8.7% at 800k tokens), versus SFT/DPO, which tightly couple capability gains to monotonically rising adversarial vulnerability and Dark Triad persona drift.
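The 8.7% figure is an attack success rate (ASR). As a point of orientation, ASR is conventionally the fraction of adversarial prompts that a judge marks as having elicited a policy-violating response; a minimal sketch, assuming the judging step (automated in benchmarks such as StrongREJECT) has already produced booleans:

```python
def attack_success_rate(outcomes):
    """Fraction of jailbreak attempts judged successful.

    `outcomes` is an iterable of booleans, True where the adversarial
    prompt elicited harmful content according to the judge.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one evaluated attack")
    return sum(outcomes) / len(outcomes)
```

Lower is safer: 8.7% means roughly 1 in 11 attacks succeeded against the ORPO-tuned model at the 800k-token budget.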


Fine-tuning LLMs on benign data can still degrade alignment and adversarial robustness, yet direct analysis of the role of fine-tuning objectives in shaping these safety outcomes remains limited. We present a controlled comparison of six fine-tuning objectives -- Supervised Fine-Tuning, Direct Preference Optimization, Conditional Fine-Tuning, Inoculation Prompting, Odds Ratio Preference Optimization, and KL-regularized fine-tuning -- holding data, domain, architecture, and optimization fixed. Across closed-form reasoning and open-ended generation tasks, we find that objective choice induces systematic, scale-dependent shifts along the safety-capability frontier. At small training budgets, robustness is similar across objectives but capability differs. At larger budgets, objectives diverge sharply: supervised and preference-based tuning tightly couple capability gains to increased adversarial vulnerability and persona drift, while objectives that constrain learning signals -- especially ORPO and KL-regularization -- substantially mitigate both. Fine-tuning objectives therefore matter little for safety at small scales but become a primary driver of adversarial robustness and latent persona stability as training scale increases.


Key Contributions

  • Controlled, objective-level empirical comparison of six fine-tuning paradigms (SFT, DPO, CFT, IP, ORPO, KL-reg) with data, architecture, and optimization held fixed
  • Shows that fine-tuning objective choice becomes the primary driver of adversarial vulnerability and persona drift at large training scales (200k–800k tokens), while mattering little at small scales
  • Identifies ORPO and KL-regularized fine-tuning as objectives that substantially decouple capability gains from increased jailbreak susceptibility and Dark Triad persona drift
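The contrast between these objectives can be made concrete at the level of per-sequence log-probabilities. The sketch below gives textbook scalar forms of the SFT, DPO, ORPO, and KL-regularized losses; it is an illustrative toy, not the paper's implementation (its hyperparameters, and the CFT/IP variants, are not reproduced here):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sft_loss(logp_chosen: float) -> float:
    # Supervised fine-tuning: plain negative log-likelihood of the target.
    return -logp_chosen

def dpo_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, beta=0.1):
    # DPO: log-sigmoid of the scaled difference between the policy's and
    # the frozen reference model's log-ratios on chosen vs. rejected.
    margin = beta * ((logp_c - ref_logp_c) - (logp_r - ref_logp_r))
    return -math.log(sigmoid(margin))

def orpo_loss(logp_c, logp_r, lam=0.1):
    # ORPO: SFT NLL plus an odds-ratio penalty that disfavours the
    # rejected response, with no reference model required.
    def log_odds(logp):
        return logp - math.log(1.0 - math.exp(logp))
    ratio = log_odds(logp_c) - log_odds(logp_r)
    return -logp_c - lam * math.log(sigmoid(ratio))

def kl_reg_loss(logp_c, ref_logp_c, beta=0.1):
    # KL-regularized fine-tuning: NLL plus a single-sample estimate of
    # KL(policy || reference), which anchors the policy to the reference.
    return -logp_c + beta * (logp_c - ref_logp_c)
```

The structural point the paper makes is visible here: ORPO and the KL term both add an explicit brake (the odds-ratio penalty, the divergence penalty) that constrains how far the learning signal can pull the model, whereas SFT and DPO optimize fit or preference margin without such an anchor.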

🛡️ Threat Analysis

Transfer Learning Attack

The paper specifically studies how the choice of fine-tuning objective (SFT, DPO, ORPO, KL-regularization, etc.) shapes safety outcomes during the transfer learning process — including degradation of alignment, increased adversarial vulnerability, and persona drift induced by fine-tuning on benign data. This is core ML07: effects that manifest through and are controlled by the fine-tuning/RLHF process.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, inference_time, black_box
Datasets
GSM8K, SuperGPQA, StrongREJECT, Cybersecurity QA, Legal Reasoning QA
Applications
llm fine-tuning, safety alignment, domain adaptation