
Objective Matters: Fine-Tuning Objectives Shape Safety, Robustness, and Persona Drift

Daniel Vennemeyer 1, Punya Syon Pandey 2, Phan Anh Duong 1, Michael Umeokoli, Samuel Ratnam 3

0 citations · 38 references · arXiv


Published on arXiv · 2601.12639

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

At large training budgets, ORPO achieves the lowest jailbreak ASR (8.7% at 800k tokens), versus SFT/DPO, which tightly couple capability gains to monotonically rising adversarial vulnerability and Dark Triad persona drift.
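The 8.7% figure is an attack success rate (ASR). As a point of orientation, ASR is conventionally the fraction of adversarial prompts that a judge marks as having elicited a policy-violating response; a minimal sketch, assuming the judging step (automated in benchmarks such as StrongREJECT) has already produced booleans:

```python
def attack_success_rate(outcomes):
    """Fraction of jailbreak attempts judged successful.

    `outcomes` is an iterable of booleans, True where the adversarial
    prompt elicited harmful content according to the judge.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one evaluated attack")
    return sum(outcomes) / len(outcomes)
```

Lower is safer: 8.7% means roughly 1 in 11 attacks succeeded against the ORPO-tuned model at the 800k-token budget.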


Fine-tuning LLMs on benign data can still degrade alignment and adversarial robustness, yet direct analysis of the role of fine-tuning objectives in shaping these safety outcomes remains limited. We present a controlled comparison of six fine-tuning objectives -- Supervised Fine-Tuning, Direct Preference Optimization, Conditional Fine-Tuning, Inoculation Prompting, Odds Ratio Preference Optimization, and KL-regularized fine-tuning -- holding data, domain, architecture, and optimization fixed. Across closed-form reasoning and open-ended generation tasks, we find that objective choice induces systematic, scale-dependent shifts along the safety-capability frontier. At small training budgets, robustness is similar across objectives but capability differs. At larger budgets, objectives diverge sharply: supervised and preference-based tuning tightly couple capability gains to increased adversarial vulnerability and persona drift, while objectives that constrain learning signals -- especially ORPO and KL-regularization -- substantially mitigate both. Fine-tuning objectives therefore matter little for safety at small scales but become a primary driver of adversarial robustness and latent persona stability as training scale increases.


Key Contributions

  • Controlled, objective-level empirical comparison of six fine-tuning paradigms (SFT, DPO, CFT, IP, ORPO, KL-reg) with data, architecture, and optimization held fixed
  • Shows that fine-tuning objective choice becomes the primary driver of adversarial vulnerability and persona drift at large training scales (200k–800k tokens), while mattering little at small scales
  • Identifies ORPO and KL-regularized fine-tuning as objectives that substantially decouple capability gains from increased jailbreak susceptibility and Dark Triad persona drift
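The contrast between these objectives can be made concrete at the level of per-sequence log-probabilities. The sketch below gives textbook scalar forms of the SFT, DPO, ORPO, and KL-regularized losses; it is an illustrative toy, not the paper's implementation (its hyperparameters, and the CFT/IP variants, are not reproduced here):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sft_loss(logp_chosen: float) -> float:
    # Supervised fine-tuning: plain negative log-likelihood of the target.
    return -logp_chosen

def dpo_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, beta=0.1):
    # DPO: log-sigmoid of the scaled difference between the policy's and
    # the frozen reference model's log-ratios on chosen vs. rejected.
    margin = beta * ((logp_c - ref_logp_c) - (logp_r - ref_logp_r))
    return -math.log(sigmoid(margin))

def orpo_loss(logp_c, logp_r, lam=0.1):
    # ORPO: SFT NLL plus an odds-ratio penalty that disfavours the
    # rejected response, with no reference model required.
    def log_odds(logp):
        return logp - math.log(1.0 - math.exp(logp))
    ratio = log_odds(logp_c) - log_odds(logp_r)
    return -logp_c - lam * math.log(sigmoid(ratio))

def kl_reg_loss(logp_c, ref_logp_c, beta=0.1):
    # KL-regularized fine-tuning: NLL plus a single-sample estimate of
    # KL(policy || reference), which anchors the policy to the reference.
    return -logp_c + beta * (logp_c - ref_logp_c)
```

The structural point the paper makes is visible here: ORPO and the KL term both add an explicit brake (the odds-ratio penalty, the divergence penalty) that constrains how far the learning signal can pull the model, whereas SFT and DPO optimize fit or preference margin without such an anchor.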

🛡️ Threat Analysis

Transfer Learning Attack

The paper specifically studies how the choice of fine-tuning objective (SFT, DPO, ORPO, KL-regularization, etc.) shapes safety outcomes during the transfer learning process — including degradation of alignment, increased adversarial vulnerability, and persona drift induced by fine-tuning on benign data. This is core ML07: effects that manifest through and are controlled by the fine-tuning/RLHF process.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, inference_time, black_box
Datasets
GSM8K, SuperGPQA, StrongREJECT, Cybersecurity QA, Legal Reasoning QA
Applications
llm fine-tuning, safety alignment, domain adaptation