benchmark 2026

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

0 citations

Published on arXiv: 2604.07754

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

ORPO achieves the highest misalignment effectiveness while balancing utility and cost; DPO is the most effective method for realignment; LoRA needs as few as 13 samples to compromise the safety of Llama3.1 and GLM4.

The deployment of large language models (LLMs) raises significant ethical and safety concerns. While LLM alignment techniques are adopted to improve model safety and trustworthiness, adversaries can exploit these techniques to undermine safety for malicious purposes, resulting in misalignment. Misaligned LLMs may be published on open platforms to magnify harm. To address this, additional safety alignment, referred to as realignment, is necessary before deploying untrusted third-party LLMs. This study explores the efficacy of fine-tuning methods in terms of misalignment, realignment, and the effects of their interplay. By evaluating four Supervised Fine-Tuning (SFT) and two Preference Fine-Tuning (PFT) methods across four popular safety-aligned LLMs, we reveal a mechanism asymmetry between attack and defense. While Odds Ratio Preference Optimization (ORPO) is most effective for misalignment, Direct Preference Optimization (DPO) excels in realignment, albeit at the expense of model utility. Additionally, we identify model-specific resistance, residual effects of multi-round adversarial dynamics, and other noteworthy findings. These findings highlight the need for robust safeguards and customized safety alignment strategies to mitigate potential risks in the deployment of LLMs. Our code is available at https://github.com/zhangrui4041/The-Art-of-Mis-alignment.

Key Contributions

Comparative evaluation of 6 fine-tuning methods (4 SFT, 2 PFT) for both misalignment attacks and realignment defenses across 4 safety-aligned LLMs
Discovery of mechanism asymmetry: ORPO most effective for misalignment, DPO excels at realignment
Identification of model-specific resistance patterns and residual effects in multi-round adversarial dynamics
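
For readers unfamiliar with the two preference fine-tuning objectives compared here, the following is a minimal sketch of the standard per-example DPO and ORPO losses as they are commonly formulated, not the paper's own implementation. The function names, the default values of `beta` and `lam`, and the scalar log-probability inputs are illustrative assumptions. It is also an assumption of this sketch that only the direction of the preference pair changes between the two settings: for misalignment the "chosen" response would be the harmful completion and the "rejected" one the refusal, and for realignment the roles are swapped.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are summed sequence log-probabilities under the trained policy
    and under a frozen reference model. beta=0.1 is an illustrative default.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return -log_sigmoid(beta * margin)


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_avg_logp: float, rejected_avg_logp: float,
              lam: float = 0.1) -> float:
    """Standard ORPO loss for one preference pair.

    Inputs are length-normalized (per-token average) log-probabilities.
    No reference model is needed: the loss is the SFT negative
    log-likelihood on the chosen response plus a weighted odds-ratio term.
    lam=0.1 is an illustrative default.
    """
    sft_term = -chosen_avg_logp
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    return sft_term - lam * log_sigmoid(log_odds_ratio)
```

One structural difference visible in the sketch is that DPO needs a frozen reference model while ORPO folds the preference term into an SFT-style loss. Whether that difference explains the asymmetry reported above is not something this sketch can establish.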

🛡️ Threat Analysis

Transfer Learning Attack

The core focus is on exploiting fine-tuning to undermine safety alignment (misalignment) and to restore it (realignment). This directly addresses transfer learning attack and defense dynamics, in which adversaries exploit fine-tuning to embed malicious behavior that survives or evades safety mechanisms.

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

training_time

Datasets

MisQA

Applications

llm safety alignment, model supply chain security

Similar Papers

Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Objective Matters: Fine-Tuning Objectives Shape Safety, Robustness, and Persona Drift

Safeguarding LLM Fine-tuning via Push-Pull Distributional Alignment

Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint

LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion

Token-level Data Selection for Safe LLM Fine-tuning