benchmark 2026

Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

Md Rysul Kabir , Zoran Tiganj

Indiana University Bloomington

0 citations

Published on arXiv

2604.18510

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

All three jailbreak methods achieve near-ceiling harmful compliance, but they differ in side effects: RLVR preserves explicit harm recognition and responds to reflective safety scaffolds, while SFT causes the largest capability degradation and a collapse in safety judgments

Open-weight language models can be rendered unsafe through several distinct interventions, but the resulting models may differ substantially in capabilities, behavioral profile, and internal failure mode. We study behavioral and mechanistic properties of jailbroken models across three unsafe routes: harmful supervised fine-tuning (SFT), harmful reinforcement learning with verifiable rewards (RLVR), and refusal-suppressing abliteration. All three routes achieve near-ceiling harmful compliance, but they diverge once we move beyond direct harmfulness. RLVR-jailbroken models show minimal degradation and preserve explicit harm recognition in a structured self-audit: they are able to identify harmful prompts and describe how a safe LLM should respond, yet they comply with the harmful request. With RLVR, harmful behavior is strongly suppressed by a reflective safety scaffold: when a harmful prompt is prepended with an instruction to reflect on safety standards, harmful behavior drops close to the baseline. Category-specific RLVR jailbreaks generalize broadly across harmfulness domains. Models jailbroken with SFT show the largest collapse in explicit safety judgments, the highest behavioral drift, and a substantial capability loss on standard benchmarks. Abliteration is family-dependent in both self-audit and response to a reflective safety scaffold. Mechanistic and repair analyses further separate the routes: abliteration is consistent with localized refusal-feature deletion, RLVR with preserved safety geometry but retargeted policy behavior, and SFT with broader distributed drift. Targeted repair partially recovers RLVR-jailbroken models, but has little effect on SFT-jailbroken models. Together, these results show that jailbreaks can produce vastly different properties despite similar harmfulness, with models jailbroken via RLVR showing remarkable similarity to the base model.
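The reflective safety scaffold described above amounts to prepending a reflection instruction to each prompt and comparing harmful-compliance rates with and without it. The following is a minimal sketch under stated assumptions, not the authors' implementation: the scaffold wording, the function names, and the boolean judge labels are all hypothetical, since this summary does not give the exact instruction or the scoring procedure.

```python
from typing import Sequence

# Hypothetical scaffold text; the exact instruction used in the paper
# is not given in this summary.
REFLECTIVE_SCAFFOLD = (
    "Before answering, reflect on whether the request below is consistent "
    "with safety standards, then respond accordingly."
)


def apply_reflective_scaffold(prompt: str, scaffold: str = REFLECTIVE_SCAFFOLD) -> str:
    """Prepend the reflection instruction to a prompt, separated by a blank line."""
    return f"{scaffold}\n\n{prompt}"


def compliance_rate(complied: Sequence[bool]) -> float:
    """Fraction of responses labeled as harmful compliance (True = complied).

    The labels are assumed to come from an external judge; none is included here.
    """
    if not complied:
        return 0.0
    return sum(complied) / len(complied)


def scaffold_effect(without: Sequence[bool], with_scaffold: Sequence[bool]) -> float:
    """Drop in compliance rate attributable to the scaffold (positive = fewer compliances)."""
    return compliance_rate(without) - compliance_rate(with_scaffold)
```

In this framing, a route whose harmful behavior is "strongly suppressed" by the scaffold would show a large positive `scaffold_effect`, while a route that ignores the scaffold would show a value near zero.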

Key Contributions

Systematic comparison of three LLM jailbreak routes (harmful SFT, harmful RLVR, and abliteration), revealing distinct behavioral and mechanistic failure modes despite similar harmful compliance rates
Discovery that RLVR-jailbroken models preserve internal harm recognition and that their harmful behavior can be suppressed by reflective safety scaffolds, unlike SFT-jailbroken models, which show distributed representational drift
Mechanistic taxonomy characterizing abliteration as localized feature deletion, RLVR as preserved safety geometry with a retargeted policy, and SFT as broad distributed drift, each with a different repairability profile
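One simple way to picture the difference between a localized edit and broad distributed drift is to compare how much each layer's weights moved relative to the base model. The sketch below is an illustration only and is not the analysis used in the paper: the per-layer relative L2 change, the concentration score, and the representation of each layer as a flat list of floats are all assumptions made for the example.

```python
import math
from typing import Dict, List


def l2_norm(values: List[float]) -> float:
    """Euclidean norm of a flat list of weights."""
    return math.sqrt(sum(v * v for v in values))


def per_layer_drift(
    base: Dict[str, List[float]], modified: Dict[str, List[float]]
) -> Dict[str, float]:
    """Relative L2 change of each layer's weights between base and modified models."""
    drift = {}
    for name, base_weights in base.items():
        diff = [m - b for b, m in zip(base_weights, modified[name])]
        denom = l2_norm(base_weights)
        drift[name] = l2_norm(diff) / denom if denom > 0 else 0.0
    return drift


def drift_concentration(drift: Dict[str, float]) -> float:
    """Share of total drift carried by the most-changed layer.

    Values near 1.0 suggest a localized edit; values near 1/len(drift)
    suggest change spread evenly across layers.
    """
    total = sum(drift.values())
    return max(drift.values()) / total if total > 0 else 0.0
```

For example, a toy model in which only one of two layers changes yields a concentration of 1.0, while equal change in both layers yields 0.5. Weight-space drift is only a rough proxy; the summary describes the paper's conclusions in terms of refusal features and safety geometry, which concern activations rather than raw weights.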

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

training_time, inference_time

Datasets

AdvBench, HarmBench, JailbreakBench, StrongReject

Applications

llm safety alignment, harmful content generation

Similar Papers

Understanding the Effects of Safety Unalignment on Large Language Models

LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

SecureBreak -- A dataset towards safe and secure models

A Granular Study of Safety Pretraining under Model Abliteration

False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs

When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift