Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation
Published on arXiv: 2512.13655
Key Finding
Mathematical reasoning is the capability most sensitive to abliteration, with GSM8K scores varying by up to 20.3 pp across tools and models; single-pass methods (DECCP, ErisForge) preserve capabilities better than Bayesian-optimized abliteration.
Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B–14B parameters), reporting tool compatibility on all sixteen models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (average GSM8K change across three models: ErisForge −0.28 pp; DECCP −0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043–1.646) with model-dependent capability impact. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures. The principal finding is that mathematical reasoning is the capability most sensitive to abliteration, with GSM8K changes ranging from +1.51 pp to −18.81 pp (−26.5% relative) depending on tool selection and model architecture.
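The "directional orthogonalization" the abstract refers to is commonly implemented by projecting an estimated refusal direction out of a model's weight matrices. The paper does not give its implementation details; the following is a minimal NumPy sketch of the rank-1 projection step, with the matrix shapes and the function name being illustrative assumptions, not the tools' actual APIs.

```python
import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of W's outputs that lies along direction r.

    Computes W' = (I - r r^T) W, so for any input x the output W' @ x
    has (numerically) zero projection onto the refusal direction r.
    Shapes and setup here are a toy illustration, not a real model.
    """
    r = r / np.linalg.norm(r)           # unit refusal direction
    return W - np.outer(r, r) @ W       # subtract the rank-1 projection

# toy demonstration on a random 8x8 "weight matrix"
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
r = rng.standard_normal(8)
W_abl = ablate_direction(W, r)

x = rng.standard_normal(8)
r_hat = r / np.linalg.norm(r)
print(abs(r_hat @ (W_abl @ x)))         # ~0: no output along r remains
```

In practice this projection would be applied to the residual-stream write matrices (e.g. attention output and MLP down-projections) of each targeted layer, with r estimated from activation differences between harmful and harmless prompts.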
Key Contributions
- Systematic cross-architecture compatibility and quantitative evaluation of four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across 16 instruction-tuned LLMs (7B–14B parameters)
- Characterization of single-pass vs. Bayesian-optimized abliteration tradeoffs: single-pass methods preserve capabilities better (GSM8K Δ: ErisForge −0.28 pp, DECCP −0.13 pp), while the optimization-based Heretic produces variable distribution shift (KL divergence: 0.043–1.646)
- Identification of mathematical reasoning (GSM8K) as the capability most sensitive to abliteration, with score changes ranging from +1.51 pp to −18.81 pp (−26.5% relative) across tool-model combinations
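The KL-divergence figures above quantify how far the abliterated model's output distribution drifts from the baseline. The paper does not specify its exact computation; one common recipe, sketched here under that assumption, is KL(P‖Q) between baseline and abliterated next-token distributions obtained by softmaxing the two models' logits.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a logit vector."""
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def kl_divergence(base_logits: np.ndarray, abl_logits: np.ndarray) -> float:
    """KL(P || Q) in nats, P = baseline, Q = abliterated model."""
    p, q = softmax(base_logits), softmax(abl_logits)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# identical distributions give zero divergence; any shift is positive
base = np.array([2.0, 1.0, 0.5, -1.0])
shifted = base + np.array([0.0, 0.5, -0.5, 0.0])
print(kl_divergence(base, base))      # 0.0
print(kl_divergence(base, shifted))   # > 0: distribution shift detected
```

In an evaluation pipeline, such per-position divergences would be averaged over a corpus of held-out prompts to produce a single distribution-shift score per tool-model pair.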