A Granular Study of Safety Pretraining under Model Abliteration
Shashank Agnihotri 1,2, Jonas Jakubassa 1,2, Priyam Dey 3, Sachin Goyal 4, Bernt Schiele 2, Venkatesh Babu Radhakrishnan, Margret Keuper 1,2
Published on arXiv: 2510.02768
Tags: Prompt Injection · OWASP LLM Top 10 (LLM01)
Key Finding
Refusal-only safety training is the most fragile intervention under abliteration, while data-centric pretraining combining safe-data filtering, rephrasing, and metatags confers partial robustness to activation-level safety bypasses.
Novel technique introduced: model abliteration
Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as **Refusal** or **Non-Refusal** using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: https://github.com/shashankskagnihotri/safety_pretraining.
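The projection edit the abstract describes can be sketched as removing a single "refusal direction" from hidden activations. The sketch below is an illustrative reconstruction, not the paper's exact implementation: it assumes the direction is estimated as the normalized difference of mean activations on harmful vs. harmless prompts (a common difference-of-means approach), and the array shapes and function names are hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Estimate a refusal-sensitive direction as the normalized
    difference of mean activations (difference-of-means probe).
    Inputs: (n_prompts, d_model) activation matrices."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def abliterate(x: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Project activations x onto the subspace orthogonal to unit
    vector d, i.e. x - (x . d) d, removing the refusal component."""
    return x - np.outer(x @ d, d)

# Toy demo: synthetic activations with an injected refusal component.
rng = np.random.default_rng(0)
harmless = rng.normal(size=(50, 8))
injected = np.zeros(8); injected[0] = 1.0
harmful = rng.normal(size=(50, 8)) + 3.0 * injected

d = refusal_direction(harmful, harmless)
edited = abliterate(harmful, d)
print(np.abs(edited @ d).max())  # ~0: component along d is gone
```

Because the edit is a rank-one projection, it is cheap to apply at inference time, which is what makes it a practical threat model for open-weight models.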
Key Contributions
- Granular robustness evaluation of seven Safety Pretraining checkpoints and three open-weight LLM baselines against model abliteration, producing 10 original/abliterated model pairs (20 systems in total)
- Evaluation protocol combining human annotations with scalable LLM-based judging, selecting judges by correlation to human labels across a controlled subset
- Empirical finding that refusal-only safety interventions are most fragile to abliteration, while combined filtering + rephrasing + metatag pretraining yields partial robustness; includes a self-judgment probe for deployment-time monitoring
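The judge-selection step of the protocol (second bullet above) can be sketched as choosing, among candidate LLM judges, the one whose Refusal/Non-Refusal labels agree best with human annotations on the shared labeled subset. A minimal sketch, assuming binary labels and Pearson correlation (the phi coefficient on 0/1 data) as the agreement measure; the judge names and data are hypothetical:

```python
import numpy as np

def select_judge(human: np.ndarray, judge_labels: dict) -> str:
    """Return the judge whose binary labels (1 = Refusal) correlate
    best with human labels on the annotated validation subset."""
    def phi(a, b):
        # Pearson correlation on 0/1 labels equals the phi coefficient.
        return np.corrcoef(a, b)[0, 1]
    return max(judge_labels, key=lambda name: phi(human, judge_labels[name]))

# Hypothetical judges scored on a 10-prompt human-labeled subset.
human = np.array([1, 1, 0, 0, 1, 0, 1, 0, 1, 0])
judges = {
    "judge_a": np.array([1, 1, 0, 0, 1, 0, 1, 0, 1, 1]),  # 1 disagreement
    "judge_b": np.array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0]),  # 3 disagreements
}
print(select_judge(human, judges))  # judge_a
```

Selecting the judge on a human-validated subset matters because, as the study quantifies, judge choice alone can shift the measured refusal rates.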