
A Granular Study of Safety Pretraining under Model Abliteration

Shashank Agnihotri 1,2, Jonas Jakubassa 1,2, Priyam Dey 3, Sachin Goyal 4, Bernt Schiele 2, Venkatesh Babu Radhakrishnan, Margret Keuper 1,2

2 citations · 22 references · arXiv


Published on arXiv: 2510.02768

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Refusal-only safety training is the most fragile intervention under abliteration, while data-centric pretraining combining safe-data filtering, rephrasing, and metatags confers partial robustness to activation-level safety bypasses.

model abliteration

Novel technique introduced


Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as **Refusal** or **Non-Refusal** using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: https://github.com/shashankskagnihotri/safety_pretraining.
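The abstract describes model abliteration as a lightweight projection that removes a refusal-sensitive direction from the model's activations. A minimal sketch of the core idea follows, using toy NumPy data rather than real model states; the variable names, dimensions, and the difference-of-means estimate of the refusal direction are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

# Toy stand-ins for residual-stream activations collected on harmful
# vs. harmless prompts (shapes and data are illustrative assumptions).
rng = np.random.default_rng(0)
d_model = 64
acts_harmful = rng.normal(size=(50, d_model)) + 0.5
acts_harmless = rng.normal(size=(50, d_model))

# 1. Estimate a "refusal direction" as the difference of mean activations.
refusal_dir = acts_harmful.mean(axis=0) - acts_harmless.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

# 2. Build the projection that removes that direction: P = I - r r^T.
proj = np.eye(d_model) - np.outer(refusal_dir, refusal_dir)

# 3. Applying P to a hidden state zeroes its component along r,
#    leaving all orthogonal components untouched.
h = rng.normal(size=d_model)
h_abliterated = proj @ h
```

In practice the same projection can be folded into the model's weight matrices once, which is why the edit is cheap enough to apply at inference time to any open-weight checkpoint.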


Key Contributions

  • Granular robustness evaluation of seven Safety Pretraining checkpoints and three open-weight LLM baselines against model abliteration, producing 20 original/abliterated model pairs
  • Evaluation protocol combining human annotations with scalable LLM-based judging, selecting judges by correlation to human labels across a controlled subset
  • Empirical finding that refusal-only safety interventions are most fragile to abliteration, while combined filtering + rephrasing + metatag pretraining yields partial robustness; includes a self-judgment probe for deployment-time monitoring
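The second contribution selects LLM judges by their agreement with human labels on a controlled subset. A minimal sketch of that selection step, assuming binary Refusal/Non-Refusal labels; the judge names and label vectors here are made up for illustration.

```python
import numpy as np

# Hypothetical labels on a small human-annotated subset:
# 1 = Refusal, 0 = Non-Refusal (judge names are illustrative).
human = np.array([1, 1, 0, 0, 1, 0, 1, 0, 1, 0])
judges = {
    "judge_a": np.array([1, 1, 0, 0, 1, 0, 1, 1, 1, 0]),
    "judge_b": np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1]),
}

def agreement(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of prompts on which two label vectors agree."""
    return float(np.mean(a == b))

# Score every candidate judge against the human labels and keep the best.
scores = {name: agreement(human, preds) for name, preds in judges.items()}
best_judge = max(scores, key=scores.get)
print(best_judge, scores[best_judge])  # judge_a agrees on 9/10 prompts
```

Simple agreement is used here for clarity; a correlation measure on the same subset would rank these toy judges the same way.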

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
custom 100-prompt evaluation set (50 harmful, 50 harmless); SmolLM2-1.7B Safety Pretraining checkpoints
Applications
large language model safety; open-weight llm deployment