A Granular Study of Safety Pretraining under Model Abliteration
Shashank Agnihotri 1,2, Jonas Jakubassa 1,2, Priyam Dey 3, Sachin Goyal 4, Bernt Schiele 2, Venkatesh Babu Radhakrishnan, Margret Keuper 1,2
Published on arXiv: 2510.02768
Tags: Prompt Injection · OWASP LLM Top 10 (LLM01)
Key Finding
Refusal-only safety training is the most fragile intervention under abliteration, while data-centric pretraining combining safe-data filtering, rephrasing, and metatags confers partial robustness to activation-level safety bypasses.
Novel technique introduced: model abliteration
Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as **Refusal** or **Non-Refusal** using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: https://github.com/shashankskagnihotri/safety_pretraining.
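The projection edit the abstract describes can be sketched as removing a single "refusal direction" from hidden activations. The sketch below is an illustrative reconstruction, not the paper's exact implementation: it assumes the direction is estimated as the normalized difference of mean activations on harmful vs. harmless prompts (a common difference-of-means approach), and the array shapes and function names are hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Estimate a refusal-sensitive direction as the normalized
    difference of mean activations (difference-of-means probe).
    Inputs: (n_prompts, d_model) activation matrices."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def abliterate(x: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Project activations x onto the subspace orthogonal to unit
    vector d, i.e. x - (x . d) d, removing the refusal component."""
    return x - np.outer(x @ d, d)

# Toy demo: synthetic activations with an injected refusal component.
rng = np.random.default_rng(0)
harmless = rng.normal(size=(50, 8))
injected = np.zeros(8); injected[0] = 1.0
harmful = rng.normal(size=(50, 8)) + 3.0 * injected

d = refusal_direction(harmful, harmless)
edited = abliterate(harmful, d)
print(np.abs(edited @ d).max())  # ~0: component along d is gone
```

Because the edit is a rank-one projection, it is cheap to apply at inference time, which is what makes it a practical threat model for open-weight models.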
Key Contributions
- Granular robustness evaluation of seven Safety Pretraining checkpoints and three open-weight LLM baselines against model abliteration, producing 10 original/abliterated model pairs (20 systems in total)
- Evaluation protocol combining human annotations with scalable LLM-based judging, selecting judges by correlation to human labels across a controlled subset
- Empirical finding that refusal-only safety interventions are most fragile to abliteration, while combined filtering + rephrasing + metatag pretraining yields partial robustness; includes a self-judgment probe for deployment-time monitoring
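The judge-selection step of the protocol (second bullet above) can be sketched as choosing, among candidate LLM judges, the one whose Refusal/Non-Refusal labels agree best with human annotations on the shared labeled subset. A minimal sketch, assuming binary labels and Pearson correlation (the phi coefficient on 0/1 data) as the agreement measure; the judge names and data are hypothetical:

```python
import numpy as np

def select_judge(human: np.ndarray, judge_labels: dict) -> str:
    """Return the judge whose binary labels (1 = Refusal) correlate
    best with human labels on the annotated validation subset."""
    def phi(a, b):
        # Pearson correlation on 0/1 labels equals the phi coefficient.
        return np.corrcoef(a, b)[0, 1]
    return max(judge_labels, key=lambda name: phi(human, judge_labels[name]))

# Hypothetical judges scored on a 10-prompt human-labeled subset.
human = np.array([1, 1, 0, 0, 1, 0, 1, 0, 1, 0])
judges = {
    "judge_a": np.array([1, 1, 0, 0, 1, 0, 1, 0, 1, 1]),  # 1 disagreement
    "judge_b": np.array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0]),  # 3 disagreements
}
print(select_judge(human, judges))  # judge_a
```

Selecting the judge on a human-validated subset matters because, as the study quantifies, judge choice alone can shift the measured refusal rates.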