
On the Failure of Topic-Matched Contrast Baselines in Multi-Directional Refusal Abliteration

Valentin Petrov


Published on arXiv

2603.22061

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Topic-matched contrast produces zero functional refusal directions at any weight level on any layer, while unmatched contrast achieves complete refusal elimination on six layers

Topic-Matched Contrast Abliteration

Novel technique introduced


Removing refusal behavior from instruction-tuned language models by directional abliteration requires extracting refusal-mediating directions from the residual-stream activation space. The construction of the contrast baseline against which harmful-prompt activations are compared has so far been treated in the literature as an implementation detail rather than a methodological choice. This work investigates whether a topically matched contrast baseline yields superior refusal directions. Experiments on the Qwen 3.5 2B model use per-category matched prompt pairs, per-class Self-Organizing Map (SOM) extraction, and Singular Value Decomposition (SVD) orthogonalization. Topic-matched contrast produces no functional refusal directions at any tested weight level on any tested layer, whereas unmatched contrast, with the same model, extraction code, and evaluation protocol, achieves complete refusal elimination on six layers. A geometric analysis of the failure shows that topic-matched subtraction cancels the dominant activation component shared between harmful and harmless prompts on the same subject, reducing the extracted direction's magnitude below the threshold at which weight-matrix projection perturbs the residual stream. Implications for the design of contrast baselines in abliteration research are discussed.
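The abliteration pipeline described above can be sketched in a few lines. The following is a minimal illustrative NumPy example, not the paper's code: it extracts a direction as the difference of mean activations between synthetic "harmful" and "harmless" populations, then projects that direction out of a weight matrix so the layer can no longer write along it. All names, dimensions, and scales are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # residual-stream width (illustrative)

# Synthetic activations: harmful prompts carry an extra refusal component.
refusal = rng.normal(size=d)
refusal /= np.linalg.norm(refusal)
harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 3.0 * refusal

# Difference-of-means refusal direction, normalized to unit length.
direction = harmful.mean(axis=0) - harmless.mean(axis=0)
direction /= np.linalg.norm(direction)

# Abliterate a weight matrix: W <- W - r r^T W removes the component of
# every output along the refusal direction r.
W = rng.normal(size=(d, d))
W_abl = W - np.outer(direction, direction) @ W

# Any output of the abliterated matrix is orthogonal to the direction.
x = rng.normal(size=d)
residual = abs(direction @ (W_abl @ x))
print(residual)  # numerically ~0
```

The paper's per-class SOM extraction and SVD orthogonalization replace the plain difference-of-means step here; the weight-matrix projection at the end is the part whose effect vanishes when the extracted direction's magnitude is too small.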


Key Contributions

  • Demonstrates that topic-matched contrast baselines fail to extract functional refusal directions for abliteration
  • Geometric analysis showing topic-matched subtraction cancels dominant shared activation components between harmful/harmless prompts
  • Per-class SOM extraction with SVD orthogonalization for refusal direction optimization
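The cancellation geometry in the second contribution can be illustrated with a toy simulation. This is an assumed activation model, not the paper's measured data: each harmful prompt carries a dominant harm-topic component plus a small refusal component; its topic-matched harmless counterpart shares the topic component but not the refusal one, so matched subtraction cancels the dominant term and leaves only a small residual direction.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 64, 200  # illustrative width and prompt count

# Assumed geometry (hypothetical scales): dominant shared harm-topic
# component, small refusal component, per-pair subject content, noise.
harm_topic = rng.normal(size=d)
harm_topic /= np.linalg.norm(harm_topic)
refusal = rng.normal(size=d)
refusal /= np.linalg.norm(refusal)

per_pair_topics = rng.normal(size=(n, d))  # subject content shared within a pair
harmful = (5.0 * harm_topic + per_pair_topics
           + 0.5 * refusal + 0.1 * rng.normal(size=(n, d)))
matched = 5.0 * harm_topic + per_pair_topics + 0.1 * rng.normal(size=(n, d))
unmatched = rng.normal(size=(n, d))  # generic harmless prompts, no harm topic

# Matched subtraction cancels the shared components exactly, leaving a
# small direction; unmatched subtraction retains the dominant component.
matched_dir = (harmful - matched).mean(axis=0)
unmatched_dir = harmful.mean(axis=0) - unmatched.mean(axis=0)

print(np.linalg.norm(matched_dir), np.linalg.norm(unmatched_dir))
```

Under these assumptions the matched direction's norm is an order of magnitude smaller than the unmatched one, consistent with the paper's finding that it falls below the threshold at which weight-matrix projection perturbs the residual stream.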

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, white_box
Datasets
harmless_alpaca, Qwen 3.5 2B
Applications
llm safety alignment removal, refusal behavior elimination