Attack · 2025

SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models

Giorgio Piras 1, Raffaele Mura 1, Fabio Brau 1, Luca Oneto 2, Fabio Roli 2, Battista Biggio 1

3 citations · 28 references · arXiv


Published on arXiv: 2511.08379

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Ablating multiple SOM-derived refusal directions from LLM internals achieves a higher Attack Success Rate than both single-direction baselines and purpose-built jailbreak algorithms.
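The core operation here is directional ablation: removing the component of an activation along a refusal direction, applied to several directions at once. A minimal sketch of that projection step (the function name and the sequential-projection formulation are illustrative, not taken from the paper):

```python
import numpy as np

def ablate_directions(x, directions):
    """Remove each refusal direction from activation x by projecting it out.

    x: (d,) activation vector; directions: iterable of (d,) vectors.
    Sequentially subtracts the component of x along each unit direction:
    x <- x - (x . r_hat) r_hat.
    """
    for r in directions:
        r_hat = r / np.linalg.norm(r)       # normalize the refusal direction
        x = x - np.dot(x, r_hat) * r_hat    # project out that component
    return x
```

With mutually orthogonal directions the result is orthogonal to all of them; for non-orthogonal direction sets, later projections can partially reintroduce earlier components, which is one reason selecting *which* directions to ablate matters.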

Multi-Directional (MD) Refusal Suppression via SOM

Novel technique introduced


Refusal refers to the functional behavior enabling safety-aligned language models to reject harmful or unethical prompts. Following the growing scientific interest in mechanistic interpretability, recent work encoded refusal behavior as a single direction in the model's latent space; e.g., computed as the difference between the centroids of harmful and harmless prompt representations. However, emerging evidence suggests that concepts in LLMs often appear to be encoded as a low-dimensional manifold embedded in the high-dimensional latent space. Motivated by these findings, we propose a novel method leveraging Self-Organizing Maps (SOMs) to extract multiple refusal directions. To this end, we first prove that SOMs generalize the prior work's difference-in-means technique. We then train SOMs on harmful prompt representations to identify multiple neurons. By subtracting the centroid of harmless representations from each neuron, we derive a set of multiple directions expressing the refusal concept. We validate our method on an extensive experimental setup, demonstrating that ablating multiple directions from models' internals outperforms not only the single-direction baseline but also specialized jailbreak algorithms, leading to an effective suppression of refusal. Finally, we conclude by analyzing the mechanistic implications of our approach.
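The extraction pipeline the abstract describes has two steps: train a SOM on harmful prompt representations to obtain multiple neurons, then subtract the harmless centroid from each neuron to get a set of refusal directions. A simplified sketch of both steps, assuming a 1-D SOM grid with a linearly decaying learning rate and Gaussian neighborhood (the paper's actual grid size, schedules, and layer choice are not specified here):

```python
import numpy as np

def train_som(harmful_reps, n_neurons=4, epochs=50, lr=0.5, seed=0):
    """Minimal 1-D SOM over harmful prompt representations.

    harmful_reps: (n, d) array of latent representations.
    Returns (n_neurons, d) neuron weights: prototypes of harmful activations.
    """
    rng = np.random.default_rng(seed)
    # Initialize neurons from random harmful samples
    neurons = harmful_reps[rng.choice(len(harmful_reps), n_neurons)].copy()
    for epoch in range(epochs):
        sigma = max(1.0 * (1 - epoch / epochs), 0.1)   # shrinking neighborhood
        eta = lr * (1 - epoch / epochs) + 0.01         # decaying learning rate
        for x in harmful_reps[rng.permutation(len(harmful_reps))]:
            bmu = np.argmin(np.linalg.norm(neurons - x, axis=1))  # best-matching unit
            grid_dist = np.abs(np.arange(n_neurons) - bmu)        # distance on the 1-D grid
            h = np.exp(-grid_dist**2 / (2 * sigma**2))            # Gaussian neighborhood
            neurons += eta * h[:, None] * (x - neurons)           # pull neurons toward x
    return neurons

def refusal_directions(neurons, harmless_reps):
    """One refusal direction per SOM neuron: neuron minus harmless centroid."""
    mu_harmless = harmless_reps.mean(axis=0)
    return neurons - mu_harmless
```

Note that with `n_neurons=1` the single neuron converges toward the harmful centroid, so `refusal_directions` reduces to the difference-in-means direction of prior work, matching the generalization result the paper proves.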


Key Contributions

  • Proves that a single-neuron SOM generalizes the standard difference-in-means refusal direction, providing a theoretical foundation for multi-directional extension
  • Proposes Multi-Directional (MD) refusal suppression: trains SOMs on harmful prompt representations to extract multiple refusal directions, then uses Bayesian Optimization to select which directions to ablate
  • Demonstrates that ablating multiple SOM-derived directions outperforms both the single-direction baseline and dedicated jailbreak algorithms in Attack Success Rate across extensive experiments on models including Llama2-7B
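The second contribution includes a selection step: deciding which of the extracted directions to ablate. The paper uses Bayesian Optimization for this; as an illustrative stand-in, the sketch below scores every small subset exhaustively, with `score_fn` standing for a hypothetical proxy of post-ablation attack success rate:

```python
import itertools
import numpy as np

def best_subset(directions, score_fn, max_k=3):
    """Exhaustive stand-in for the paper's Bayesian Optimization step.

    Scores every subset of candidate directions up to size max_k and
    returns (indices, score) of the best one. score_fn is a user-supplied
    proxy for attack success rate after ablating that subset.
    """
    best, best_score = (), -np.inf
    for k in range(1, max_k + 1):
        for subset in itertools.combinations(range(len(directions)), k):
            s = score_fn([directions[i] for i in subset])
            if s > best_score:
                best, best_score = subset, s
    return best, best_score
```

Exhaustive search is only viable for a handful of directions; Bayesian Optimization trades that exactness for scalability, since each subset evaluation requires running the model on a prompt set.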

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time
Datasets
AdvBench
Applications
safety-aligned llms, instruction-following models, chatbots