Defense · 2025

Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection

Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, George Turkiyyah, Bernard Ghanem



Published on arXiv (arXiv:2508.20766)

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

ROSI consistently increases safety refusal rates as measured by Llama Guard 3 while preserving model utility, and can re-align uncensored models without retraining

ROSI (Rank-One Safety Injection)

Novel technique introduced


Safety alignment in Large Language Models (LLMs) often involves mediating internal representations to refuse harmful requests. Recent research has demonstrated that these safety mechanisms can be bypassed by ablating or removing specific representational directions within the model. In this paper, we propose the opposite approach: Rank-One Safety Injection (ROSI), a white-box method that amplifies a model's safety alignment by permanently steering its activations toward the refusal-mediating subspace. ROSI operates as a simple, fine-tuning-free rank-one weight modification applied to all residual stream write matrices. The required safety direction can be computed from a small set of harmful and harmless instruction pairs. We show that ROSI consistently increases safety refusal rates - as evaluated by Llama Guard 3 - while preserving the utility of the model on standard benchmarks such as MMLU, HellaSwag, and ARC. Furthermore, we show that ROSI can also re-align 'uncensored' models by amplifying their own latent safety directions, demonstrating its utility as an effective last-mile safety procedure. Our results suggest that targeted, interpretable weight steering is a cheap and potent mechanism to improve LLM safety, complementing more resource-intensive fine-tuning paradigms.
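The abstract notes that the safety direction can be computed from a small set of harmful and harmless instruction pairs. A minimal sketch of that step, assuming the common difference-of-means construction over residual-stream activations (the array names and shapes here are illustrative, not the paper's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Stand-in activations: (num_prompts, d_model) residual-stream states
# captured at some layer for harmful vs. harmless instruction sets.
# In practice these would come from forward passes on real prompt pairs.
acts_harmful = rng.normal(size=(32, d_model)) + 1.0   # toy offset
acts_harmless = rng.normal(size=(32, d_model))

def safety_direction(harmful: np.ndarray, harmless: np.ndarray) -> np.ndarray:
    """Unit-normalized difference between the mean harmful and harmless activations."""
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    return diff / np.linalg.norm(diff)

r_hat = safety_direction(acts_harmful, acts_harmless)
assert np.isclose(np.linalg.norm(r_hat), 1.0)  # unit vector in model space
```

The resulting unit vector `r_hat` is the refusal-mediating direction that the rank-one weight update then amplifies.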


Key Contributions

  • Introduces ROSI, a fine-tuning-free rank-one weight modification that permanently amplifies safety alignment by steering activations toward the refusal-mediating subspace
  • Demonstrates ROSI improves refusal rates on aligned models while preserving general utility on MMLU, HellaSwag, and ARC benchmarks
  • Shows ROSI can re-align 'uncensored' (deliberately de-aligned) models by amplifying their latent safety directions without retraining
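The contributions above describe a rank-one update applied to every matrix that writes into the residual stream. A hedged sketch of one plausible form of such an update, W' = W + α·r̂(r̂ᵀW), which amplifies the output component along the unit safety direction; the scaling `alpha` and this exact update form are illustrative assumptions, not the paper's verified equation:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_in = 64, 16

# Assumed: a precomputed unit safety direction in the residual stream.
r_hat = rng.normal(size=d_model)
r_hat /= np.linalg.norm(r_hat)

def inject_safety(W: np.ndarray, r_hat: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Rank-one injection: boost the component of W's output along r_hat by (1 + alpha)."""
    return W + alpha * np.outer(r_hat, r_hat @ W)

# A toy residual-stream write matrix (e.g. an attention or MLP output projection).
W = rng.normal(size=(d_model, d_in))
W_new = inject_safety(W, r_hat, alpha=0.1)

# The modification is rank-one, so it is cheap and fine-tuning-free.
assert np.linalg.matrix_rank(W_new - W) == 1

# For any input, the output's component along r_hat grows by (1 + alpha),
# while components orthogonal to r_hat are untouched.
x = rng.normal(size=d_in)
assert np.allclose(r_hat @ (W_new @ x), 1.1 * (r_hat @ (W @ x)))
```

Because the update is a single outer product per matrix, applying it to all residual-stream write matrices of a model costs far less than any fine-tuning run, which matches the paper's framing of ROSI as a last-mile procedure.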

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time
Datasets
MMLU, HellaSwag, ARC
Applications
llm safety alignment, jailbreak defense, harmful content refusal