Defense · 2025

Feature-Guided SAE Steering for Refusal-Rate Control using Contrasting Prompts

Samaksh Bhargav 1, Zining Zhu 2

0 citations · 30 references · ICDMW

Published on arXiv: 2511.00029

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

SAE steering with principled feature selection achieves 18.9% improvement in safety performance and 11.1% improvement in utility simultaneously on Llama-3 8B, overcoming the traditional safety-utility tradeoff.

Feature-Guided SAE Steering

Novel technique introduced


Deploying a Large Language Model (LLM) requires guiding it to recognize and refuse unsafe prompts while still complying with safe ones. Previous methods for achieving this require adjusting model weights, along with other expensive procedures. While recent advances in Sparse Autoencoders (SAEs) have enabled interpretable feature extraction from LLMs, existing approaches lack systematic feature selection methods and principled evaluation of safety-utility tradeoffs. We explore varying SAE steering features and steering strengths as a solution. Using a contrasting-prompt method with the AI-Generated Prompts Dataset from teknium/OpenHermes-2p5-Mistral-7B and the AIR-Bench EU dataset to efficiently choose the best features in the model to steer, we test this method on Llama-3 8B. With this method, our approach achieves an 18.9% improvement in safety performance while simultaneously increasing utility by 11.1%, demonstrating that targeted SAE steering can overcome the traditional safety-utility tradeoff when optimal features are identified through principled selection.
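The steering operation the abstract describes can be sketched minimally: the decoder direction of a selected SAE feature is added to the model's residual-stream activation, scaled by a steering strength. The function name, shapes, and values below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def steer_activation(resid, decoder_dir, strength):
    """Add a scaled SAE feature direction to a residual-stream activation.

    resid:       (d_model,) activation vector at the steered layer
    decoder_dir: (d_model,) decoder column of the chosen SAE feature
    strength:    scalar coefficient; sign controls whether the feature
                 is amplified (e.g. toward refusal) or suppressed
    """
    return resid + strength * decoder_dir

# Toy example with d_model = 4 and a hypothetical feature direction
rng = np.random.default_rng(0)
resid = rng.normal(size=4)
direction = np.array([1.0, 0.0, 0.0, 0.0])
steered = steer_activation(resid, direction, strength=2.0)
```

In practice such a function would run inside a forward hook at the chosen layer; sweeping `strength` is what produces the safety-utility curves evaluated in the paper.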


Key Contributions

  • Contrasting prompt methodology that pairs harmful and harmless prompts to identify SAE features with differential activations correlated with safety behavior
  • Composite scoring function for systematic ranking and selection of optimal SAE steering features from thousands of candidates per layer
  • Principled evaluation framework for safety-utility tradeoffs across varying steering strengths, achieving 18.9% safety improvement and 11.1% utility improvement on Llama-3 8B
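The contrasting-prompt selection idea above can be sketched as follows, assuming per-feature SAE activations have already been collected for harmful and harmless prompt sets. The composite score shown (mean activation difference weighted by how consistently the feature fires more on harmful prompts) is an illustrative stand-in for the paper's scoring function:

```python
import numpy as np

def rank_features(acts_harmful, acts_harmless, top_k=5):
    """Rank SAE features by differential activation on contrasting prompts.

    acts_harmful:  (n_prompts, n_features) SAE activations on harmful prompts
    acts_harmless: (n_prompts, n_features) SAE activations on harmless prompts
    Returns indices of the top_k features by composite score.
    """
    # Mean activation gap between the two prompt sets, per feature
    diff = acts_harmful.mean(axis=0) - acts_harmless.mean(axis=0)
    # Consistency: fraction of prompt pairs where the harmful activation
    # exceeds the harmless one (an illustrative choice of weighting)
    consistency = (acts_harmful > acts_harmless).mean(axis=0)
    score = diff * consistency
    return np.argsort(score)[::-1][:top_k]

# Toy data: feature 2 fires strongly only on harmful prompts
harmful = np.zeros((3, 4)); harmful[:, 2] = 5.0
harmless = np.zeros((3, 4))
top = rank_features(harmful, harmless, top_k=1)
```

A real pipeline would score thousands of candidate features per layer this way and then sweep steering strengths only over the top-ranked ones.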

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
OpenHermes-2p5-Mistral-7B, AIR-Bench EU dataset
Applications
LLM safety alignment, refusal rate control, harmful prompt rejection