benchmark 2026

Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models

Chen Xiong ¹, Zhiyuan He ¹, Pin-Yu Chen ², Ching-Yun Ko ², Tsung-Yi Ho ¹

¹ The Chinese University of Hong Kong

² IBM Research

0 citations · 30 references · arXiv (Cornell University)

Published on arXiv

2602.04896

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Benign activation steering vectors (e.g., for compliance or JSON formatting) increase black-box jailbreak attack success rates to over 80% by eroding the safety margin established during initial alignment.

Steering Externalities

Novel technique introduced

Activation steering is a practical post-training model alignment technique to enhance the utility of Large Language Models (LLMs). Prior to deploying a model as a service, developers can steer a pre-trained model toward specific behavioral objectives, such as compliance or instruction adherence, without the need for retraining. This process is as simple as adding a steering vector to the model's internal representations. However, this capability unintentionally introduces critical and under-explored safety risks. We identify a phenomenon termed Steering Externalities, where steering vectors derived from entirely benign datasets-such as those enforcing strict compliance or specific output formats like JSON-inadvertently erode safety guardrails. Experiments reveal that these interventions act as a force multiplier, creating new vulnerabilities to jailbreaks and increasing attack success rates to over 80% on standard benchmarks by bypassing the initial safety alignment. Ultimately, our results expose a critical blind spot in deployment: benign activation steering systematically erodes the "safety margin," rendering models more vulnerable to black-box attacks and proving that inference-time utility improvements must be rigorously audited for unintended safety externalities.

Key Contributions

Identifies 'Steering Externalities' — a novel phenomenon where entirely benign activation steering vectors (for compliance or output formatting) inadvertently degrade LLM safety guardrails.
Empirically demonstrates that benign steering acts as a force multiplier for jailbreak attacks, elevating success rates to over 80% on standard safety benchmarks.
Exposes a critical deployment blind spot, showing that inference-time utility improvements via activation steering must be audited for unintended safety consequences.

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

black_boxinference_time

Datasets

AdvBench

Applications

large language model deploymentllm safety alignment

Read PDF arXiv DOI

Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Many-Turn Jailbreaking

ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

Security Assessment and Mitigation Strategies for Large Language Models: A Comprehensive Defensive Framework

The Vulnerability of LLM Rankers to Prompt Injection Attacks

Time-To-Inconsistency: A Survival Analysis of Large Language Model Robustness to Adversarial Attacks

Say It Differently: Linguistic Styles as Jailbreak Vectors

Prompt Injection as an Emerging Threat: Evaluating the Resilience of Large Language Models

The Anatomy of Conversational Scams: A Topic-Based Red Teaming Analysis of Multi-Turn Interactions in LLMs