benchmark 2026

Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models

Chen Xiong 1, Zhiyuan He 1, Pin-Yu Chen 2, Ching-Yun Ko 2, Tsung-Yi Ho 1

0 citations · 30 references · arXiv (Cornell University)

α

Published on arXiv

2602.04896

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Benign activation steering vectors (e.g., for compliance or JSON formatting) increase black-box jailbreak attack success rates to over 80% by eroding the safety margin established during initial alignment.

Steering Externalities

Novel technique introduced


Activation steering is a practical post-training model alignment technique to enhance the utility of Large Language Models (LLMs). Prior to deploying a model as a service, developers can steer a pre-trained model toward specific behavioral objectives, such as compliance or instruction adherence, without the need for retraining. This process is as simple as adding a steering vector to the model's internal representations. However, this capability unintentionally introduces critical and under-explored safety risks. We identify a phenomenon termed Steering Externalities, where steering vectors derived from entirely benign datasets-such as those enforcing strict compliance or specific output formats like JSON-inadvertently erode safety guardrails. Experiments reveal that these interventions act as a force multiplier, creating new vulnerabilities to jailbreaks and increasing attack success rates to over 80% on standard benchmarks by bypassing the initial safety alignment. Ultimately, our results expose a critical blind spot in deployment: benign activation steering systematically erodes the "safety margin," rendering models more vulnerable to black-box attacks and proving that inference-time utility improvements must be rigorously audited for unintended safety externalities.


Key Contributions

  • Identifies 'Steering Externalities' — a novel phenomenon where entirely benign activation steering vectors (for compliance or output formatting) inadvertently degrade LLM safety guardrails.
  • Empirically demonstrates that benign steering acts as a force multiplier for jailbreak attacks, elevating success rates to over 80% on standard safety benchmarks.
  • Exposes a critical deployment blind spot, showing that inference-time utility improvements via activation steering must be audited for unintended safety consequences.

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
black_boxinference_time
Datasets
AdvBench
Applications
large language model deploymentllm safety alignment