attack 2026

Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs

Krishiv Agarwal 1,2, Ramneet Kaur 2, Colin Samplawski 2, Manoj Acharya 2, Anirban Roy 2, Daniel Elenius 2, Brian Matejek 2, Adam D. Cobb 2, Susmit Jha 2

0 citations

Published on arXiv

2604.20945

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves up to 91% jailbreak success rate (Universal Steering) and 83% (RepE) on Llama-3.3-70B-4bt, while GPT-oss-120B remains robust

Novel technique introduced

Interpretability-based activation steering jailbreak


Effective safety auditing of large language models (LLMs) demands tools that go beyond black-box probing and systematically uncover vulnerabilities rooted in model internals. We present a comprehensive, interpretability-driven jailbreaking audit of eight SOTA open-source LLMs: Llama-3.1-8B, Llama-3.3-70B-4bt, GPT-oss-20B, GPT-oss-120B, Qwen3-0.6B, Qwen3-32B, Phi4-3.8B, and Phi4-14B. Leveraging interpretability-based approaches, namely Universal Steering (US) and Representation Engineering (RepE), we introduce an adaptive two-stage grid search algorithm to identify optimal activation-steering coefficients for unsafe behavioral concepts. Our evaluation, conducted on a curated set of harmful queries and a standardized LLM-based judging protocol, reveals stark contrasts in model robustness. The Llama-3 models are highly vulnerable, with up to 91% (US) and 83% (RepE) jailbroken responses on Llama-3.3-70B-4bt, while GPT-oss-120B remains robust to attacks via both interpretability approaches. Qwen and Phi models show mixed results, with the smaller Qwen3-0.6B and Phi4-3.8B mostly exhibiting lower jailbreaking rates, while their larger counterparts are more susceptible. Our results establish interpretability-based steering as a powerful tool for systematic safety audits, but also highlight its dual-use risks and the need for better internal defenses in LLM deployment.
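To make the role of an activation-steering coefficient concrete, here is a minimal sketch of the generic steering arithmetic: a hidden state is shifted along a unit-length concept direction, scaled by a coefficient. The abstract does not specify how US or RepE extract the direction or at which layer it is applied, so the toy vectors and the helper names `normalize`, `steer`, and `projection` are illustrative assumptions, not the authors' implementation.

```python
import math


def normalize(v):
    """Return v rescaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def projection(hidden, direction):
    """Scalar projection of `hidden` onto the unit concept direction."""
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    unit = normalize(direction)
    return sum(h * d for h, d in zip(hidden, unit))


def steer(hidden, direction, coeff):
    """Shift a hidden state along a concept direction: h' = h + coeff * d_hat.

    `coeff` is the steering coefficient; a search procedure would tune it.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    unit = normalize(direction)
    return [h + coeff * d for h, d in zip(hidden, unit)]


# Toy values (hypothetical): a 3-dimensional "hidden state" and "concept direction".
hidden = [0.5, -1.0, 2.0]
direction = [3.0, 0.0, 4.0]
steered = steer(hidden, direction, coeff=2.0)

# Steering by `coeff` along a unit direction raises the projection by `coeff`.
shift = projection(steered, direction) - projection(hidden, direction)
```

Because the direction is normalized, the coefficient has a direct reading: it is the amount by which the hidden state's projection onto the concept direction changes.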


Key Contributions

  • Two-stage adaptive grid search algorithm to identify optimal steering coefficients for jailbreaking via concept steering
  • Systematic safety audit of 8 SOTA open-source LLMs (Llama, GPT-oss, Qwen, Phi families) using interpretability-based attacks
  • Comparative evaluation of Universal Steering vs Representation Engineering for discovering and exploiting unsafe behavioral directions
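The first contribution can be illustrated with a generic coarse-to-fine search over a single steering coefficient. This is only a sketch under stated assumptions: the summary does not describe the stages or what makes the search adaptive, so the function `two_stage_grid_search`, its parameters, and the stand-in objective `toy_score` are hypothetical. In an actual audit the score would presumably come from the judging protocol rather than a closed-form function.

```python
from typing import Callable, List, Tuple


def linspace(lo: float, hi: float, n: int) -> List[float]:
    """Return n evenly spaced points from lo to hi, inclusive."""
    if n < 2:
        raise ValueError("need at least two grid points")
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def two_stage_grid_search(
    score: Callable[[float], float],
    lo: float,
    hi: float,
    coarse_points: int = 9,
    fine_points: int = 11,
) -> Tuple[float, float]:
    """Coarse-to-fine search for the coefficient that maximizes `score`.

    Stage 1 scans a coarse grid over [lo, hi].
    Stage 2 re-scans a finer grid between the neighbors of the coarse winner.
    """
    # Stage 1: coarse scan over the full range.
    coarse = linspace(lo, hi, coarse_points)
    coarse_scores = [score(c) for c in coarse]
    best_i = max(range(len(coarse)), key=coarse_scores.__getitem__)

    # Stage 2: refine inside the interval bracketing the coarse winner.
    fine_lo = coarse[max(best_i - 1, 0)]
    fine_hi = coarse[min(best_i + 1, len(coarse) - 1)]
    fine = linspace(fine_lo, fine_hi, fine_points)
    fine_scores = [score(c) for c in fine]
    best_j = max(range(len(fine)), key=fine_scores.__getitem__)
    return fine[best_j], fine_scores[best_j]


def toy_score(coeff: float) -> float:
    """Hypothetical stand-in objective with a single peak at coeff = 3.25."""
    return -((coeff - 3.25) ** 2)


best_coeff, best_score = two_stage_grid_search(toy_score, lo=0.0, hi=8.0)
```

The design trade-off is evaluation cost: each call to `score` would correspond to running the steered model on the query set and judging the outputs, so spending most evaluations near the coarse winner is cheaper than a uniformly fine grid over the whole range.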

Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time
Datasets
ToxicChat
Applications
llm safety auditing, chatbot