defense 2026

Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment

Jingyuan Feng, Andrew Gambardella, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo



Published on arXiv: 2603.06727

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Safe Transformer achieves a near-zero Attack Success Rate (0–0.7%) on red-team benchmarks, substantially outperforming both base models and safety fine-tuning baselines such as RLHF and DPO.

Safe Transformer

Novel technique introduced


Current safety alignment methods encode safe behavior implicitly within model parameters, creating a fundamental opacity: we cannot easily inspect why a model refuses a request, nor intervene when its safety judgments fail. We propose Safe Transformer, a modular approach that augments pre-trained language models by inserting a discrete information bottleneck containing an explicit safety bit between transformer layers. The safety bit serves as both an interpretable signal of the model's safety classification and a controllable switch: through contrastive training, the model learns disentangled representations in which the safety bit governs the behavioral mode, producing helpful responses when $s=1$ and refusals when $s=0$, while additional unsupervised bits $u$ allow semantic content to flow through the bottleneck, preserving the model's generation capabilities. This design achieves both interpretability (the safety decision is directly readable) and controllability (the safety bit can be manually overridden), and it requires only lightweight fine-tuning rather than pre-training from scratch. On red-team benchmarks, Safe Transformer achieves a near-zero Attack Success Rate, substantially outperforming base models and safety fine-tuning baselines.
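The mechanism described above can be illustrated with a minimal sketch: a discrete bottleneck module projects the hidden state down to a safety bit plus unsupervised bits, binarizes them, and reconstructs a hidden state from the bits alone. All names, dimensions, the layer placement, and the hard-threshold discretization here are assumptions for illustration; the paper's actual module (and its training-time gradient estimator) may differ.

```python
import numpy as np

class SafetyBottleneck:
    """Hypothetical sketch of a discrete information bottleneck with an
    explicit safety bit. Shapes, init scale, and binarization rule are
    illustrative assumptions, not the paper's exact design."""

    def __init__(self, d_model, n_unsup_bits, seed=0):
        rng = np.random.default_rng(seed)
        # One logit for the safety bit s, plus logits for unsupervised bits u.
        self.W_down = rng.standard_normal((d_model, 1 + n_unsup_bits)) * 0.02
        self.W_up = rng.standard_normal((1 + n_unsup_bits, d_model)) * 0.02

    def forward(self, h, override_s=None):
        logits = h @ self.W_down
        bits = (logits > 0).astype(np.float64)  # hard binarization
        if override_s is not None:
            bits[..., 0] = override_s           # controllability: flip the switch
        s = bits[..., 0]                        # interpretability: directly readable
        h_out = bits @ self.W_up                # only the discrete bits flow onward
        return h_out, s

bottleneck = SafetyBottleneck(d_model=8, n_unsup_bits=3)
h = np.ones((2, 8))                 # stand-in for hidden states at the insert point
h_out, s = bottleneck.forward(h)    # s is readable; h_out continues through the model
_, s_forced = bottleneck.forward(h, override_s=0)  # manual override to refusal mode
```

Because downstream layers see only the reconstructed `h_out`, all semantic information must pass through the bits, which is what lets the single supervised bit act as a behavioral switch.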


Key Contributions

  • Explicit safety bit embedded within transformer backbone as a readable and directly overridable safety classification signal
  • Discrete information bottleneck that disentangles behavioral mode (safe/refuse) from semantic content via contrastive training on paired helpful/refusal responses
  • Lightweight fine-tuning approach applied to Llama-3.2-1B-Instruct achieving near-zero Attack Success Rate (0–0.7%) on red-team benchmarks without pre-training from scratch
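The contrastive training mentioned in the second contribution can be sketched as follows: each prompt is paired with a helpful completion labeled $s=1$ and a refusal labeled $s=0$, and the safety-bit logit is supervised with a binary cross-entropy term. The pairing scheme and the choice of BCE are assumptions inferred from this summary, not the paper's stated loss.

```python
import numpy as np

def safety_bit_bce(s_logit, s_label):
    # Binary cross-entropy on the safety-bit logit: an assumed supervision
    # signal (the paper's exact loss is not given in this summary).
    p = 1.0 / (1.0 + np.exp(-s_logit))
    return -(s_label * np.log(p + 1e-9) + (1.0 - s_label) * np.log(1.0 - p + 1e-9))

def paired_batch(prompt, helpful_response, refusal_response):
    # Contrastive pairing: the same prompt appears twice, once with the
    # helpful completion labeled s=1 and once with the refusal labeled s=0,
    # so the bit, not the prompt, must explain the behavioral difference.
    return [
        (prompt, helpful_response, 1),
        (prompt, refusal_response, 0),
    ]

batch = paired_batch(
    "Example prompt",
    "Sure, here is a helpful answer.",
    "Sorry, I cannot help with that request.",
)
```

In full training this term would be added to the usual language-modeling loss over both members of the pair, so the unsupervised bits keep carrying semantic content while the safety bit specializes to the behavioral mode.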

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
red-team benchmarks (including GCG-based adversarial prompts and in-the-wild jailbreaks)
Applications
llm safety alignment, chatbot safety, content moderation