Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment
Jingyuan Feng, Andrew Gambardella, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Published on arXiv
2603.06727
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Safe Transformer achieves near-zero Attack Success Rate (0–0.7%) on red-team benchmarks, substantially outperforming both base models and safety fine-tuning baselines like RLHF and DPO
Safe Transformer
Novel technique introduced
Current safety alignment methods encode safe behavior implicitly within model parameters, creating a fundamental opacity: we cannot easily inspect why a model refuses a request, nor intervene when its safety judgments fail. We propose Safe Transformer, a modular approach that augments pre-trained language models by inserting a discrete information bottleneck containing an explicit safety bit between transformer layers. The safety bit serves as both an interpretable signal of the model's safety classification and a controllable switch: through contrastive training, the model learns disentangled representations in which the safety bit governs the behavioral mode (producing helpful responses when $s=1$ and refusals when $s=0$), while additional unsupervised bits $u$ encode semantic content, allowing semantic information to flow through the bottleneck and preserving the model's generation capabilities. This design achieves both interpretability (the safety decision is directly readable) and controllability (the safety bit can be manually overridden), requiring only lightweight fine-tuning rather than pre-training from scratch. On red-team benchmarks, Safe Transformer achieves near-zero Attack Success Rate, substantially outperforming base models and safety fine-tuning baselines.
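The mechanism described above can be sketched as a small module: hidden states are projected to discrete bits, bit 0 acts as the readable and overridable safety bit $s$, the remaining bits carry semantic content $u$, and the bits are projected back to the hidden dimension. This is a minimal illustrative sketch, not the paper's implementation; all class and parameter names (`SafetyBottleneck`, `override_s`, `n_bits`) are assumptions.

```python
import numpy as np

class SafetyBottleneck:
    """Minimal sketch of a discrete information bottleneck inserted between
    transformer layers (illustrative; names and shapes are assumptions).
    Bit 0 plays the role of the explicit safety bit s; the remaining bits
    are the unsupervised semantic bits u."""

    def __init__(self, d_model, n_bits, seed=0):
        rng = np.random.default_rng(seed)
        # Down-projection to bit logits and up-projection back to d_model.
        self.W_down = rng.standard_normal((d_model, n_bits)) / np.sqrt(d_model)
        self.W_up = rng.standard_normal((n_bits, d_model)) / np.sqrt(n_bits)

    def forward(self, h, override_s=None):
        logits = h @ self.W_down                   # (batch, n_bits) logits
        bits = (logits > 0).astype(np.float64)     # hard binarization
        if override_s is not None:                 # controllability: force s
            bits[..., 0] = override_s
        s = int(bits[..., 0].mean() > 0.5)         # interpretable safety read-out
        u = bits[..., 1:]                          # unsupervised semantic bits
        h_out = bits @ self.W_up                   # reconstruct hidden state
        return h_out, s, u

# Usage: read the safety bit, then manually override it to force a refusal mode.
bottleneck = SafetyBottleneck(d_model=16, n_bits=8)
h = np.ones((1, 16))
h_out, s, u = bottleneck.forward(h)
h_forced, s_forced, _ = bottleneck.forward(h, override_s=0)
```

In the actual method, training (rather than random projections) shapes these weights so that $s$ tracks the safety classification; the sketch only shows where the bit sits in the forward pass and how an override would work.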
Key Contributions
- Explicit safety bit embedded within transformer backbone as a readable and directly overridable safety classification signal
- Discrete information bottleneck that disentangles behavioral mode (safe/refuse) from semantic content via contrastive training on paired helpful/refusal responses
- Lightweight fine-tuning approach applied to Llama-3.2-1B-Instruct, achieving near-zero Attack Success Rate (0–0.7%) on red-team benchmarks without pre-training from scratch
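The contrastive training on paired helpful/refusal responses can be illustrated as a binary objective on the safety-bit logit: push it toward $s=1$ for prompts paired with helpful responses and toward $s=0$ for refusal pairs. This is a hedged, hypothetical stand-in for the paper's actual loss (`contrastive_safety_loss` is an assumed name), shown only to make the training signal concrete.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def contrastive_safety_loss(logit_helpful, logit_refusal):
    """Binary cross-entropy on the safety-bit logit for a paired example:
    the helpful branch should yield s=1, the refusal branch s=0.
    Illustrative sketch, not the paper's objective."""
    loss_pos = -np.log(sigmoid(logit_helpful) + 1e-12)        # want s=1
    loss_neg = -np.log(1.0 - sigmoid(logit_refusal) + 1e-12)  # want s=0
    return float(loss_pos + loss_neg)

# A well-separated pair (confident s=1 helpful, s=0 refusal) incurs low loss;
# a flipped pair incurs high loss.
good = contrastive_safety_loss(5.0, -5.0)
bad = contrastive_safety_loss(-5.0, 5.0)
```

In the full method this term would be combined with the usual language-modeling loss so that the unsupervised bits $u$ keep carrying the semantic content needed for generation.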