Defense · 2026

C-Δθ: Circuit-Restricted Weight Arithmetic for Selective Refusal

Aditya Kasliwal, Pratinav Seth, Vinay Kumar Sankarapu

0 citations · 20 references · arXiv (Cornell University)


Published on arXiv · 2602.04521

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Circuit-restricted weight edits touching <5% of parameters achieve category-targeted selective refusal across 6 models and 5 harm categories with no inference-time intervention, shifting safety cost from per-request to one-time offline update.

C-Δθ

Novel technique introduced


Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied, but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-Δθ (Circuit-Restricted Weight Arithmetic), which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update Δθ_C supported only on that circuit (typically <5% of parameters). Applying Δθ_C yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
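The core weight-arithmetic step described above can be sketched as a masked update: a dense candidate update is zeroed everywhere outside the discovered circuit before being added to the base weights. This is a minimal illustration under stated assumptions; the function and variable names are hypothetical, and the EAP-IG circuit discovery that produces the mask is not shown.

```python
import numpy as np

def apply_circuit_restricted_update(theta, delta_theta, circuit_mask):
    """Apply a weight update only on the circuit support (names hypothetical).

    theta        -- flat parameter vector of the base model
    delta_theta  -- dense candidate update (e.g., from a refusal fine-tune)
    circuit_mask -- boolean vector marking circuit parameters (True = in circuit)
    """
    # Zero the update everywhere outside the discovered circuit, so the
    # edit touches only the masked subset of parameters; everything else
    # in the checkpoint is byte-identical to the base model.
    return theta + np.where(circuit_mask, delta_theta, 0.0)

# Toy example: 10 parameters, circuit covers 2 of them.
theta = np.zeros(10)
delta = np.ones(10)
mask = np.zeros(10, dtype=bool)
mask[[2, 7]] = True

edited = apply_circuit_restricted_update(theta, delta, mask)
# Only positions 2 and 7 change; the rest of the vector stays at its base values.
```

Because the result is an ordinary parameter tensor, the edited model ships as a standard checkpoint with no runtime hooks, which is the cost-shifting property the abstract emphasizes.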


Key Contributions

  • Circuit-to-checkpoint safety control: first integration of faithfulness-optimized circuit discovery (EAP-IG) with constrained weight editing, producing deployment-ready refusal checkpoints with zero inference-time hooks
  • Mechanistically grounded parameter selection protocol that updates <5% of weights while maintaining category-targeted selectivity and minimal utility degradation (benchmarked on MMLU and GSM8K)
  • Consistent generalization across 6 models and 5 harm categories with out-of-distribution validation on SORRY-Bench
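The parameter selection protocol in the second contribution can be pictured as thresholding attribution scores under a sparsity budget. The sketch below uses a generic top-k rule as a stand-in for the paper's EAP-IG-based localization; the scores, budget, and function name are illustrative assumptions, not the authors' values.

```python
import numpy as np

def select_circuit_mask(attribution_scores, budget=0.05):
    """Keep the top `budget` fraction of parameters by attribution magnitude.

    A generic top-k stand-in for faithfulness-optimized circuit discovery;
    the 5% default mirrors the paper's reported sparsity, nothing more.
    """
    scores = np.abs(attribution_scores)
    k = max(1, int(budget * scores.size))
    # Threshold at the k-th largest magnitude; ties may slightly exceed k.
    threshold = np.partition(scores.ravel(), -k)[-k]
    return scores >= threshold

scores = np.array([0.1, 5.0, 0.2, 0.05, 3.0, 0.0, 0.3, 0.01, 0.02, 0.4])
mask = select_circuit_mask(scores, budget=0.2)  # top 20% of 10 → 2 parameters
```

The resulting boolean mask is exactly the `circuit_mask` shape a restricted update needs, so discovery and editing compose cleanly.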

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
MMLU, GSM8K, SORRY-Bench
Applications
llm safety alignment, selective content refusal, harmful content filtering