Defense · 2026

C-Δθ: Circuit-Restricted Weight Arithmetic for Selective Refusal

Aditya Kasliwal, Pratinav Seth, Vinay Kumar Sankarapu

0 citations · 20 references · arXiv (Cornell University)


Published on arXiv · 2602.04521

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Circuit-restricted weight edits touching <5% of parameters achieve category-targeted selective refusal across 6 models and 5 harm categories with no inference-time intervention, shifting safety cost from per-request to one-time offline update.

C-Δθ

Novel technique introduced


Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied, but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-Δθ (Circuit-Restricted Weight Arithmetic), which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update Δθ_C supported only on that circuit (typically <5% of parameters). Applying Δθ_C yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
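The core weight-arithmetic step described above can be sketched as a masked update: a dense candidate update is zeroed everywhere outside the discovered circuit before being added to the base weights. This is a minimal illustration under stated assumptions; the function and variable names are hypothetical, and the EAP-IG circuit discovery that produces the mask is not shown.

```python
import numpy as np

def apply_circuit_restricted_update(theta, delta_theta, circuit_mask):
    """Apply a weight update only on the circuit support (names hypothetical).

    theta        -- flat parameter vector of the base model
    delta_theta  -- dense candidate update (e.g., from a refusal fine-tune)
    circuit_mask -- boolean vector marking circuit parameters (True = in circuit)
    """
    # Zero the update everywhere outside the discovered circuit, so the
    # edit touches only the masked subset of parameters; everything else
    # in the checkpoint is byte-identical to the base model.
    return theta + np.where(circuit_mask, delta_theta, 0.0)

# Toy example: 10 parameters, circuit covers 2 of them.
theta = np.zeros(10)
delta = np.ones(10)
mask = np.zeros(10, dtype=bool)
mask[[2, 7]] = True

edited = apply_circuit_restricted_update(theta, delta, mask)
# Only positions 2 and 7 change; the rest of the vector stays at its base values.
```

Because the result is an ordinary parameter tensor, the edited model ships as a standard checkpoint with no runtime hooks, which is the cost-shifting property the abstract emphasizes.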


Key Contributions

  • Circuit-to-checkpoint safety control: first integration of faithfulness-optimized circuit discovery (EAP-IG) with constrained weight editing, producing deployment-ready refusal checkpoints with zero inference-time hooks
  • Mechanistically grounded parameter selection protocol that updates <5% of weights while maintaining category-targeted selectivity and minimal utility degradation (benchmarked on MMLU and GSM8K)
  • Consistent generalization across 6 models and 5 harm categories with out-of-distribution validation on SORRY-Bench
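The parameter selection protocol in the second contribution can be pictured as thresholding attribution scores under a sparsity budget. The sketch below uses a generic top-k rule as a stand-in for the paper's EAP-IG-based localization; the scores, budget, and function name are illustrative assumptions, not the authors' values.

```python
import numpy as np

def select_circuit_mask(attribution_scores, budget=0.05):
    """Keep the top `budget` fraction of parameters by attribution magnitude.

    A generic top-k stand-in for faithfulness-optimized circuit discovery;
    the 5% default mirrors the paper's reported sparsity, nothing more.
    """
    scores = np.abs(attribution_scores)
    k = max(1, int(budget * scores.size))
    # Threshold at the k-th largest magnitude; ties may slightly exceed k.
    threshold = np.partition(scores.ravel(), -k)[-k]
    return scores >= threshold

scores = np.array([0.1, 5.0, 0.2, 0.05, 3.0, 0.0, 0.3, 0.01, 0.02, 0.4])
mask = select_circuit_mask(scores, budget=0.2)  # top 20% of 10 → 2 parameters
```

The resulting boolean mask is exactly the `circuit_mask` shape a restricted update needs, so discovery and editing compose cleanly.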

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
MMLU, GSM8K, SORRY-Bench
Applications
llm safety alignment, selective content refusal, harmful content filtering