Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

Large language models remain vulnerable to jailbreak attacks -- inputs designed to bypass safety mechanisms and elicit harmful responses -- despite advances in alignment and instruction tuning. We propose Head-Masked Nullspace Steering (HMNS), a circuit-level intervention that (i) identifies attention heads most causally responsible for a model's default behavior, (ii) suppresses their write paths via targeted column masking, and (iii) injects a perturbation constrained to the orthogonal complement of the muted subspace. HMNS operates in a closed-loop detection-intervention cycle, re-identifying causal heads and reapplying interventions across multiple decoding attempts. Across multiple jailbreak benchmarks, strong safety defenses, and widely used language models, HMNS attains state-of-the-art attack success rates with fewer queries than prior methods. Ablations confirm that nullspace-constrained injection, residual norm scaling, and iterative re-identification are key to its effectiveness. To our knowledge, this is the first jailbreak method to leverage geometry-aware, interpretability-informed interventions, highlighting a new paradigm for controlled model steering and adversarial safety circumvention.

Key Contributions

Novel jailbreak method (HMNS) that identifies the attention heads most causally responsible for a model's default behavior, masks their write paths, and injects a perturbation constrained to the orthogonal complement of the muted subspace
Closed-loop detection-intervention cycle that re-identifies causal heads and reapplies interventions across multiple decoding attempts, for resilience against safety defenses
State-of-the-art attack success rates across AdvBench, HarmBench, JBB-Behaviors, and StrongReject with fewer queries than prior methods
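
The geometric constraint named above (a perturbation confined to the orthogonal complement of a subspace) is ordinary linear algebra and can be illustrated in isolation. The sketch below is not the paper's implementation: this summary does not say how the muted subspace is constructed, so the matrix `M`, the vector `v`, the dimensions, and the function name `project_to_orthogonal_complement` are all hypothetical placeholders filled with random data. It only shows how a vector can be projected so that it has no component in the column span of a given matrix.

```python
import numpy as np


def project_to_orthogonal_complement(v, M):
    """Return the part of v that is orthogonal to every column of M.

    Assumes M has full column rank (an assumption of this sketch).
    """
    # Reduced QR gives an orthonormal basis Q for the column span of M.
    Q, _ = np.linalg.qr(M)
    # Subtract the component of v that lies inside span(M).
    return v - Q @ (Q.T @ v)


# Hypothetical toy data: random placeholders, not taken from any model.
rng = np.random.default_rng(0)
d, k = 16, 3
M = rng.standard_normal((d, k))  # stand-in for a basis of the "muted" subspace
v = rng.standard_normal(d)       # stand-in for an unconstrained vector

v_perp = project_to_orthogonal_complement(v, M)

# Largest absolute inner product between v_perp and any column of M.
residual = float(np.max(np.abs(M.T @ v_perp)))
```

In this toy setting `residual` should be zero up to floating-point error, and projecting `v_perp` a second time leaves it unchanged, which is the defining property of a projection.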

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

white_box, inference_time, targeted

Datasets

AdvBench, HarmBench, JBB-Behaviors, StrongReject

Applications

safety-aligned language models, jailbreak resistance evaluation

2026