attack 2026

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

Vishal Pramanik 1, Maisha Maliha 2, Susmit Jha 3, Sumit Kumar Jha 2

0 citations

Published on arXiv

2604.10326

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves state-of-the-art attack success rates with fewer queries than prior jailbreak methods across multiple benchmarks and strong safety defenses

Novel technique introduced

HMNS (Head-Masked Nullspace Steering)


Large language models remain vulnerable to jailbreak attacks -- inputs designed to bypass safety mechanisms and elicit harmful responses -- despite advances in alignment and instruction tuning. We propose Head-Masked Nullspace Steering (HMNS), a circuit-level intervention that (i) identifies attention heads most causally responsible for a model's default behavior, (ii) suppresses their write paths via targeted column masking, and (iii) injects a perturbation constrained to the orthogonal complement of the muted subspace. HMNS operates in a closed-loop detection-intervention cycle, re-identifying causal heads and reapplying interventions across multiple decoding attempts. Across multiple jailbreak benchmarks, strong safety defenses, and widely used language models, HMNS attains state-of-the-art attack success rates with fewer queries than prior methods. Ablations confirm that nullspace-constrained injection, residual norm scaling, and iterative re-identification are key to its effectiveness. To our knowledge, this is the first jailbreak method to leverage geometry-aware, interpretability-informed interventions, highlighting a new paradigm for controlled model steering and adversarial safety circumvention.
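
The geometric constraint in step (iii), a perturbation confined to the orthogonal complement of a subspace, can be illustrated with plain linear algebra. The sketch below is not the authors' implementation and touches no model: the vectors are random stand-ins, and the names `muted`, `orthonormal_basis`, and `project_to_complement` are hypothetical, chosen only for this example. It builds an orthonormal basis for a small subspace with Gram-Schmidt and removes from an arbitrary vector every component lying in that subspace.

```python
import math
import random


def dot(a, b):
    """Euclidean inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, tol=1e-12):
    """Modified Gram-Schmidt: orthonormal vectors spanning the same space."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors that already lie in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_complement(u, basis):
    """Remove from u every component lying in span(basis)."""
    w = list(u)
    for q in basis:
        c = dot(w, q)
        w = [wi - c * qi for wi, qi in zip(w, q)]
    return w


random.seed(0)
d, k = 8, 3  # toy dimensions, chosen arbitrarily
muted = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]  # stand-in subspace
u = [random.gauss(0, 1) for _ in range(d)]  # arbitrary candidate direction

basis = orthonormal_basis(muted)
delta = project_to_complement(u, basis)  # orthogonal to every vector in `muted`
```

By construction, `delta` has zero inner product (up to floating-point error) with each vector in `muted`, which is what "constrained to the orthogonal complement" means geometrically.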


Key Contributions

  • Novel jailbreak method (HMNS) that identifies causally responsible attention heads, masks their write paths, and injects a perturbation constrained to the orthogonal complement of the muted subspace
  • Closed-loop detection-intervention cycle that re-identifies causal heads across multiple decoding attempts for defense resilience
  • State-of-the-art attack success rates across AdvBench, HarmBench, JBB-Behaviors, and StrongReject with fewer queries than prior methods

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Datasets
AdvBench, HarmBench, JBB-Behaviors, StrongReject
Applications
safety-aligned language models, jailbreak resistance evaluation