attack 2026

Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors

0 citations

Published on arXiv

2604.12359

Model Poisoning

OWASP ML Top 10 — ML10

AI Supply Chain Attacks

OWASP ML Top 10 — ML06

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves high triggered jailbreak success while preserving safety on clean inputs and maintaining general utility across multiple safety-aligned LLMs

Null-Space Constrained Activation Steering Backdoor

Novel technique introduced

Safety-aligned large language models (LLMs) are increasingly deployed in real-world pipelines, yet this deployment also enlarges the supply-chain attack surface: adversaries can distribute backdoored checkpoints that behave normally under standard evaluation but jailbreak when a hidden trigger is present. Recent post-hoc weight-editing methods offer an efficient approach to injecting such backdoors by directly modifying model weights to map a trigger to an attacker-specified response. However, existing methods typically optimize a token-level mapping that forces an affirmative prefix (e.g., "Sure"), which does not guarantee sustained harmful output: the model may begin with apparent agreement yet revert to safety-aligned refusal within a few decoding steps. We address this reliability gap by shifting the backdoor objective from surface tokens to internal representations. We extract a steering vector that captures the difference between compliant and refusal behaviors, and compile it into a persistent weight modification that activates only when the trigger is present. To preserve stealthiness and benign utility, we impose a null-space constraint so that the injected edit remains dormant on clean inputs. The method is efficient, requiring only a small set of examples and admitting a closed-form solution. Across multiple safety-aligned LLMs and jailbreak benchmarks, our method achieves high triggered attack success while maintaining non-triggered safety and general utility.
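The null-space idea in the abstract can be illustrated with plain linear algebra on synthetic vectors. The sketch below is not the paper's formulation: it assumes a single linear layer, a difference-of-means steering vector, and a rank-one update projected onto the orthogonal complement of the clean keys. All names (`K_clean`, `k_trig`, `steer`, `delta_W`) and dimensions are hypothetical, and no real model is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_clean, rank = 16, 12, 200, 8

# Synthetic stand-ins for output-side activations under two behaviors.
# Assumption: the steering vector is a simple difference of means.
compliant = rng.normal(size=(32, d_out)) + 1.0
refusal = rng.normal(size=(32, d_out)) - 1.0
steer = compliant.mean(axis=0) - refusal.mean(axis=0)

# Synthetic "clean" layer inputs (keys), confined to a low-rank subspace
# so that a non-trivial orthogonal complement exists.
basis = rng.normal(size=(d_in, rank))
K_clean = basis @ rng.normal(size=(rank, n_clean))  # shape (d_in, n_clean)

# Synthetic key for an input carrying the trigger.
k_trig = rng.normal(size=d_in)

# Projector onto the orthogonal complement of span(K_clean).
U, s, _ = np.linalg.svd(K_clean, full_matrices=False)
U_r = U[:, s > 1e-8 * s.max()]       # directions actually used by clean keys
P = np.eye(d_in) - U_r @ U_r.T       # satisfies P @ K_clean ~ 0

# Closed-form rank-one update: delta_W @ k_trig == steer,
# while delta_W @ K_clean ~ 0 because the row direction lies in range(P).
pk = P @ k_trig
delta_W = np.outer(steer, pk) / (k_trig @ pk)

clean_shift = np.abs(delta_W @ K_clean).max()       # effect on clean keys
trig_error = np.abs(delta_W @ k_trig - steer).max() # mismatch on trigger key

print(f"max shift on clean keys:   {clean_shift:.2e}")
print(f"max error on trigger key:  {trig_error:.2e}")
```

In this toy setting the update leaves every clean key's output unchanged up to floating-point error while mapping the trigger key exactly onto the steering vector, which is the property the abstract describes as the edit remaining dormant on clean inputs.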

Key Contributions

A novel weight-editing backdoor method that compiles activation steering vectors into persistent weight modifications using null-space constraints
Shifts the backdoor objective from a surface token-level mapping to internal representation steering, so that harmful output is sustained rather than limited to an affirmative prefix
Achieves high triggered attack success while maintaining non-triggered safety and benign utility, using a closed-form solution that requires only a small set of examples

🛡️ Threat Analysis

AI Supply Chain Attacks

Explicitly targets supply-chain attack vector by distributing backdoored checkpoints that pass standard evaluation but activate malicious behavior when triggered.

Model Poisoning

Core contribution is a backdoor/trojan attack that embeds hidden trigger-activated behavior (jailbreak) in LLM weights while preserving normal operation on clean inputs.

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

training_time, inference_time, targeted, white_box

Datasets

jailbreak benchmarks

Applications

safety-aligned llms, chatbot

Similar Papers

Colluding LoRA: A Composite Attack on LLM Safety Alignment

When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion

Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor

Attacking LLMs and AI Agents: Advertisement Embedding Attacks Against Large Language Models

Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors

Disabling Self-Correction in Retrieval-Augmented Generation via Stealthy Retriever Poisoning

Stateless Yet Not Forgetful: Implicit Memory as a Hidden Channel in LLMs

Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems