Defense · 2026

Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

Songping Peng 1, Zhiheng Zhang 2, Daojian Zeng 1, Lincheng Jiang 3, Xieping Gao 1



Published on arXiv: 2604.12384

Prompt Injection (OWASP LLM Top 10: LLM01)

Key Finding

CWAC achieves the lowest harmful scores across four LLMs while maintaining fine-tuning accuracy, substantially outperforming baselines even under high ratios of harmful data.

CWAC

Novel technique introduced


Safety alignment in Large Language Models (LLMs) remains highly fragile during fine-tuning, where even benign adaptation can degrade pre-trained refusal behaviors and enable harmful responses. Existing defenses typically constrain either weights or activations in isolation, without considering their coupled effects on safety. In this paper, we first theoretically demonstrate that constraining either weights or activations alone is insufficient for safety preservation. To robustly preserve safety alignment, we propose Coupled Weight and Activation Constraints (CWAC), a novel approach that simultaneously enforces a precomputed safety subspace on weight updates and applies targeted regularization to safety-critical features identified by sparse autoencoders. Extensive experiments across four widely used LLMs and diverse downstream tasks show that CWAC consistently achieves the lowest harmful scores with minimal impact on fine-tuning accuracy, substantially outperforming strong baselines even under high harmful data ratios.


Key Contributions

  • Theoretical proof that single-level constraints (weight-only or activation-only) are insufficient for safety preservation
  • CWAC method coupling weight subspace projection with activation regularization via sparse autoencoders
  • Extensive validation across four LLMs showing lowest harmful scores with minimal accuracy impact
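To make the two coupled constraints concrete, here is a minimal NumPy sketch of the general idea: a weight update is projected off a precomputed safety subspace, and a quadratic penalty discourages drift in activation features flagged as safety-critical. This is an illustrative reconstruction, not the authors' implementation; the linear encoder `W_enc` standing in for a sparse autoencoder, the function names, and the `lam` coefficient are all assumptions.

```python
import numpy as np

def project_out_safety_subspace(grad, U):
    """Remove the component of a weight update `grad` (d x n) that lies in
    the safety subspace spanned by the orthonormal columns of U (d x k),
    so fine-tuning cannot move weights along safety-critical directions."""
    return grad - U @ (U.T @ grad)

def safety_feature_penalty(acts, W_enc, safety_idx, ref_codes, lam=0.1):
    """Quadratic penalty on drift of safety-critical features.
    `W_enc` (d x m) is a toy linear stand-in for a sparse-autoencoder
    encoder; `safety_idx` selects the features flagged as safety-relevant,
    and `ref_codes` holds their values under the aligned base model."""
    codes = acts @ W_enc                       # stand-in SAE encoding
    drift = codes[:, safety_idx] - ref_codes[:, safety_idx]
    return lam * float(np.sum(drift ** 2))

# Example: the projected update is exactly orthogonal to the subspace.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(8, 3)))   # orthonormal safety basis
grad = rng.normal(size=(8, 5))
safe_grad = project_out_safety_subspace(grad, U)
print(np.allclose(U.T @ safe_grad, 0.0, atol=1e-8))  # → True
```

In a real training loop the projection would be applied to each gradient step (or LoRA update), and the penalty would be added to the task loss, which is what makes the weight-level and activation-level constraints operate jointly rather than in isolation.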

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time
Applications
llm fine-tuning, safety alignment preservation