Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models
Weiwei Qi 1, Zefeng Wu 1, Tianhang Zheng 1,2, Zikang Zhang 1, Xiaojun Jia 3, Zhan Qin 1,2, Kui Ren 1,2
Published on arXiv
2604.08297
Key Finding
SET reduces attack success rates by over 50% while updating only 1% of weights over 100 iterations; SPA limits safety degradation to within 1% after 1,000 iterations of instruction tuning
ESI (Expected Safety Impact)
Novel technique introduced
Ensuring Large Language Model (LLM) safety is crucial, yet the lack of a clear understanding of safety mechanisms hinders the development of precise and reliable methodologies for safety intervention across diverse tasks. To better understand and control LLM safety, we propose the Expected Safety Impact (ESI) framework for quantifying how different parameters affect LLM safety. Based on ESI, we reveal distinct safety-critical patterns across different LLM architectures: in dense LLMs, many safety-critical parameters are located in the value matrices (V) and MLPs of middle layers, whereas in Mixture-of-Experts (MoE) models, they shift to the late-layer MLPs. Leveraging ESI, we further introduce two targeted intervention paradigms for safety enhancement and preservation: Safety Enhancement Tuning (SET) and Safety Preserving Adaptation (SPA). SET aligns unsafe LLMs by updating only a few safety-critical parameters, effectively enhancing safety while preserving original performance. SPA safeguards well-aligned LLMs during capability-oriented intervention (e.g., instruction tuning) by preventing disruption of safety-critical weights, allowing the LLM to acquire new abilities while maintaining its safety. Extensive evaluations on different LLMs demonstrate that SET can reduce the attack success rates of unaligned LLMs by over 50% with only a 100-iteration update to 1% of model weights, and that SPA can limit the safety degradation of aligned LLMs to within 1% after 1,000 iterations of instruction fine-tuning on different tasks. Our code is available at: https://github.com/ZJU-LLM-Safety/SafeWeights-ACL.
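The abstract and contributions describe ESI as scoring parameter-level safety criticality from gradients of a safety objective combined with the parameter tensor's standard deviation. A minimal sketch of that idea follows; the toy two-layer model, the stand-in safety loss, the |gradient| x std combination rule, and the top-1% threshold are all illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of ESI-style safety-criticality scoring (assumed form:
# |gradient of a safety loss| scaled by the parameter tensor's std).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for an LLM: two linear layers (think "V matrix" and "MLP").
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# Stand-in safety loss: cross-entropy on a batch of harmful prompts
# whose target is a refusal label (pure illustration).
x = torch.randn(32, 8)
y = torch.zeros(32, dtype=torch.long)  # pretend class 0 = "refuse"
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

# Score every parameter, then flag the global top 1% as safety-critical.
scores = {}
for name, p in model.named_parameters():
    scores[name] = (p.grad.abs() * p.detach().std()).flatten()

all_scores = torch.cat(list(scores.values()))
threshold = torch.quantile(all_scores, 0.99)
critical = {n: (s >= threshold) for n, s in scores.items()}
n_critical = sum(m.sum().item() for m in critical.values())
print(f"{n_critical} of {all_scores.numel()} parameters flagged critical")
```

The boolean masks in `critical` are exactly what the SET/SPA interventions would consume: they mark which entries of each weight tensor count as safety-critical.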
Key Contributions
- Expected Safety Impact (ESI) framework for quantifying parameter-level safety criticality in LLMs using gradient analysis and parameter standard deviation
- Safety Enhancement Tuning (SET) that aligns unsafe LLMs by updating only 1% of safety-critical weights, reducing attack success rates by 50%+ in 100 iterations
- Safety Preserving Adaptation (SPA) that maintains safety during instruction fine-tuning by protecting critical weights, limiting degradation to <1% after 1,000 iterations
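The two interventions above can be sketched as complementary gradient masks over a precomputed safety-critical set: SET updates only the critical weights, while SPA freezes them and lets capability tuning touch everything else. The single hand-picked critical weight, the SGD optimizer, and the MSE capability loss below are assumptions for illustration; the paper's actual training recipes may differ.

```python
# Hedged sketch of SET and SPA as complementary gradient masks,
# assuming a precomputed boolean "critical" mask per parameter
# (e.g. the top-1% ESI scores).
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Hypothetical critical set: a single weight stands in for the top 1%.
masks = {n: torch.zeros_like(p, dtype=torch.bool)
         for n, p in model.named_parameters()}
masks["weight"][0, 0] = True  # pretend this entry is safety-critical

def step(loss, mode):
    """One masked update. mode='SET' touches only critical weights;
    mode='SPA' touches only non-critical weights."""
    opt.zero_grad()
    loss.backward()
    for name, p in model.named_parameters():
        keep = masks[name] if mode == "SET" else ~masks[name]
        p.grad.mul_(keep.to(p.grad.dtype))  # zero disallowed gradients
    opt.step()

x = torch.randn(16, 8)
before = {n: p.detach().clone() for n, p in model.named_parameters()}
step(nn.functional.mse_loss(model(x), torch.zeros(16, 4)), mode="SPA")
# Under SPA the critical weight is provably untouched; others may move.
assert torch.equal(model.weight[0, 0], before["weight"][0, 0])
print("SPA left the safety-critical weight intact")
```

Because the mask zeroes gradients exactly, protected entries are bit-for-bit unchanged under plain SGD; with stateful optimizers (Adam, weight decay) the masking would need to be applied to the update itself rather than just the gradient.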