defense 2026

Multilingual Safety Alignment Via Sparse Weight Editing

Jiaming Liang , Zhaoxin Wang , Handing Wang



Published on arXiv (2602.22554)

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Substantially reduces Attack Success Rate for low-resource languages, evaluated across 8 languages and two model families, with negligible degradation of general reasoning capabilities and no gradient-based training.

Sparse Weight Editing

Novel technique introduced


Large Language Models (LLMs) exhibit significant safety disparities across languages: prompts in low-resource languages (LRLs) often bypass safety guardrails established for high-resource languages (HRLs) like English. Existing solutions, such as multilingual supervised fine-tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), are computationally expensive and depend on scarce multilingual safety data. In this work, we propose a novel, training-free alignment framework based on Sparse Weight Editing. Observing that safety capabilities are localized within a sparse set of safety neurons, we formulate the cross-lingual alignment problem as a constrained linear transformation. We derive a closed-form solution that optimally maps the harmful representations of LRLs to the robust safety subspaces of HRLs, while preserving general utility via a null-space projection constraint. Extensive experiments across 8 languages and multiple model families (Llama-3, Qwen-2.5) demonstrate that our method substantially reduces Attack Success Rate (ASR) in LRLs with negligible impact on general reasoning capabilities, all achieved with a single, data-efficient calculation.
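The abstract's core idea, a closed-form weight edit that redirects harmful LRL representations while a null-space projection leaves utility representations untouched, can be sketched in a few lines of linear algebra. The sketch below is a toy illustration with random stand-in matrices, not the paper's actual implementation: `W` is a weight matrix in a hypothetical safety-neuron block, `H` holds harmful LRL activations, `R` their target outputs in the HRL safety subspace, and `U` benign activations whose outputs must be preserved (all names and shapes are assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                   # toy hidden size
W = 0.1 * rng.standard_normal((d, d))    # weight matrix to edit (assumption: a safety-neuron block)

# Toy stand-ins for activations (columns are examples):
H = rng.standard_normal((d, 16))         # harmful LRL representations
R = rng.standard_normal((d, 16))         # desired outputs in the HRL safety subspace
U = rng.standard_normal((d, 32))         # benign/utility representations to preserve

# Null-space projector: P @ U == 0, so any edit of the form M @ P
# changes nothing on the utility subspace spanned by U.
P = np.eye(d) - U @ np.linalg.pinv(U)

# Closed-form least-squares edit:
#   minimise ||(W + dW) @ H - R||  subject to  dW @ U = 0
# Writing dW = M @ P enforces the constraint; the pseudoinverse gives M.
dW = (R - W @ H) @ np.linalg.pinv(P @ H) @ P
W_edited = W + dW

# dW @ U is (numerically) zero: utility outputs are exactly preserved,
# while harmful LRL activations are mapped onto the target safety outputs.
```

This mirrors the structure of null-space-constrained model editing: the constraint is satisfied by construction rather than by optimization, which is what makes a single training-free calculation sufficient.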


Key Contributions

  • Identifies that LLM safety capabilities are localized in a sparse set of 'safety neurons' and formalizes cross-lingual alignment as a constrained linear transformation
  • Derives a closed-form solution that maps harmful LRL representations to HRL safety subspaces using null-space projection to preserve general utility
  • Provides a training-free, data-efficient, plug-and-play alignment method validated across 8 languages on Llama-3 and Qwen-2.5

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Datasets
multilingual safety benchmarks (Llama-3, Qwen-2.5 evaluations across 8 languages)
Applications
multilingual llm safety alignment, cross-lingual jailbreak defense