MoGU V2: Toward a Higher Pareto Frontier Between Model Usability and Security
Yanrui Du, Fenglei Fan, Sendong Zhao, Jiawei Cao, Ting Liu, Bing Qin
Published on arXiv
arXiv:2509.06807
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
MoGU_v2 advances the Pareto frontier between LLM usability and security across diverse LLM families, including mainstream, on-device, and reasoning models, while recovering safety degraded by instruction fine-tuning via a simple data-mix strategy.
MoGU_v2
Novel technique introduced
As Large Language Models (LLMs) increasingly permeate human life, their security has emerged as a critical concern, particularly their ability to maintain harmless responses to malicious instructions. Although extensive methods have improved LLMs' security, they often lead to conservative, rejection-oriented responses that compromise practical usability. This presents a key challenge: how to advance the Pareto frontier between LLMs' usability and security, rather than necessitating a trade-off between them. To address this, we propose the MoGU framework, in which an intra-layer router dynamically allocates weights by sensing hidden states, thereby balancing the contributions of security-optimized and usability-optimized variants. Despite its initial potential, the MoGU framework faces limitations such as parameter redundancy and performance bottlenecks. To overcome these, we further propose an improved MoGU_v2 framework that establishes a tighter coupling between the routers and hidden states. In MoGU_v2, routers are embedded only in layers encoding highly classifiable security features, and backbone modules are activated during router optimization to enable bidirectional adaptation. MoGU_v2 exhibits strong adaptability and stable improvements across various series of LLMs, including mainstream LLMs serving as brains in various applications, on-device LLMs optimized for resource-constrained scenarios, and reasoning LLMs tailored for user interpretability. Meanwhile, even facing risks introduced by Instruction Fine-tuning, MoGU_v2 can easily restore security without compromising task performance gains via a simple data-mix strategy. These comprehensive improvements highlight MoGU_v2 as a robust and versatile solution for mitigating security risks in real-world applications.
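The core mechanism, an intra-layer router that reads the hidden state and softly mixes a security-optimized and a usability-optimized variant, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the gate here is a single linear-plus-sigmoid scalar, and all names (`router_blend`, `gate_w`, `gate_b`) are hypothetical.

```python
import math

def router_blend(hidden, gate_w, gate_b, secure_out, usable_out):
    """Toy intra-layer router: a tiny gate senses the hidden state and
    produces a weight that convexly combines the outputs of the
    security-optimized and usability-optimized variants."""
    # Scalar gate score from the hidden state (linear layer + bias).
    score = sum(h * w for h, w in zip(hidden, gate_w)) + gate_b
    w_sec = 1.0 / (1.0 + math.exp(-score))  # sigmoid weight for the secure variant
    # Per-dimension convex combination of the two variants' outputs.
    return [w_sec * s + (1.0 - w_sec) * u
            for s, u in zip(secure_out, usable_out)]
```

With an untrained (all-zero) gate the weight is 0.5, so the output is simply the average of the two variants; training the gate on hidden states of benign versus malicious inputs would push the weight toward the appropriate variant.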
Key Contributions
- MoGU_v2 framework embedding routers only in layers with highly classifiable security features, reducing parameter redundancy compared to the original MoGU
- Bidirectional adaptation mechanism that activates backbone modules during router optimization to tighten coupling between routers and hidden states
- Data-mix strategy that restores safety after Instruction Fine-tuning without sacrificing task performance gains, applicable across mainstream, on-device, and reasoning LLMs
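The first contribution hinges on identifying which layers encode "highly classifiable security features." One way to operationalize that, sketched below under assumptions (the paper's actual probing procedure may differ; `select_router_layers` and the 0.9 threshold are illustrative), is to score each layer by how well a simple nearest-centroid classifier separates hidden states of benign and harmful prompts, then embed routers only where the score is high.

```python
def layer_security_separability(benign_states, harmful_states):
    """Score a layer by how well a nearest-centroid classifier separates
    benign from harmful hidden states (a stand-in for a learned probe)."""
    def centroid(xs):
        dim = len(xs[0])
        return [sum(x[i] for x in xs) / len(xs) for i in range(dim)]

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    cb, ch = centroid(benign_states), centroid(harmful_states)
    correct = sum(dist2(x, cb) < dist2(x, ch) for x in benign_states)
    correct += sum(dist2(x, ch) < dist2(x, cb) for x in harmful_states)
    return correct / (len(benign_states) + len(harmful_states))

def select_router_layers(per_layer_states, threshold=0.9):
    """Keep only layer indices whose hidden states are highly classifiable
    by security; routers would be embedded only in these layers."""
    return [i for i, (benign, harmful) in enumerate(per_layer_states)
            if layer_security_separability(benign, harmful) >= threshold]
```

Layers where the two classes overlap score near chance and are skipped, which is what cuts the router parameter count relative to placing a router in every layer.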
🛡️ Threat Analysis
The paper explicitly addresses security risks introduced by Instruction Fine-tuning (IFT), where fine-tuning on task data erodes safety alignment, and proposes a data-mix strategy to restore security — directly targeting the transfer learning attack vector where fine-tuning undermines embedded safety.
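The data-mix defense against fine-tuning-induced safety erosion can be sketched as below. This is a generic illustration, not the paper's recipe: the 10% safety ratio, the function name `mix_safety_data`, and the example structure are all assumptions.

```python
import random

def mix_safety_data(task_examples, safety_examples, safety_ratio=0.1, seed=0):
    """Toy data-mix strategy: blend a small fraction of safety examples
    (harmful instruction -> refusal pairs) into the instruction
    fine-tuning set, so the tuned model retains refusal behavior while
    still learning the downstream task. The 10% ratio is illustrative."""
    rng = random.Random(seed)
    n_safety = max(1, int(len(task_examples) * safety_ratio))
    # Sample (with replacement) a small slice of safety data and shuffle
    # it into the task data so safety examples are seen throughout training.
    mixed = list(task_examples) + rng.choices(safety_examples, k=n_safety)
    rng.shuffle(mixed)
    return mixed
```

Because only a small fraction of the training mix is safety data, the downstream task gains from fine-tuning are largely preserved, which matches the paper's claim that security is restored "without compromising task performance gains."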