
MoGU V2: Toward a Higher Pareto Frontier Between Model Usability and Security

Yanrui Du 1, Fenglei Fan 2, Sendong Zhao 1, Jiawei Cao 1, Ting Liu 1, Bing Qin 1



Published on arXiv: 2509.06807

Transfer Learning Attack (OWASP ML Top 10 — ML07)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

MoGU_v2 advances the Pareto frontier between LLM usability and security across diverse LLM families, including mainstream, on-device, and reasoning models, while recovering safety degraded by instruction fine-tuning via a simple data-mix strategy.

MoGU_v2

Novel technique introduced


As Large Language Models (LLMs) increasingly permeate human life, their security has emerged as a critical concern, particularly their ability to maintain harmless responses to malicious instructions. Although extensive methods have improved LLMs' security, they often lead to conservative, rejection-oriented responses that compromise practical usability. This presents a key challenge: how to advance the Pareto frontier between LLMs' usability and security, rather than necessitating a trade-off between them. To address this, we propose the MoGU framework, in which an intra-layer router dynamically allocates weights by sensing hidden states, thereby balancing the contributions of security-optimized and usability-optimized variants. Despite its initial promise, the MoGU framework suffers from parameter redundancy and performance bottlenecks. To overcome these, we propose an improved MoGU_v2 framework that establishes a tighter coupling between routers and hidden states. In MoGU_v2, routers are embedded only in layers encoding highly classifiable security features, and backbone modules are activated during router optimization to enable bidirectional adaptation. MoGU_v2 exhibits strong adaptability and stable improvements across various series of LLMs, including mainstream LLMs serving as brains in various applications, on-device LLMs optimized for resource-constrained scenarios, and reasoning LLMs tailored for user interpretability. Meanwhile, even when facing the risks introduced by instruction fine-tuning, MoGU_v2 can easily restore security without compromising task performance gains via a simple data-mix strategy. These comprehensive improvements highlight MoGU_v2 as a robust and versatile solution for mitigating security risks in real-world applications.
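The core mechanism described above — an intra-layer router that senses the hidden state and mixes the outputs of a security-optimized and a usability-optimized variant — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the gate shape, initialization, and class/function names (`IntraLayerRouter`, `softmax`) are assumptions for the sketch.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a logit vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

class IntraLayerRouter:
    """Hypothetical sketch: a tiny linear gate that reads a layer's
    hidden state and produces mixing weights for a security-optimized
    and a usability-optimized variant of the same backbone module."""

    def __init__(self, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Linear gate: hidden state -> 2 logits (security vs. usability).
        self.gate = rng.standard_normal((hidden_dim, 2)) * 0.02

    def __call__(self, hidden, secure_out, usable_out):
        # Weights sum to 1, so the mixed output is a convex combination.
        w = softmax(hidden @ self.gate)
        return w[0] * secure_out + w[1] * usable_out

# Toy usage: mix the two variants' outputs for one token's hidden state.
router = IntraLayerRouter(hidden_dim=8)
h = np.ones(8)
mixed = router(h, secure_out=np.zeros(8), usable_out=np.ones(8))
```

Because the gate outputs a convex combination, the mixed output always lies between the two variants' outputs; in MoGU_v2, per the abstract, such routers would be placed only in layers whose hidden states carry highly classifiable security features.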


Key Contributions

  • MoGU_v2 framework embedding routers only in layers with highly classifiable security features, reducing parameter redundancy compared to original MoGU
  • Bidirectional adaptation mechanism that activates backbone modules during router optimization to tighten coupling between routers and hidden states
  • Data-mix strategy that restores safety after Instruction Fine-tuning without sacrificing task performance gains, applicable across mainstream, on-device, and reasoning LLMs
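The data-mix strategy in the last contribution can be sketched as a simple blending step: a small fraction of safety examples (malicious instruction paired with a refusal) is folded into the instruction fine-tuning set. This is an illustrative sketch only — the function name `mix_ift_data` and the 10% ratio are assumptions, not the paper's recipe.

```python
import random

def mix_ift_data(task_examples, safety_examples, safety_ratio=0.1, seed=0):
    """Hypothetical sketch of a data-mix strategy: blend a small share of
    safety examples into the instruction fine-tuning data so that task
    performance gains are kept while degraded safety is restored."""
    rng = random.Random(seed)
    n_safety = max(1, int(len(task_examples) * safety_ratio))
    n_safety = min(n_safety, len(safety_examples))
    mixed = list(task_examples) + rng.sample(safety_examples, n_safety)
    rng.shuffle(mixed)  # interleave safety and task examples
    return mixed

# Toy usage with placeholder records.
tasks = [{"prompt": f"task {i}", "response": "..."} for i in range(100)]
safety = [{"prompt": f"malicious {i}", "response": "I can't help with that."}
          for i in range(50)]
train_set = mix_ift_data(tasks, safety)
```

The key design point, as the contribution states, is that the safety data is additive rather than substitutive: the full task set is retained, so task gains from fine-tuning are not sacrificed.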

🛡️ Threat Analysis

Transfer Learning Attack

The paper explicitly addresses security risks introduced by instruction fine-tuning (IFT), where fine-tuning on task data erodes safety alignment, and proposes a data-mix strategy to restore security. This directly targets the transfer-learning attack vector, in which fine-tuning undermines a model's embedded safety behavior.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, training_time
Applications
llm safety alignment, chatbots, on-device llms, reasoning llms