MoGU V2: Toward a Higher Pareto Frontier Between Model Usability and Security
Yanrui Du, Fenglei Fan, Sendong Zhao, Jiawei Cao, Ting Liu, Bing Qin
Published on arXiv
arXiv:2509.06807
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
MoGU_v2 advances the Pareto frontier between LLM usability and security across diverse LLM families, including mainstream, on-device, and reasoning models, while recovering safety degraded by instruction fine-tuning via a simple data-mix strategy.
MoGU_v2
Novel technique introduced
As Large Language Models (LLMs) increasingly permeate human life, their security has emerged as a critical concern, particularly their ability to maintain harmless responses to malicious instructions. Although extensive methods have improved LLMs' security, they often lead to conservative, rejection-oriented responses that compromise practical usability. This presents a key challenge: how to advance the Pareto frontier between LLMs' usability and security, rather than necessitating a trade-off between them. To address this, we propose the MoGU framework, in which an intra-layer router dynamically allocates weights by sensing hidden states, thereby balancing the contributions of security-optimized and usability-optimized variants. Despite its initial potential, the MoGU framework faces limitations such as parameter redundancy and performance bottlenecks. To overcome these, we further propose an improved MoGU_v2 framework that establishes a tighter coupling between the routers and hidden states. In MoGU_v2, routers are embedded only in layers encoding highly classifiable security features, and backbone modules are activated during router optimization to enable bidirectional adaptation. MoGU_v2 exhibits strong adaptability and stable improvements across various series of LLMs, including mainstream LLMs serving as brains in various applications, on-device LLMs optimized for resource-constrained scenarios, and reasoning LLMs tailored for user interpretability. Meanwhile, even facing risks introduced by Instruction Fine-tuning, MoGU_v2 can easily restore security without compromising task performance gains via a simple data-mix strategy. These comprehensive improvements highlight MoGU_v2 as a robust and versatile solution for mitigating security risks in real-world applications.
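The core mechanism, an intra-layer router that reads the hidden state and softly mixes a security-optimized and a usability-optimized variant, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the gate here is a single linear-plus-sigmoid scalar, and all names (`router_blend`, `gate_w`, `gate_b`) are hypothetical.

```python
import math

def router_blend(hidden, gate_w, gate_b, secure_out, usable_out):
    """Toy intra-layer router: a tiny gate senses the hidden state and
    produces a weight that convexly combines the outputs of the
    security-optimized and usability-optimized variants."""
    # Scalar gate score from the hidden state (linear layer + bias).
    score = sum(h * w for h, w in zip(hidden, gate_w)) + gate_b
    w_sec = 1.0 / (1.0 + math.exp(-score))  # sigmoid weight for the secure variant
    # Per-dimension convex combination of the two variants' outputs.
    return [w_sec * s + (1.0 - w_sec) * u
            for s, u in zip(secure_out, usable_out)]
```

With an untrained (all-zero) gate the weight is 0.5, so the output is simply the average of the two variants; training the gate on hidden states of benign versus malicious inputs would push the weight toward the appropriate variant.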
Key Contributions
- MoGU_v2 framework embedding routers only in layers with highly classifiable security features, reducing parameter redundancy compared to the original MoGU
- Bidirectional adaptation mechanism that activates backbone modules during router optimization to tighten coupling between routers and hidden states
- Data-mix strategy that restores safety after Instruction Fine-tuning without sacrificing task performance gains, applicable across mainstream, on-device, and reasoning LLMs
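The first contribution hinges on identifying which layers encode "highly classifiable security features." One way to operationalize that, sketched below under assumptions (the paper's actual probing procedure may differ; `select_router_layers` and the 0.9 threshold are illustrative), is to score each layer by how well a simple nearest-centroid classifier separates hidden states of benign and harmful prompts, then embed routers only where the score is high.

```python
def layer_security_separability(benign_states, harmful_states):
    """Score a layer by how well a nearest-centroid classifier separates
    benign from harmful hidden states (a stand-in for a learned probe)."""
    def centroid(xs):
        dim = len(xs[0])
        return [sum(x[i] for x in xs) / len(xs) for i in range(dim)]

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    cb, ch = centroid(benign_states), centroid(harmful_states)
    correct = sum(dist2(x, cb) < dist2(x, ch) for x in benign_states)
    correct += sum(dist2(x, ch) < dist2(x, cb) for x in harmful_states)
    return correct / (len(benign_states) + len(harmful_states))

def select_router_layers(per_layer_states, threshold=0.9):
    """Keep only layer indices whose hidden states are highly classifiable
    by security; routers would be embedded only in these layers."""
    return [i for i, (benign, harmful) in enumerate(per_layer_states)
            if layer_security_separability(benign, harmful) >= threshold]
```

Layers where the two classes overlap score near chance and are skipped, which is what cuts the router parameter count relative to placing a router in every layer.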
🛡️ Threat Analysis
The paper explicitly addresses security risks introduced by Instruction Fine-tuning (IFT), where fine-tuning on task data erodes safety alignment, and proposes a data-mix strategy to restore security — directly targeting the transfer learning attack vector where fine-tuning undermines embedded safety.
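The data-mix defense against fine-tuning-induced safety erosion can be sketched as below. This is a generic illustration, not the paper's recipe: the 10% safety ratio, the function name `mix_safety_data`, and the example structure are all assumptions.

```python
import random

def mix_safety_data(task_examples, safety_examples, safety_ratio=0.1, seed=0):
    """Toy data-mix strategy: blend a small fraction of safety examples
    (harmful instruction -> refusal pairs) into the instruction
    fine-tuning set, so the tuned model retains refusal behavior while
    still learning the downstream task. The 10% ratio is illustrative."""
    rng = random.Random(seed)
    n_safety = max(1, int(len(task_examples) * safety_ratio))
    # Sample (with replacement) a small slice of safety data and shuffle
    # it into the task data so safety examples are seen throughout training.
    mixed = list(task_examples) + rng.choices(safety_examples, k=n_safety)
    rng.shuffle(mixed)
    return mixed
```

Because only a small fraction of the training mix is safety data, the downstream task gains from fine-tuning are largely preserved, which matches the paper's claim that security is restored "without compromising task performance gains."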