
UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models

Yuhao Sun 1, Zhuoer Xu 2, Shiwen Cui 2, Kun Yang 2, Lingyun Yu 1, Yongdong Zhang 1, Hongtao Xie 1


Published on arXiv (2510.02194)

OWASP LLM Top 10: LLM01 (Prompt Injection)

Key Finding

UpSafe°C achieves robust safety improvements against harmful and jailbreak inputs while maintaining competitive general task performance, with safety temperature enabling Pareto-optimal inference-time control.

UpSafe°C

Novel technique introduced


Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. Existing safety techniques -- including external guardrails, inference-time guidance, and post-training alignment -- each face limitations in balancing safety, utility, and controllability. In this work, we propose UpSafe$^\circ$C, a unified framework for enhancing LLM safety through safety-aware upcycling. Our approach first identifies safety-critical layers and upcycles them into a sparse Mixture-of-Experts (MoE) structure, where the router acts as a soft guardrail that selectively activates original MLPs and added safety experts. We further introduce a two-stage SFT strategy to strengthen safety discrimination while preserving general capabilities. To enable flexible control at inference time, we introduce a safety temperature mechanism, allowing dynamic adjustment of the trade-off between safety and utility. Experiments across multiple benchmarks, base models, and model scales demonstrate that UpSafe$^\circ$C achieves robust safety improvements against harmful and jailbreak inputs, while maintaining competitive performance on general tasks. Moreover, analysis shows that safety temperature provides fine-grained inference-time control that achieves the Pareto-optimal frontier between utility and safety. Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
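The upcycling idea described in the abstract can be sketched in a few lines: the original MLP of a safety-critical layer is kept as one expert, a new safety expert is added alongside it, and a small router produces soft weights over the two. The sketch below is a minimal, hypothetical illustration in numpy; the class name, shapes, and single-matrix "experts" are assumptions for clarity, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

class UpcycledLayer:
    """Toy sketch of a safety-aware upcycled MoE layer.

    The frozen original MLP and the added safety expert are each
    reduced to a single tanh-activated linear map here; the router
    emits two logits (original, safety) and its softmax acts as a
    trainable soft guardrail. All names are illustrative.
    """

    def __init__(self, d, rng):
        self.w_mlp = rng.standard_normal((d, d)) * 0.02     # stands in for the frozen original MLP
        self.w_safe = rng.standard_normal((d, d)) * 0.02    # stands in for the added safety expert
        self.w_router = rng.standard_normal((d, 2)) * 0.02  # router: 2 logits per token

    def forward(self, x):
        gate = softmax(x @ self.w_router)   # soft guardrail weights, sum to 1
        y_mlp = np.tanh(x @ self.w_mlp)     # original expert path
        y_safe = np.tanh(x @ self.w_safe)   # safety expert path
        return gate[0] * y_mlp + gate[1] * y_safe, gate
```

On a benign input the router would learn to put most weight on `gate[0]` (the original MLP), and on a harmful input to shift weight toward `gate[1]` (the safety expert); because the gate is a softmax rather than a hard switch, the guardrail is soft and differentiable.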


Key Contributions

  • Safety-aware upcycling that converts safety-critical LLM layers into sparse MoE structures where the router acts as a trainable soft guardrail distinguishing harmful from benign inputs
  • Two-stage SFT strategy: first trains safety experts on safety data (frozen original MLP), then trains only the router on mixed data to preserve general capabilities
  • Safety temperature mechanism enabling fine-grained inference-time adjustment of the safety-utility trade-off, achieving the Pareto-optimal frontier between the two objectives
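The safety temperature in the last contribution is most naturally read as a temperature applied to the router's logits before the softmax: lowering it sharpens the routing distribution (a more decisive guardrail), while raising it softens routing toward an even mix of experts. The exact formulation in the paper may differ; this is a hedged sketch of that reading.

```python
import numpy as np

def routed_weights(logits, temperature=1.0):
    """Illustrative safety-temperature routing.

    Divides the router logits by `temperature` before a stable
    softmax. temperature < 1 sharpens the gate (stronger commitment
    to one expert); temperature > 1 flattens it. The function name
    and semantics are assumptions, not the paper's API.
    """
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Sharper routing at low temperature, flatter at high temperature:
sharp = routed_weights([2.0, 1.0], temperature=0.5)
soft = routed_weights([2.0, 1.0], temperature=2.0)
```

Because the temperature is applied only at inference time, the same trained router can be swept across temperatures to trace a utility-safety trade-off curve, which is how a Pareto frontier between the two objectives would be exposed without retraining.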

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, training_time, black_box
Applications
large language model safety, jailbreak defense, harmful content prevention