SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models
Peng Ding, Wen Sun, Dailin Li, Wei Zou, Jiaming Wang, Jiajun Chen, Shujian Huang
Published on arXiv (2508.15648)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
SDGO significantly outperforms both prompt-based and training-based jailbreak defenses and generalizes robustly to out-of-distribution jailbreaking attacks while maintaining general-purpose helpfulness.
SDGO (Self-Discrimination-Guided Optimization)
Novel technique introduced
Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. This insight inspires us to explore aligning the model's inherent discrimination and generation capabilities. To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model's own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement. Our method does not require any additional annotated data or external models during the training phase. Extensive experiments demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. By aligning LLMs' discrimination and generation capabilities, SDGO brings robust performance against out-of-distribution (OOD) jailbreaking attacks. This alignment achieves tighter coupling between these two capabilities, enabling the model's generation capability to be further enhanced with only a small amount of discriminative samples. Our code and datasets are available at https://github.com/NJUNLP/SDGO.
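The core idea in the abstract — using the model's own discrimination ability as the reward signal for reinforcement learning — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names are hypothetical, and a toy keyword check stands in for the LLM's own harmfulness judgment, which in SDGO would come from prompting the same model as a discriminator.

```python
# Sketch of a self-discrimination reward, assuming a hypothetical setup
# where the same model plays both generator and discriminator roles.

def discriminate(prompt: str, response: str) -> bool:
    """Stand-in for asking the SAME model, as a discriminator, whether
    the (prompt, response) pair is harmful. A real system would prompt
    the LLM for this judgment; here a toy keyword check is used."""
    harmful_markers = {"bomb", "malware", "steal"}
    complied = "cannot help" not in response.lower()
    return complied and any(w in prompt.lower() for w in harmful_markers)

def self_discrimination_reward(prompt: str, response: str) -> float:
    """Scalar RL reward: penalize generations the model itself judges
    harmful, reward ones it judges safe. No external judge model or
    annotated data is required, matching the paper's stated setup."""
    return -1.0 if discriminate(prompt, response) else 1.0
```

In an RL loop, this reward would score each sampled generation, so the generator is iteratively pushed toward the refusals its own discriminator already recognizes as the safe behavior.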
Key Contributions
- Identifies a safety inconsistency in LLMs: they are more effective at discriminating harmful requests than at refusing them as generators
- Proposes SDGO, a reinforcement learning framework that uses the model's own discrimination capability as a reward signal for safety alignment, requiring no external models or annotated data
- Demonstrates robust out-of-distribution generalization against novel jailbreaking attacks while preserving helpfulness on general benchmarks