Latest papers

1 paper
defense · arXiv · Aug 21, 2025

SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

Peng Ding, Wen Sun, Dailin Li et al. · Meituan Inc. · Dalian University of Technology +1 more

An RL defense that uses an LLM's own harm-discrimination ability as a reward signal, closing the gap between identifying jailbreaks and resisting them

Prompt Injection nlp
PDF Code