Shujian Huang

Papers in Database (2)

defense · arXiv · Aug 21, 2025

SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

Peng Ding, Wen Sun, Dailin Li et al. · Meituan Inc. · Dalian University of Technology +1 more

RL defense uses LLMs' own harm-discrimination ability as a reward signal to close the gap between identifying and resisting jailbreaks

Prompt Injection nlp
attack · arXiv · Nov 1, 2025

Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack

Peng Ding, Jun Kuang, Wen Sun et al. · Nanjing University · Meituan

Jailbreaks LLMs via minimal intent-shifting text edits, bypassing safety filters with natural human-readable prompts

Prompt Injection nlp