Xuezhi Cao

Papers in Database (2)

attack · arXiv · Nov 1, 2025

Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack

Peng Ding, Jun Kuang, Wen Sun et al. · Nanjing University · Meituan

Jailbreaks LLMs via minimal intent-shifting text edits, bypassing safety filters with natural, human-readable prompts

Prompt Injection · NLP
attack · arXiv · Sep 18, 2025

MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models

Siyu Yan, Long Zeng, Xuecheng Wu et al. · East China Normal University · Xi’an Jiaotong University +2 more

Attacks multi-turn LLM safety via MCTS-guided frame-semantic trajectories; defends with early-intervention dialogue alignment

Prompt Injection · NLP