Yuexiao Liu

h-index: 2 · 43 citations · 8 papers (total)

Papers in Database (1)

attack · arXiv · Oct 17, 2025

HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment

Yuexiao Liu, Lijun Li, Xingjun Wang et al. · Tsinghua University · Shanghai Artificial Intelligence Laboratory

Exploits RLVR fine-tuning with only 64 harmful prompts to rapidly reverse LLM safety alignment, achieving a 96% attack success rate

Transfer Learning · Attack · NLP
1 citation · 1 influential