Yuheng Zhang

Papers in Database (1)

attack arXiv Apr 3, 2026

Beyond Semantic Manipulation: Token-Space Attacks on Reward Models

Yuheng Zhang, Mingyue Huo, Minghao Zhu et al. · University of Illinois Urbana-Champaign · University of Massachusetts Amherst

Token-space adversarial attack on RLHF reward models that bypasses semantic constraints to generate nonsensical but high-reward outputs

Input Manipulation Attack nlp