Fandong Meng

h-index: 39 · 6,648 citations · 241 papers (total)

Papers in Database (1)

benchmark · arXiv · Feb 4, 2026

The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

Zhexin Zhang, Yida Lu, Junfeng Fang et al. · Tsinghua University · National University of Singapore +1 more

First systematic taxonomy of training-time implicit safety risks in RL-trained LLMs, showing risky behaviors in 74.4% of runs

Model Skewing · Excessive Agency · nlp · reinforcement-learning