Cheng Wang

benchmark · arXiv · Sep 4, 2025

Cheng Wang, Zeming Wei, Qin Liu et al. · National University of Singapore · Peking University +1 more

Probing-based LLM safety detectors learn surface patterns rather than semantic harm, and fail badly on out-of-distribution malicious inputs

Prompt Injection nlp

Papers in Database (1)