Muhao Chen

Papers in Database (2)

benchmark · arXiv · Sep 4, 2025

False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

Cheng Wang, Zeming Wei, Qin Liu et al. · National University of Singapore · Peking University +1 more

Probing-based LLM safety detectors learn surface patterns, not semantic harm, and fail badly on out-of-distribution malicious inputs.

Prompt Injection · nlp
benchmark · arXiv · Apr 1, 2026

Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models

Weidi Luo, Xiaofei Wen, Tenghao Huang et al. · University of Georgia · University of California +3 more

A benchmark and guardrail for detecting jailbreak attacks that bypass LLM safety alignment in the food safety domain.

Prompt Injection · nlp