Hanghang Tong

h-index: 6 · 153 citations · 18 papers (total)

Papers in Database (1)

defense · arXiv · Jan 7, 2026

ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification

Xiao Lin, Philip Li, Zhichen Zeng et al. · University of Illinois Urbana-Champaign · Visa

Defends LLMs against jailbreaks by amplifying discrepancies in internal features across layers, modules, and tokens, detecting attacks without training examples.

Prompt Injection · nlp
2 citations