Latest papers

1 paper
defense arXiv Jan 7, 2026

ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification

Xiao Lin, Philip Li, Zhichen Zeng et al. · University of Illinois Urbana-Champaign · Visa

Defends LLMs against jailbreaks by amplifying discrepancies in internal features at the layer, module, and token levels, detecting attacks zero-shot, without training examples

Prompt Injection nlp
2 citations