Latest papers

2 papers
defense arXiv Mar 12, 2026

Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment

Zhiyu Xue, Zimo Qi, Guangliang Liu et al. · University of California · Johns Hopkins University +2 more

Analyzes the mechanisms that trigger refusals in safety-aligned LLMs and mitigates overrefusal while preserving jailbreak defenses

Prompt Injection nlp
PDF
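The overrefusal problem this paper targets can be made concrete with a small sketch: score a model's responses to *benign* prompts and report the fraction that were refused. This is a minimal, hypothetical measurement harness, not the paper's method; the refusal-marker keyword heuristic and all names are illustrative assumptions.

```python
# Hypothetical sketch: overrefusal = fraction of benign prompts refused.
# The keyword heuristic below is an illustrative assumption, not the
# detection method used in the paper.

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")


def is_refusal(response: str) -> bool:
    """Flag a response as a refusal via a simple keyword heuristic."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def overrefusal_rate(responses: list[str]) -> float:
    """Share of responses to benign prompts that were refused."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


# Example: responses a model gave to harmless prompts.
benign_responses = [
    "Here is a recipe for chocolate cake: ...",
    "I'm sorry, I can't help with that request.",
    "Sure! The capital of France is Paris.",
    "I cannot assist with this topic.",
]
print(overrefusal_rate(benign_responses))  # 0.5
```

A lower rate on benign prompts, holding refusal rate on harmful prompts fixed, is the trade-off the paper's "deactivating refusal triggers" framing aims to improve.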
benchmark arXiv Jan 3, 2025

Towards Robust and Accurate Stability Estimation of Local Surrogate Models in Text-based Explainable AI

Christopher Burger, Charles Walter, Thai Le et al. · University of Mississippi · Indiana University +1 more

Proposes synonymity-weighted similarity metrics for more robustly measuring the adversarial stability of text-based explanation methods such as LIME

Input Manipulation Attack nlp
1 citation PDF
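The idea of synonymity weighting can be sketched as a similarity score between two LIME-style explanations (ranked lists of `(feature, weight)` pairs) in which a synonym match earns partial credit instead of zero. The synonym table, scoring scheme, and all names below are illustrative assumptions, not the paper's exact metric.

```python
# Hypothetical sketch: compare two feature-attribution explanations,
# crediting synonym matches partially. The synonym table and scoring
# are illustrative assumptions, not the paper's metric.

SYNONYMS = {
    ("awful", "terrible"): 0.9,
    ("good", "great"): 0.8,
}


def syn_score(a: str, b: str) -> float:
    """1.0 for exact match, a table lookup for synonyms, else 0.0."""
    if a == b:
        return 1.0
    return SYNONYMS.get((a, b)) or SYNONYMS.get((b, a)) or 0.0


def weighted_similarity(exp_a, exp_b) -> float:
    """Weight each feature's best (synonym-aware) match by its attribution."""
    total = 0.0
    for feat_a, w_a in exp_a:
        best = max((syn_score(feat_a, feat_b) for feat_b, _ in exp_b),
                   default=0.0)
        total += abs(w_a) * best
    norm = sum(abs(w) for _, w in exp_a)
    return total / norm if norm else 0.0


# An original explanation vs. one from an adversarially perturbed input:
orig = [("terrible", 0.6), ("movie", 0.3)]
perturbed = [("awful", 0.5), ("movie", 0.4)]
print(round(weighted_similarity(orig, perturbed), 3))  # 0.933
```

Under a plain set-overlap metric, swapping "terrible" for "awful" would look like a large explanation change; the synonym-aware score keeps the two explanations close, which is the kind of stability estimate the paper argues for.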