Latest papers

4 papers
defense arXiv Mar 24, 2026

SafeSeek: Universal Attribution of Safety Circuits in Language Models

Miao Yu, Siyuan Fu, Moayad Aloqaily et al. · University of Science and Technology of China · Squirrel AI Learning +4 more

Mechanistic interpretability framework identifying sparse safety circuits in LLMs for backdoor removal and alignment preservation (toy sketch below)

Model Poisoning Input Manipulation Attack Prompt Injection nlp
PDF
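
The circuit-finding idea can be illustrated with a minimal mean-ablation loop. Everything below is a synthetic stand-in (random activations and a hypothetical linear refusal probe), not SafeSeek's actual method or API: score each attention head by how much refusal behavior drops when that head is ablated, and keep the sparse top scorers as the candidate safety circuit.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D, N = 12, 16, 32                    # heads, head dim, harmful prompts
acts = rng.normal(size=(N, H, D))       # cached per-head activations (synthetic)
probe = rng.normal(size=H * D)          # assumed linear "refusal" readout

def refusal_score(a: np.ndarray) -> float:
    """Mean refusal logit over prompts under the linear probe."""
    return float((a.reshape(N, -1) @ probe).mean())

base = refusal_score(acts)

# Mean-ablate one head at a time; the heads whose removal hurts refusal
# the most form the sparse candidate safety circuit.
drops = []
for h in range(H):
    patched = acts.copy()
    patched[:, h, :] = acts[:, h, :].mean(axis=0)   # replace with its mean
    drops.append(base - refusal_score(patched))

circuit = np.argsort(drops)[::-1][:3]
print("candidate safety-circuit heads:", circuit.tolist())
```

In a real model the same loop would patch head outputs during a forward pass rather than operate on cached activations.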
attack arXiv Dec 1, 2025

EmoRAG: Evaluating RAG Robustness to Symbolic Perturbations

Xinyun Zhou, Xinfeng Li, Yinan Peng et al. · Zhejiang University · Hengxin Technology +5 more

Emoticon injection into RAG queries poisons retrieval with ~100% success, exposing a critical vulnerability in LLM-integrated systems (toy sketch below)

Input Manipulation Attack Prompt Injection nlp
1 citation PDF
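
A toy lexical retriever is enough to see why symbolic perturbations are dangerous. The corpus, tokenizer, and emoticon-injection pattern below are illustrative assumptions, not EmoRAG's setup: sprinkling ":)" through a query redirects retrieval toward whichever document happens to contain the symbols.

```python
from collections import Counter
import math

docs = {
    "d1": "reset your account password via the settings page",
    "d2": "emoticons :) :( are sequences of punctuation characters",
}

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str) -> str:
    """Return the highest-scoring document for a query."""
    q = Counter(query.lower().split())
    return max(docs, key=lambda d: cosine(q, Counter(docs[d].lower().split())))

clean = "how do I reset my account password"
poisoned = " ".join(tok + " :)" for tok in clean.split())  # emoticon injection

print(retrieve(clean))     # d1: the relevant document
print(retrieve(poisoned))  # d2: the emoticons dominate the match
```

Dense retrievers are not lexical, but the paper's point is analogous: symbolic noise the user barely notices can dominate the query representation.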
defense arXiv Oct 11, 2025

Backdoor Collapse: Eliminating Unknown Threats via Known Backdoor Aggregation in Language Models

Liang Lin, Miao Yu, Moayad Aloqaily et al. · Nanyang Technological University · University of Science and Technology of China +4 more

Defends LLMs against unknown backdoors by deliberately injecting known triggers to aggregate backdoor representations and then purge them (toy sketch below)

Model Poisoning nlp
PDF
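
The aggregate-then-purge intuition can be sketched with plain vectors. The directions and update rule below are synthetic assumptions rather than the paper's training procedure: pulling unknown backdoor directions toward one deliberately injected trigger lets a single projection remove them together.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
known = rng.normal(size=d)
known /= np.linalg.norm(known)        # deliberately injected trigger direction

# Unknown backdoors start out only loosely aligned with the known trigger.
unknown = [known + 0.8 * rng.normal(size=d) for _ in range(3)]
unknown = [u / np.linalg.norm(u) for u in unknown]

def aggregate(u: np.ndarray, anchor: np.ndarray, step: float = 0.95) -> np.ndarray:
    """Pull a backdoor direction toward the injected known trigger."""
    v = (1 - step) * u + step * anchor
    return v / np.linalg.norm(v)

def purge(v: np.ndarray, anchor: np.ndarray) -> np.ndarray:
    """Project out the shared (aggregated) backdoor direction."""
    return v - (v @ anchor) * anchor

collapsed = [aggregate(u, known) for u in unknown]
residuals = [float(np.linalg.norm(purge(v, known))) for v in collapsed]
print([round(r, 3) for r in residuals])   # small residuals: one purge hits all
```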
defense arXiv Sep 26, 2025

Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models

Miao Yu, Zhenhong Zhou, Moayad Aloqaily et al. · University of Science and Technology of China · Nanyang Technological University +5 more

Mechanistic interpretability framework identifies backdoor-responsible attention heads in LLMs, enabling surgical neutralization or amplification of backdoor behavior (toy sketch below)

Model Poisoning nlp
1 citation PDF
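
Head-level attribution reduces, in caricature, to comparing activations on triggered versus clean inputs. The activations below are synthetic assumptions standing in for a real model's: the heads whose outputs shift most under the trigger are attributed, then rescaled to neutralize (gain 0) or amplify (gain > 1) the backdoor.

```python
import numpy as np

rng = np.random.default_rng(2)
H, D, N = 12, 16, 64
clean = rng.normal(size=(N, H, D))      # per-head activations, clean inputs
triggered = clean.copy()
triggered[:, 3] += 2.0                  # head 3 secretly carries the backdoor
triggered[:, 7] += 1.5                  # head 7 is a weaker accomplice

# Attribution score: mean activation shift per head under the trigger.
score = np.linalg.norm((triggered - clean).mean(axis=0), axis=-1)
suspects = np.argsort(score)[::-1][:2]
print("attributed heads:", suspects.tolist())    # -> [3, 7]

gain = np.ones(H)
gain[suspects] = 0.0                             # 0.0 neutralizes; >1 amplifies
edited = triggered * gain[None, :, None]         # surgical per-head rescaling
```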