Latest papers

2 papers
defense · arXiv · Mar 24, 2026

SafeSeek: Universal Attribution of Safety Circuits in Language Models

Miao Yu, Siyuan Fu, Moayad Aloqaily et al. · University of Science and Technology of China · Squirrel AI Learning +4 more

Mechanistic interpretability framework identifying sparse safety circuits in LLMs for backdoor removal and alignment preservation

Model Poisoning · Input Manipulation Attack · Prompt Injection · nlp
defense · arXiv · Aug 1, 2025

DBLP: Noise Bridge Consistency Distillation For Efficient And Reliable Adversarial Purification

Chihan Huang, Belal Alsinglawi, Islam Al-qudah · Zayed University · Nanjing University of Science and Technology +1 more

Distills diffusion purification into a latent consistency model, enabling real-time adversarial input cleaning with state-of-the-art robustness

Input Manipulation Attack · vision