Siyuan Fu

Papers in Database (1)

defense · arXiv · Mar 24, 2026

SafeSeek: Universal Attribution of Safety Circuits in Language Models

Miao Yu, Siyuan Fu, Moayad Aloqaily et al. · University of Science and Technology of China · Squirrel AI Learning +4 more

Mechanistic interpretability framework identifying sparse safety circuits in LLMs for backdoor removal and alignment preservation

Model Poisoning · Input Manipulation Attack · Prompt Injection · nlp