Amir Abdullah

Papers in Database (1)

attack arXiv Sep 7, 2025 · Sep 2025

Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah et al. · Singapore University of Technology and Design · Nanyang Technological University +2 more

Ablates SAE latent features mediating refusal in LLMs to produce mechanistically-grounded jailbreaks via a three-stage pipeline

Prompt Injection nlp
PDF