Ranjan Satapathy

Papers in Database (2)

defense arXiv Aug 16, 2025 · Aug 2025

Mitigating Jailbreaks with Intent-Aware LLMs

Wei Jie Yeo, Ranjan Satapathy, Erik Cambria · Nanyang Technological University · A*STAR

Fine-tunes LLMs to infer the intent behind an instruction before responding, reducing the success rate of every jailbreak attack category to below 50%

Prompt Injection nlp
attack arXiv Sep 7, 2025 · Sep 2025

Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah et al. · Singapore University of Technology and Design · Nanyang Technological University +2 more

Ablates the sparse autoencoder (SAE) latent features that mediate refusal in LLMs, producing mechanistically grounded jailbreaks via a three-stage pipeline

Prompt Injection nlp