Hongliang Liu

Papers in Database (1)

tool arXiv Apr 30, 2026

Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

Hongliang Liu, Tung-Ling Li, Yuhao Wu · Palo Alto Networks

Two-pass perturbation probing identifies 50-neuron safety-refusal circuits in aligned LLMs, enabling precise ablation interventions.

Prompt Injection nlp