Nicholas Crispino

Papers in Database (1)

attack · arXiv · Sep 16, 2025

RepIt: Steering Language Models with Concept-Specific Refusal Vectors

Vincent Siu, Nathan W. Henry, Nicholas Crispino et al. · University of California

Isolates concept-specific refusal vectors to surgically bypass LLM safety refusals on targeted topics such as WMDs, using ~12 examples

Prompt Injection nlp