Latest papers

1 paper
defense · arXiv · Dec 18, 2025

Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics

Iker García-Ferrero, David Montero, Roman Orus · Multiverse Computing

An activation steering method that removes political over-refusal in LLMs while preserving safety alignment on harmful content.

Prompt Injection nlp