arXiv · Oct 13, 2025
Sarah Ball, Andreas Haupt · Ludwig-Maximilians-Universität München · Munich Center for Machine Learning · Stanford University
Boundary Guidance: RL fine-tuning that steers LLM outputs away from safety-classifier margins, reducing jailbreak bypasses and over-refusals simultaneously
Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier's decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier's margin. On a benchmark of jailbreak, ambiguous, and long-context prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.
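The core idea — rewarding samples for sitting far from the safety classifier's decision boundary rather than merely below its threshold — can be illustrated with a minimal sketch. This is an assumed reward shape, not the paper's actual design; the function name, threshold value, and signed-margin formulation are all hypothetical.

```python
# Hypothetical sketch of a margin-avoiding reward for RL fine-tuning.
# Not the paper's exact reward: the threshold and signed-margin form
# are illustrative assumptions.

def boundary_guidance_reward(p_unsafe: float, threshold: float = 0.5) -> float:
    """Reward that grows with distance from the classifier's boundary.

    p_unsafe: the safety classifier's probability that a sample is unsafe.
    threshold: the classifier's decision boundary (assumed 0.5 here).

    Samples confidently on the safe side get a positive reward; samples
    confidently on the unsafe side get a negative reward; samples near
    the margin get a reward close to zero, so the policy is pushed away
    from the boundary in both directions.
    """
    margin = abs(p_unsafe - threshold)          # distance from the boundary
    sign = -1.0 if p_unsafe > threshold else 1.0  # penalize the unsafe side
    return sign * margin

# Example: a clearly safe sample (p_unsafe = 0.05) earns ~+0.45, while a
# borderline one (p_unsafe = 0.48) earns only ~+0.02 — naive "just pass
# the filter" fine-tuning would treat both as equally acceptable.
```

Contrast this with the naive objective criticized in the abstract, which would be a step reward (`1.0` if `p_unsafe < threshold` else `0.0`) and therefore gives the policy no incentive to move away from the margin once it is barely on the safe side.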