defense 2025

Don't Walk the Line: Boundary Guidance for Filtered Generation

Sarah Ball 1,2, Andreas Haupt 3

1 citation · 52 references · arXiv

Published on arXiv

2510.11834

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Boundary Guidance achieves Pareto improvements in safety and helpfulness across most model scales (0.5B to 14B), with the boundary margin term alone often sufficient for larger models.

Boundary Guidance

Novel technique introduced


Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier's decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier's margin. On a benchmark of jailbreak, ambiguous, and long-context prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.
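The claim that outputs near the decision boundary raise both error types can be illustrated with a toy calculation. The sketch below is not from the paper: it assumes the classifier's score is the true score plus Gaussian noise, and the names `flip_probability`, `cutoff`, and `noise_std` are hypothetical. Under that assumption, the chance that noise pushes a score across the cutoff peaks at the boundary and falls off quickly with distance from it.

```python
import math


def flip_probability(score: float, cutoff: float = 0.5, noise_std: float = 0.1) -> float:
    """Probability that zero-mean Gaussian noise moves `score` across `cutoff`.

    Illustrative assumption only: the noise model and default values are
    not taken from the paper.
    """
    margin = abs(score - cutoff)
    # Standard normal CDF evaluated at -margin / noise_std.
    return 0.5 * (1.0 + math.erf(-margin / (noise_std * math.sqrt(2.0))))


if __name__ == "__main__":
    # Scores closer to the cutoff are more likely to be misclassified.
    for s in (0.5, 0.6, 0.9):
        print(f"score={s:.1f}  flip probability={flip_probability(s):.5f}")
```

In this toy model, an output sitting exactly on the cutoff is misclassified half the time, which is the situation a boundary-avoiding objective tries to make rare.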


Key Contributions

  • Formalizes why compound-system utility degrades near a safety classifier's decision boundary, motivating boundary-avoiding training objectives.
  • Introduces Boundary Guidance, an RL fine-tuning method that augments a utility reward with a margin term rewarding outputs whose safety score is far from the classifier cutoff.
  • Evaluates the method across model scales (0.5B–14B), showing consistent reductions in judged harmfulness while preserving or improving helpfulness, including Pareto improvements over baselines.
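The reward shaping described in the contributions above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the absolute-distance form of the margin term, the `cutoff` of 0.5, and the two weights are all hypothetical choices.

```python
def boundary_guided_reward(
    utility: float,
    safety_score: float,
    cutoff: float = 0.5,
    utility_weight: float = 1.0,
    margin_weight: float = 1.0,
) -> float:
    """Combine a utility reward with a margin term (hypothetical form).

    `safety_score` is assumed to be the classifier's score in [0, 1], and
    the margin term is assumed to be the absolute distance from `cutoff`.
    """
    margin = abs(safety_score - cutoff)
    return utility_weight * utility + margin_weight * margin


if __name__ == "__main__":
    # Two outputs with equal utility: the one farther from the cutoff scores higher.
    near = boundary_guided_reward(utility=1.0, safety_score=0.55)
    far = boundary_guided_reward(utility=1.0, safety_score=0.95)
    print(near < far)
```

Setting `utility_weight=0.0` in this sketch gives a margin-only reward, loosely mirroring the ablation in which the margin term alone is reported to be often sufficient for larger models.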

Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, training_time
Datasets
custom benchmark of jailbreak, ambiguous, and long-context prompts
Applications
llm safety alignment, filtered generation, jailbreak defense