A Lightweight Explainable Guardrail for Prompt Safety
Md Asiful Islam, Mihai Surdeanu
Published on arXiv
2602.15853
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Matches or exceeds larger state-of-the-art guardrails in both prompt classification and explainability, in-domain and out-of-domain across three datasets, at a considerably smaller model size.
LEG (Lightweight Explainable Guardrail)
Novel technique introduced
We propose a lightweight explainable guardrail (LEG) method for the classification of unsafe prompts. LEG uses a multi-task learning architecture to jointly learn a prompt classifier and an explanation classifier, where the latter labels the prompt words that explain the overall safe/unsafe decision. LEG is trained on synthetic explainability data, generated with a novel strategy that counteracts the confirmation biases of LLMs. Lastly, LEG's training uses a novel loss that captures global explanation signals and combines cross-entropy and focal losses with uncertainty-based weighting. LEG obtains equivalent or better performance than the state of the art for both prompt classification and explainability, both in-domain and out-of-domain on three datasets, even though its model size is considerably smaller than that of current approaches. If accepted, we will release all models and the annotated dataset publicly.
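The multi-task design described above can be sketched as a shared encoder feeding two heads: a pooled head that classifies the whole prompt as safe/unsafe, and a per-word head that marks which words explain that decision. This is a minimal illustrative sketch, not the paper's implementation: the class name, a single linear layer standing in for the encoder, and all dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class LEGSketch:
    """Illustrative multi-task guardrail: one shared encoder, two heads.

    All names and shapes here are hypothetical; the paper's encoder would be
    a pretrained transformer, not a single linear map.
    """

    def __init__(self, d_model=16, n_prompt_classes=2, n_word_labels=2):
        self.W_enc = rng.normal(0.0, 0.1, (d_model, d_model))      # shared encoder
        self.W_prompt = rng.normal(0.0, 0.1, (d_model, n_prompt_classes))
        self.W_word = rng.normal(0.0, 0.1, (d_model, n_word_labels))

    def forward(self, token_embs):
        # token_embs: (seq_len, d_model) word embeddings for one prompt.
        h = np.tanh(token_embs @ self.W_enc)             # shared representation
        prompt_logits = h.mean(axis=0) @ self.W_prompt   # pooled -> safe/unsafe
        word_logits = h @ self.W_word                    # per-word explanation labels
        return softmax(prompt_logits), softmax(word_logits, axis=-1)

model = LEGSketch()
p_prompt, p_words = model.forward(rng.normal(size=(5, 16)))
```

Because both heads backpropagate through the same encoder, the word-level explanation objective acts as an auxiliary signal that can regularize the prompt-level classifier, which is the usual motivation for multi-task setups of this kind.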
Key Contributions
- Multi-task learning architecture that jointly trains a prompt safety classifier and a word-level explanation classifier for interpretability.
- Novel synthetic data generation strategy for explainability training that counteracts LLM confirmation bias.
- Novel loss function combining cross-entropy, focal loss, and uncertainty-based weighting, with a global explanation signal used as weak supervision.
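The loss combination in the last contribution can be sketched as follows. This is an assumed reconstruction, not the paper's exact formulation: it pairs cross-entropy on the prompt label with focal loss on the word-level explanations, and weights the two task losses by learned homoscedastic uncertainty in the style of Kendall et al. (2018); the paper's global explanation signal is not modeled here.

```python
import numpy as np

def cross_entropy(p, y):
    # p: class-probability vector, y: gold class index.
    return -np.log(p[y] + 1e-12)

def focal_loss(p, y, gamma=2.0):
    # Focal loss down-weights easy examples by (1 - p_t)^gamma, which helps
    # with sparse word-level explanation labels; gamma=0 recovers CE.
    pt = p[y]
    return -((1.0 - pt) ** gamma) * np.log(pt + 1e-12)

def combined_loss(p_prompt, y_prompt, p_words, y_words, log_vars):
    """Hypothetical combined objective: CE (prompt) + focal (words),
    weighted by per-task log-variances that would be learned in training."""
    l_prompt = cross_entropy(p_prompt, y_prompt)
    l_words = np.mean([focal_loss(p, y) for p, y in zip(p_words, y_words)])
    losses = np.array([l_prompt, l_words])
    # exp(-s) scales each task loss; the +s term keeps s from growing unboundedly.
    return float(np.sum(np.exp(-log_vars) * losses + log_vars))
```

With both log-variances at zero the combined loss reduces to a plain sum of the two task losses; during training the weights would shift toward the less noisy task automatically.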