Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation

Shayan Ali Hassan 1, Tao Ni 1, Zafar Ayyub Qazi 2,1, Marco Canini 1

0 citations · 53 references · arXiv (Cornell University)

Published on arXiv · 2602.08062

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

BAGEL achieves F1=0.92 using 5 ensemble members (430M parameters total), matching or exceeding billion-parameter guardrails like OpenAI Moderation API and ShieldGemma while remaining robust through nine incremental updates for new attack types.

BAGEL (Bootstrap AGgregated Ensemble Layer)

Novel technique introduced


Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and generation. However, these systems remain susceptible to malicious prompts that induce unsafe or policy-violating behavior through harmful requests, jailbreak techniques, and prompt injection attacks. Existing defenses face fundamental limitations: black-box moderation APIs offer limited transparency and adapt poorly to evolving threats, while white-box approaches using large LLM judges impose prohibitive computational costs and require expensive retraining for new attacks. Current systems force designers to choose between performance, efficiency, and adaptability. To address these challenges, we present BAGEL (Bootstrap AGgregated Ensemble Layer), a modular, lightweight, and incrementally updatable framework for malicious prompt detection. BAGEL employs a bootstrap-aggregation and mixture-of-experts-inspired ensemble of fine-tuned models, each specialized on a different attack dataset. At inference, BAGEL uses a random forest router to identify the most suitable ensemble member, then applies stochastic selection to sample additional members for prediction aggregation. When new attacks emerge, BAGEL updates incrementally by fine-tuning a small prompt-safety classifier (86M parameters) and adding the resulting model to the ensemble. BAGEL achieves an F1 score of 0.92 by selecting just 5 ensemble members (430M parameters), outperforming the OpenAI Moderation API and ShieldGemma, which require billions of parameters. Performance remains robust after nine incremental updates, and BAGEL provides interpretability through its router's structural features. Our results show that ensembles of small fine-tuned classifiers can match or exceed billion-parameter guardrails while offering the adaptability and efficiency required for production systems.
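The inference procedure described in the abstract (a router picks the most suitable expert, stochastic selection samples additional ensemble members, and their predictions are aggregated) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the `Expert` type, the averaging rule, the 0.5 threshold, and the fixed seed are all assumptions for demonstration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Expert:
    name: str                         # attack dataset this member specializes on
    predict: Callable[[str], float]   # returns a malicious-probability score

def bagel_predict(prompt: str,
                  experts: List[Expert],
                  route: Callable[[str], int],
                  k: int = 5,
                  threshold: float = 0.5,
                  seed: int = 0) -> bool:
    """Route to the best expert, stochastically sample k-1 more members,
    and flag the prompt if the averaged score exceeds the threshold."""
    best = route(prompt)                          # router picks the most suitable member
    rng = random.Random(seed)
    others = [i for i in range(len(experts)) if i != best]
    chosen = [best] + rng.sample(others, min(k - 1, len(others)))
    scores = [experts[i].predict(prompt) for i in chosen]
    return sum(scores) / len(scores) >= threshold  # aggregated verdict
```

In this sketch `route` stands in for the paper's random forest router; any classifier over prompt features could fill that role.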


Key Contributions

  • Bootstrap-aggregated ensemble of small (86M parameter) fine-tuned classifiers with a random forest router that selects the most suitable expert per prompt
  • Incremental update mechanism that adds new expert models for emerging attack types without retraining the full system
  • Achieves F1=0.92 with only 430M total parameters, outperforming OpenAI Moderation API and ShieldGemma while providing interpretability via the router's structural features
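The incremental update mechanism in the second contribution can be sketched as follows. This is a hypothetical outline under stated assumptions: the function names and the callable signatures are illustrative, and the fine-tuning step is abstracted away; the key point from the paper is that only a small (~86M-parameter) classifier is trained and appended, while existing members stay untouched.

```python
from typing import Callable, List, Sequence, Tuple

# A labeled example is (prompt, label) with label 1 = malicious, 0 = benign.
Example = Tuple[str, int]
ExpertFn = Callable[[str], float]

def incremental_update(
    experts: List[ExpertFn],
    new_attack_data: Sequence[Example],
    fine_tune: Callable[[Sequence[Example]], ExpertFn],
) -> List[ExpertFn]:
    """Append a freshly fine-tuned specialist for an emerging attack type.

    Existing ensemble members are frozen: no full-system retraining occurs,
    only the new small prompt-safety classifier is trained.
    """
    new_expert = fine_tune(new_attack_data)   # small classifier, per the paper ~86M params
    return experts + [new_expert]             # ensemble grows by one member
```

After an update like this, the lightweight router would also be refit over the enlarged ensemble, which remains far cheaper than retraining a monolithic guardrail model.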

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Datasets
ToxicChat, JailbreakBench
Applications
llm guardrails, jailbreak detection, prompt injection detection, content moderation