Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation

Shayan Ali Hassan 1, Tao Ni 1, Zafar Ayyub Qazi 2,1, Marco Canini 1

0 citations · 53 references · arXiv (Cornell University)

Published on arXiv · 2602.08062

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

BAGEL achieves F1=0.92 using 5 ensemble members (430M parameters total), matching or exceeding billion-parameter guardrails like OpenAI Moderation API and ShieldGemma while remaining robust through nine incremental updates for new attack types.

BAGEL (Bootstrap AGgregated Ensemble Layer)

Novel technique introduced


Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and generation. However, these systems remain susceptible to malicious prompts that induce unsafe or policy-violating behavior through harmful requests, jailbreak techniques, and prompt injection attacks. Existing defenses face fundamental limitations: black-box moderation APIs offer limited transparency and adapt poorly to evolving threats, while white-box approaches using large LLM judges impose prohibitive computational costs and require expensive retraining for new attacks. Current systems force designers to choose between performance, efficiency, and adaptability. To address these challenges, we present BAGEL (Bootstrap AGgregated Ensemble Layer), a modular, lightweight, and incrementally updatable framework for malicious prompt detection. BAGEL employs a bootstrap-aggregation and mixture-of-experts-inspired ensemble of fine-tuned models, each specialized on a different attack dataset. At inference, BAGEL uses a random forest router to identify the most suitable ensemble member, then applies stochastic selection to sample additional members for prediction aggregation. When new attacks emerge, BAGEL updates incrementally by fine-tuning a small prompt-safety classifier (86M parameters) and adding the resulting model to the ensemble. BAGEL achieves an F1 score of 0.92 by selecting just 5 ensemble members (430M parameters), outperforming the OpenAI Moderation API and ShieldGemma, which require billions of parameters. Performance remains robust after nine incremental updates, and BAGEL provides interpretability through its router's structural features. Our results show that ensembles of small fine-tuned classifiers can match or exceed billion-parameter guardrails while offering the adaptability and efficiency required for production systems.
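The inference procedure described in the abstract (a router picks the most suitable expert, stochastic selection samples additional ensemble members, and their predictions are aggregated) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the `Expert` type, the averaging rule, the 0.5 threshold, and the fixed seed are all assumptions for demonstration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Expert:
    name: str                         # attack dataset this member specializes on
    predict: Callable[[str], float]   # returns a malicious-probability score

def bagel_predict(prompt: str,
                  experts: List[Expert],
                  route: Callable[[str], int],
                  k: int = 5,
                  threshold: float = 0.5,
                  seed: int = 0) -> bool:
    """Route to the best expert, stochastically sample k-1 more members,
    and flag the prompt if the averaged score exceeds the threshold."""
    best = route(prompt)                          # router picks the most suitable member
    rng = random.Random(seed)
    others = [i for i in range(len(experts)) if i != best]
    chosen = [best] + rng.sample(others, min(k - 1, len(others)))
    scores = [experts[i].predict(prompt) for i in chosen]
    return sum(scores) / len(scores) >= threshold  # aggregated verdict
```

In this sketch `route` stands in for the paper's random forest router; any classifier over prompt features could fill that role.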


Key Contributions

  • Bootstrap-aggregated ensemble of small (86M parameter) fine-tuned classifiers with a random forest router that selects the most suitable expert per prompt
  • Incremental update mechanism that adds new expert models for emerging attack types without retraining the full system
  • Achieves F1=0.92 with only 430M total parameters, outperforming OpenAI Moderation API and ShieldGemma while providing interpretability via the router's structural features
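The incremental update mechanism in the second contribution can be sketched as follows. This is a hypothetical outline under stated assumptions: the function names and the callable signatures are illustrative, and the fine-tuning step is abstracted away; the key point from the paper is that only a small (~86M-parameter) classifier is trained and appended, while existing members stay untouched.

```python
from typing import Callable, List, Sequence, Tuple

# A labeled example is (prompt, label) with label 1 = malicious, 0 = benign.
Example = Tuple[str, int]
ExpertFn = Callable[[str], float]

def incremental_update(
    experts: List[ExpertFn],
    new_attack_data: Sequence[Example],
    fine_tune: Callable[[Sequence[Example]], ExpertFn],
) -> List[ExpertFn]:
    """Append a freshly fine-tuned specialist for an emerging attack type.

    Existing ensemble members are frozen: no full-system retraining occurs,
    only the new small prompt-safety classifier is trained.
    """
    new_expert = fine_tune(new_attack_data)   # small classifier, per the paper ~86M params
    return experts + [new_expert]             # ensemble grows by one member
```

After an update like this, the lightweight router would also be refit over the enlarged ensemble, which remains far cheaper than retraining a monolithic guardrail model.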

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Datasets
ToxicChat, JailbreakBench
Applications
llm guardrails, jailbreak detection, prompt injection detection, content moderation