Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks
ChenYu Wu, Yi Wang, Yang Liao
Published on arXiv
2510.15017
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
The honeypot guardrail system significantly disrupts multi-turn jailbreak success rates (using ActorAttack against GPT-4o) while preserving benign user experience on the MHJ dataset.
Honeypot Utility Score (HUS)
Novel technique introduced
Large language models (LLMs) are increasingly vulnerable to multi-turn jailbreak attacks, where adversaries iteratively elicit harmful behaviors that bypass single-turn safety filters. Existing defenses predominantly rely on passive rejection, which either fails against adaptive attackers or overly restricts benign users. We propose a honeypot-based proactive guardrail system that transforms risk avoidance into risk utilization. Our framework fine-tunes a bait model to generate ambiguous, non-actionable but semantically relevant responses, which serve as lures to probe user intent. Combined with the protected LLM's safe reply, the system inserts proactive bait questions that gradually expose malicious intent through multi-turn interactions. We further introduce the Honeypot Utility Score (HUS), which measures both the attractiveness and feasibility of bait responses, and use a Defense Efficacy Rate (DER) to balance safety and usability. Initial experiments on the MHJ dataset with a recent attack method against GPT-4o show that our system significantly disrupts jailbreak success while preserving benign user experience.
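The turn-level flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `HoneypotGuardrail`, the `intent_scorer` callable, the `threshold` value, and the refusal text are all hypothetical, and the three models are replaced by stub functions so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HoneypotGuardrail:
    """Hypothetical wrapper: pairs a safe reply with a bait question each turn."""
    protected_llm: Callable[[List[str]], str]    # safe reply from the protected model
    bait_model: Callable[[List[str]], str]       # ambiguous, non-actionable lure or probe
    intent_scorer: Callable[[List[str]], float]  # assumed: malicious-intent estimate in [0, 1]
    threshold: float = 0.8                       # assumed cut-off, not from the paper
    history: List[str] = field(default_factory=list)

    def respond(self, user_msg: str) -> str:
        self.history.append(user_msg)
        # Once accumulated evidence of malicious intent is high enough, stop engaging.
        if self.intent_scorer(self.history) >= self.threshold:
            return "I can't help with that."
        # Otherwise return the safe reply followed by a bait question that probes intent.
        safe = self.protected_llm(self.history)
        bait = self.bait_model(self.history)
        return f"{safe}\n\n{bait}"


# Stub models for illustration only.
guard = HoneypotGuardrail(
    protected_llm=lambda h: "Here is some general safety information.",
    bait_model=lambda h: "Are you asking about a specific scenario?",
    intent_scorer=lambda h: min(1.0, 0.5 * sum("bypass" in m.lower() for m in h)),
)

first = guard.respond("Tell me about door locks.")
second = guard.respond("How would someone bypass one?")
third = guard.respond("Give exact steps to bypass it.")
```

With these stubs, the first two turns receive the safe reply plus the bait question, and the third turn is refused once the toy intent score reaches the threshold.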
Key Contributions
- Honeypot-based proactive guardrail framework that fine-tunes a 'bait model' to generate ambiguous, non-actionable decoy responses designed to expose malicious user intent across multi-turn interactions
- Honeypot Utility Score (HUS) that jointly measures the attractiveness (A-score) and feasibility/harmlessness (F-score) of bait responses to ensure lures are enticing without being dangerous
- Defense Efficacy Rate (DER) metric that balances jailbreak disruption against benign user experience preservation
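This summary does not state the formulas for HUS or DER, so the sketch below only illustrates what "jointly measuring" and "balancing" could look like. Both combinations (a harmonic mean of the A-score and F-score, and an unweighted mean of attack-disruption and benign-service rates) are assumptions chosen for illustration, and the function names `toy_hus` and `toy_der` are hypothetical; consult the paper for the actual definitions.

```python
def toy_hus(a_score: float, f_score: float) -> float:
    """Illustrative joint score: harmonic mean of attractiveness and feasibility.

    Assumption: both scores lie in [0, 1]. The harmonic mean is used here only
    because it is low whenever either component is low; it is not the paper's formula.
    """
    if not (0.0 <= a_score <= 1.0 and 0.0 <= f_score <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    if a_score + f_score == 0.0:
        return 0.0
    return 2.0 * a_score * f_score / (a_score + f_score)


def toy_der(attacks_disrupted: int, attacks_total: int,
            benign_served: int, benign_total: int) -> float:
    """Illustrative balance of safety and usability (assumed equal weighting)."""
    if attacks_total <= 0 or benign_total <= 0:
        raise ValueError("totals must be positive")
    disruption_rate = attacks_disrupted / attacks_total
    benign_rate = benign_served / benign_total
    return 0.5 * (disruption_rate + benign_rate)


balanced_bait = toy_hus(0.5, 0.5)      # equally attractive and harmless
unsafe_bait = toy_hus(1.0, 0.0)        # enticing but not harmless
example_der = toy_der(8, 10, 9, 10)    # made-up counts for illustration
```

Under this toy definition, a bait response that is highly attractive but not harmless scores zero, which matches the stated goal that lures be enticing without being dangerous.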