Defense · 2025

GuardNet: Graph-Attention Filtering for Jailbreak Defense in Large Language Models

Javad Forough, Mohammad M. Maheri, Hamed Haddadi

1 citation · 33 references · arXiv

Published on arXiv · 2509.23037

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

GuardNet achieves 99.8% prompt-level F1 on LLM-Fuzzer (up from 66.4%) and improves token-level F1 from 48–75% to 74–91% with IoU gains up to +28% over prior defenses.

GuardNet

Novel technique introduced


Large Language Models (LLMs) are increasingly susceptible to jailbreak attacks, which are adversarial prompts that bypass alignment constraints and induce unauthorized or harmful behaviors. These vulnerabilities undermine the safety, reliability, and trustworthiness of LLM outputs, posing critical risks in domains such as healthcare, finance, and legal compliance. In this paper, we propose GuardNet, a hierarchical filtering framework that detects and filters jailbreak prompts prior to inference. GuardNet constructs structured graphs that combine sequential links, syntactic dependencies, and attention-derived token relations to capture both linguistic structure and contextual patterns indicative of jailbreak behavior. It then applies graph neural networks at two levels: (i) a prompt-level filter that detects global adversarial prompts, and (ii) a token-level filter that pinpoints fine-grained adversarial spans. Extensive experiments across three datasets and multiple attack settings show that GuardNet substantially outperforms prior defenses. It raises prompt-level F1 scores from 66.4% to 99.8% on LLM-Fuzzer, and from 67–79% to over 94% on PLeak datasets. At the token level, GuardNet improves F1 from 48–75% to 74–91%, with IoU gains up to +28%. Despite its structural complexity, GuardNet maintains acceptable latency and generalizes well in cross-domain evaluations, making it a practical and robust defense against jailbreak threats in real-world LLM deployments.
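The abstract's hybrid graph construction can be sketched as follows. This is a hypothetical illustration, not the paper's exact construction: the function name, the typed-edge representation, and the attention threshold are all assumptions; the paper combines sequential links, syntactic dependencies, and attention-derived relations, but its precise edge-building rules are not given here.

```python
# Hypothetical sketch of a GuardNet-style hybrid token graph.
# Three edge types over token indices:
#   "seq"  - sequential links preserving word order
#   "dep"  - syntactic dependency links from a parser
#   "attn" - attention-derived links above a weight threshold
# The threshold value and typing scheme are illustrative assumptions.

def build_hybrid_graph(tokens, dep_edges, attention, attn_threshold=0.2):
    """Return a set of (src, dst, type) edges over token indices.

    tokens:     list of token strings
    dep_edges:  list of (head, dependent) index pairs from a parser
    attention:  row-stochastic matrix (list of lists) of attention weights
    """
    edges = set()
    n = len(tokens)
    # 1) Sequential links between adjacent tokens.
    for i in range(n - 1):
        edges.add((i, i + 1, "seq"))
    # 2) Syntactic dependency links.
    for head, dep in dep_edges:
        edges.add((head, dep, "dep"))
    # 3) Attention-derived links where the model attends strongly.
    for i in range(n):
        for j in range(n):
            if i != j and attention[i][j] >= attn_threshold:
                edges.add((i, j, "attn"))
    return edges

# Toy usage on a three-token prompt (parse and attention values are made up).
tokens = ["ignore", "previous", "instructions"]
deps = [(2, 0), (2, 1)]  # hypothetical parse: "instructions" heads both modifiers
attn = [[0.1, 0.6, 0.3],
        [0.5, 0.2, 0.3],
        [0.1, 0.1, 0.8]]
graph = build_hybrid_graph(tokens, deps, attn)
```

The resulting multi-relational edge set would then be fed to the GNN layers; keeping edge types distinct lets the network weight sequential, syntactic, and attention evidence differently.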


Key Contributions

  • Hierarchical GNN filtering framework operating at both prompt-level (global adversarial intent) and token-level (fine-grained adversarial span localization)
  • Hybrid token graph construction combining sequential links, syntactic dependency structures, and multi-head attention-derived token relations to capture jailbreak patterns
  • Substantially outperforms prior jailbreak defenses: F1 raised from 66.4% to 99.8% on LLM-Fuzzer and from 67–79% to over 94% on PLeak datasets
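The two-level design in the first contribution can be illustrated with a minimal message-passing sketch: one round of mean aggregation over the token graph, then a per-token score (token-level filter) and a mean-pooled readout score (prompt-level filter). The linear scorer, the single aggregation round, and the mean readout are simplifying assumptions for illustration; GuardNet's actual GNN layers and classifier heads are more elaborate.

```python
# Minimal sketch of hierarchical (prompt-level + token-level) filtering.
# One round of mean-aggregation message passing, then a shared linear
# scorer. Weights are placeholders, not trained values.

def message_pass(features, edges):
    """One round of mean aggregation over undirected (i, j) edges."""
    n, d = len(features), len(features[0])
    neighbors = [[] for _ in range(n)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    out = []
    for i in range(n):
        # Average each node with its neighbors (self-inclusive pooling).
        pool = [features[i]] + [features[j] for j in neighbors[i]]
        out.append([sum(v[k] for v in pool) / len(pool) for k in range(d)])
    return out

def two_level_scores(features, edges, w):
    """Return (prompt_score, token_scores) from a shared linear scorer w."""
    h = message_pass(features, edges)
    # Token-level filter: one score per node for span localization.
    token_scores = [sum(wk * xk for wk, xk in zip(w, hi)) for hi in h]
    # Prompt-level filter: mean readout over all token scores.
    prompt_score = sum(token_scores) / len(token_scores)
    return prompt_score, token_scores

# Toy usage with 2-d token features on a 3-node chain graph.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
chain = [(0, 1), (1, 2)]
prompt_score, token_scores = two_level_scores(feats, chain, w=[1.0, 1.0])
```

In a real deployment the prompt-level score would gate whole prompts before inference, while the token-level scores would flag which spans to redact or inspect.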

🛡️ Threat Analysis


Details

Domains
nlp, graph
Model Types
llm, transformer, gnn
Threat Tags
black_box, inference_time
Datasets
LLM-Fuzzer, PLeak
Applications
llm safety, jailbreak detection, content moderation