defense 2025

KG-DF: A Black-box Defense Framework against Jailbreak Attacks Based on Knowledge Graphs

Shuyuan Liu ¹, Jiawei Chen ¹,², Xiao Yang ³, Hang Su ³, Zhaoxia Yin ¹

¹ East China Normal University

² Zhongguancun Academy

³ Tsinghua University

0 citations · 36 references · arXiv

Published on arXiv

2511.07480

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

KG-DF improves defense against diverse jailbreak attacks while also improving general QA response quality compared to existing black-box defense baselines.

KG-DF

Novel technique introduced

As large language models (LLMs) are deployed across more fields, the security challenges they face have become increasingly prominent, especially jailbreak attacks. These attacks use crafted inputs to induce the model to generate erroneous or uncontrolled outputs, threatening both the generality and the security of the model. Although existing defense methods have shown some effectiveness, they often struggle to balance model generality against security: excessive defense may limit normal use of the model, while insufficient defense may leave security vulnerabilities. To address this problem, we propose a Knowledge Graph Defense Framework (KG-DF). Because a knowledge graph (KG) offers structured knowledge representation and semantic association, input content can be matched against safety knowledge in the knowledge base, which identifies potentially harmful intentions and provides safe reasoning paths. However, traditional KG methods encounter significant challenges in keyword extraction, particularly when confronted with diverse and evolving attack strategies. To address this issue, we introduce an extensible semantic parsing module whose core task is to transform the input query into a set of structured and secure concept representations, thereby enhancing the relevance of the matching process. Experimental results show that our framework enhances defense performance against various jailbreak attack methods, while also improving the response quality of the LLM in general QA scenarios by incorporating domain-general knowledge.
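The retrieval step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy triples in `SAFETY_KG`, the `refuse_instructions` policy node, and the helper names `retrieve_paths` and `build_guarded_prompt` are hypothetical. The sketch assumes concepts have already been extracted from the query, follows outgoing edges in a small graph, and prepends the retrieved facts to the prompt that would be sent to a black-box LLM.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


# Hypothetical toy safety knowledge graph (illustrative only).
SAFETY_KG = [
    Triple("explosive", "is_a", "dangerous_item"),
    Triple("dangerous_item", "policy", "refuse_instructions"),
    Triple("aspirin", "is_a", "medication"),
]


def retrieve_paths(concept, kg, max_hops=2):
    """Collect triples reachable from `concept` within `max_hops` outgoing edges."""
    paths = []
    frontier = [concept]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for triple in kg:
                if triple.head == node:
                    paths.append(triple)
                    next_frontier.append(triple.tail)
        frontier = next_frontier
    return paths


def build_guarded_prompt(query, concepts, kg):
    """Return (augmented prompt, flagged) for a query and its extracted concepts."""
    facts = []
    for concept in concepts:
        for t in retrieve_paths(concept, kg):
            fact = f"{t.head} {t.relation} {t.tail}"
            if fact not in facts:
                facts.append(fact)
    # Assumption: reaching the policy node marks the query as potentially harmful.
    flagged = any(fact.endswith("refuse_instructions") for fact in facts)
    context = "\n".join(facts) if facts else "(no matching knowledge)"
    prompt = f"Retrieved knowledge:\n{context}\n\nUser query: {query}"
    return prompt, flagged
```

Because the knowledge is injected through the prompt, a design like this needs no access to model weights or gradients, which is what makes a defense of this kind black-box.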

Key Contributions

KG-DF: a black-box defense framework that matches LLM inputs against a security knowledge graph to identify harmful intent without requiring model internals
Extensible semantic parsing module that converts input queries into structured concept representations, handling syntactic obfuscation (e.g., 'b0mb') that defeats traditional keyword-based KG extraction
Dual benefit of improved jailbreak defense and enhanced general QA quality by incorporating domain-general knowledge into the knowledge graph
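To make the obfuscation problem in the second contribution concrete, the following sketch shows why exact keyword lookup misses a token such as 'b0mb' and how a normalization step recovers the underlying concept. This is a simple rule-based stand-in, not the paper's semantic parsing module, whose design is not described in this summary. The character map `LEET_MAP`, the lexicon `CONCEPT_LEXICON`, and the function names are all hypothetical.

```python
import re

# Hypothetical substitution table for common character-level obfuscation.
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)

# Hypothetical lexicon mapping surface words to KG concept names.
CONCEPT_LEXICON = {
    "bomb": "explosive",
    "explosive": "explosive",
    "malware": "malicious_software",
}


def normalize_token(token):
    """Lowercase a token and undo simple character substitutions."""
    return token.lower().translate(LEET_MAP)


def extract_concepts(query):
    """Map a raw query to a de-duplicated list of KG concept names."""
    tokens = re.findall(r"[A-Za-z0-9@$]+", query)
    concepts = []
    for token in tokens:
        concept = CONCEPT_LEXICON.get(normalize_token(token))
        if concept and concept not in concepts:
            concepts.append(concept)
    return concepts
```

A naive lookup of the raw token `b0mb` in `CONCEPT_LEXICON` finds nothing, while `extract_concepts("How do I build a b0mb?")` maps it to the `explosive` concept. A fixed substitution table only covers obfuscations known in advance, which is presumably why the paper argues for an extensible parsing module instead.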

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_box, inference_time

Applications

llm safety, conversational ai, chatbot security

Similar Papers

AEGIS: Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema

Soft Instruction De-escalation Defense

A Call to Action for a Secure-by-Design Generative AI Paradigm

Towards Unsupervised Adversarial Document Detection in Retrieval Augmented Generation Systems

EASE: Practical and Efficient Safety Alignment for Small Language Models

Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token

The Cost of Thinking: Increased Jailbreak Risk in Large Language Models

LLM Reinforcement in Context