
Published on arXiv: 2604.10717

Threat Categories (OWASP LLM Top 10)

  • LLM06 — Sensitive Information Disclosure
  • LLM01 — Prompt Injection

Key Finding

CanaryRAG, a novel defense technique introduced in this work, substantially reduces chunk recovery rates compared to state-of-the-art baselines while maintaining task performance and inference latency.


Retrieval-Augmented Generation (RAG) systems augment large language models with external knowledge, yet introduce a critical security vulnerability: RAG knowledge base leakage, wherein adversarial prompts can induce the model to divulge retrieved proprietary content. Recent studies reveal that such leakage can be executed through adaptive and iterative attack strategies (termed RAG extraction attacks), while effective countermeasures remain notably lacking. To bridge this gap, we propose CanaryRAG, a runtime defense mechanism inspired by stack canaries in software security. CanaryRAG embeds carefully designed canary tokens into retrieved chunks and reformulates RAG extraction defense as a dual-path runtime integrity game. Leakage is detected in real time whenever either the target or oracle path violates its expected canary behavior, including under adaptive suppression and obfuscation. Extensive evaluations against existing attacks demonstrate that CanaryRAG provides robust defense, achieving substantially lower chunk recovery rates than state-of-the-art baselines while imposing negligible impact on task performance and inference latency. Moreover, as a plug-and-play solution, CanaryRAG can be seamlessly integrated into arbitrary RAG pipelines without requiring retraining or structural modifications, offering a practical and scalable safeguard for proprietary data.
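The abstract's dual-path idea can be sketched in a few lines. The sketch below is an illustrative approximation, not the paper's actual implementation: all names (`embed_canary`, `leaked`, `suppressed`, the `CNRY` token format) are hypothetical. A random canary is attached to each retrieved chunk; the target path flags leakage if a canary ever appears in the model's answer, while the oracle path sends a probe that should echo the canaries, so their absence suggests the defense is being adaptively suppressed.

```python
import secrets

# Illustrative sketch of canary-based leakage detection; names and token
# format are assumptions, not CanaryRAG's actual design.
CANARY_PREFIX = "CNRY"

def embed_canary(chunk: str) -> tuple[str, str]:
    """Attach a random canary token to a retrieved chunk."""
    token = f"{CANARY_PREFIX}-{secrets.token_hex(4)}"
    return f"[{token}] {chunk}", token

def leaked(target_output: str, tokens: list[str]) -> bool:
    """Target path: any canary in the answer signals verbatim leakage."""
    return any(t in target_output for t in tokens)

def suppressed(oracle_output: str, tokens: list[str]) -> bool:
    """Oracle path: a probe response that should echo every canary;
    a missing canary suggests adaptive suppression of the defense."""
    return not all(t in oracle_output for t in tokens)

def integrity_violation(target_output: str, oracle_output: str,
                        tokens: list[str]) -> bool:
    """Flag leakage if either path deviates from its expected behavior."""
    return leaked(target_output, tokens) or suppressed(oracle_output, tokens)
```

In this toy version detection is plain substring matching; a real deployment would also need to handle obfuscated reproductions of the canary (paraphrase, character-level edits), which the paper addresses but this sketch does not.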


Key Contributions

  • Reframes RAG extraction as a runtime integrity violation and introduces canary-based detection inspired by software security stack canaries
  • Proposes CanaryRAG, a plug-and-play defense that embeds canary tokens in retrieved chunks and uses dual-path monitoring to detect leakage
  • Achieves substantially lower chunk recovery rates than baselines with negligible performance impact and no retraining required

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Applications
rag systems, enterprise assistants, customer support agents, agentic workflows