defense 2025

RAVID: Retrieval-Augmented Visual Detection: A Knowledge-Driven Approach for AI-Generated Image Identification

Mamadou Keita¹, Wassim Hamidouche², Hessen Bougueffa Eutamene¹, Abdelmalik Taleb-Ahmed¹, Abdenour Hadid³



Published on arXiv (arXiv:2508.03967)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

RAVID achieves 93.85% average accuracy on UniversalFakeDetect and 80.27% under degradation conditions, outperforming C2P-CLIP (63.44%) by 16.8 percentage points under Gaussian blur and JPEG compression.

RAVID

Novel technique introduced


In this paper, we introduce RAVID, the first framework for AI-generated image detection that leverages visual retrieval-augmented generation (RAG). While RAG methods have shown promise in mitigating factual inaccuracies in foundation models, they have primarily focused on text, leaving visual knowledge underexplored. Meanwhile, existing detection methods, which struggle with generalization and robustness, often rely on low-level artifacts and model-specific features, limiting their adaptability. To address this, RAVID dynamically retrieves relevant images to enhance detection. Our approach utilizes a fine-tuned CLIP image encoder, RAVID CLIP, enhanced with category-related prompts to improve representation learning. We further integrate a vision-language model (VLM) to fuse retrieved images with the query, enriching the input and improving accuracy. Given a query image, RAVID generates an embedding using RAVID CLIP, retrieves the most relevant images from a database, and combines these with the query image to form an enriched input for a VLM (e.g., Qwen-VL or OpenFlamingo). Experiments on the UniversalFakeDetect benchmark, which covers 19 generative models, show that RAVID achieves state-of-the-art performance with an average accuracy of 93.85%. RAVID also outperforms traditional methods in terms of robustness, maintaining high accuracy even under image degradations such as Gaussian blur and JPEG compression. Specifically, RAVID achieves an average accuracy of 80.27% under degradation conditions, compared to 63.44% for the state-of-the-art model C2P-CLIP, demonstrating consistent improvements in both Gaussian blur and JPEG compression scenarios. The code will be publicly available upon acceptance.
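The retrieval step the abstract describes (embed the query with RAVID CLIP, pull the most similar reference images from a database, then hand query plus references to a VLM) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `embed` function is a stand-in for the fine-tuned CLIP encoder, and random unit vectors replace real image embeddings.

```python
import numpy as np

def embed(image, rng):
    # Placeholder for the RAVID CLIP image encoder: the paper fine-tunes
    # CLIP with category-related prompts to map each image to a vector.
    # Random unit vectors here serve purely to illustrate the flow.
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

def retrieve_top_k(query_emb, db_embs, k=3):
    # Cosine similarity against the reference database; embeddings are
    # unit-normalized, so a dot product suffices. Returns the indices of
    # the k most similar references, most similar first.
    sims = db_embs @ query_emb
    top = np.argsort(sims)[::-1][:k]
    return top, sims[top]

rng = np.random.default_rng(0)
db_embs = np.stack([embed(None, rng) for _ in range(100)])  # reference DB
query_emb = embed(None, rng)                                # query image

idx, sims = retrieve_top_k(query_emb, db_embs, k=3)
# The query image plus the k retrieved references would then be passed
# jointly to a VLM (e.g. Qwen-VL) to produce the real/fake verdict.
```

In the actual system the database holds embeddings of known real and generated images, so the retrieved neighbors give the VLM concrete visual evidence to compare the query against.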


Key Contributions

  • First visual RAG framework for AI-generated image detection, dynamically retrieving relevant reference images to enrich detection context
  • RAVID CLIP: fine-tuned CLIP encoder with category-related prompts for improved image representation in synthetic vs. real classification
  • VLM-based fusion of retrieved images with query image achieving 93.85% accuracy on UniversalFakeDetect (19 generative models) and 80.27% under image degradation vs. 63.44% for prior SOTA C2P-CLIP
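The robustness numbers above come from evaluating detectors on degraded inputs. A common way to build such a test set is to apply a Gaussian blur before detection; the sketch below shows a separable Gaussian blur in plain NumPy. The kernel radius and the single-pixel test image are illustrative choices, not values from the paper.

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    # 1-D Gaussian kernel, normalized to sum to 1.
    if radius is None:
        radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur(image, sigma=1.0):
    # Separable Gaussian blur: convolve each row, then each column.
    # This mimics the blur degradation used in robustness evaluations.
    k = gaussian_kernel(sigma)
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, image)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)
    return out

img = np.zeros((32, 32))
img[16, 16] = 1.0            # single bright pixel, to visualize spreading
blurred = blur(img, sigma=2.0)
```

JPEG compression, the other degradation in the benchmark, is usually applied with an image library (e.g. re-encoding at a chosen quality factor) rather than reimplemented by hand.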

🛡️ Threat Analysis

Output Integrity Attack

RAVID is a novel AI-generated image detection architecture; detecting synthetic and deepfake images is a core ML09 (Output Integrity Attack) concern. The paper proposes a new detection methodology (visual RAG + fine-tuned CLIP + VLM fusion) rather than applying existing tools to a new domain, qualifying as ML security research under the novel-detection-architecture criterion.


Details

Domains
vision, multimodal
Model Types
vlm, transformer
Threat Tags
inference_time
Datasets
UniversalFakeDetect
Applications
ai-generated image detection, deepfake detection, synthetic image forensics