defense 2026

TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection

Ahmed Abdullah , Nikolas Ebert , Oliver Wasenmüller

Published on arXiv

2604.26772

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Out of the box, the best VFM outperforms the original CLIP by more than 12% in accuracy on AI-generated image detection; combined with TAP, the latest VFMs set a new state of the art on two challenging in-the-wild benchmarks

TAP (Tunable Attention Pooling)

Novel technique introduced


Recent methods demonstrate that large-scale pretrained models, such as CLIP vision transformers, effectively detect AI-generated images (AIGIs) from unseen generative models when used as feature extractors. Many state-of-the-art methods for AI-generated image detection build upon the original CLIP-ViT to enhance this generalization. Since CLIP's release, numerous vision foundation models (VFMs) have emerged, incorporating architectural improvements and different training paradigms. Despite these advances, their potential for AIGI detection and AI image forensics remains largely unexplored. In this work, we present a comprehensive benchmark across multiple VFM families, covering diverse pretraining objectives, input resolutions, and model scales. We systematically evaluate their out-of-the-box performance for detecting fully AI-generated images and AI-inpainted images, and discover that the best model outperforms the original CLIP by more than 12% in accuracy, beating established approaches in the process. To fully leverage the features of a modern VFM, we propose a simple redesign of the classifier head by utilizing tunable attention pooling (TAP), which aggregates output tokens into a refined global representation. Integrating TAP with the latest VFMs yields substantial performance gains across several AIGI detection benchmarks, establishing a new state-of-the-art on two challenging benchmarks for in-the-wild detection of AI-generated and AI-inpainted images.
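
The abstract describes TAP only as a classifier head that aggregates a VFM's output tokens into a global representation via attention pooling; the exact design is not given here. The following is a minimal NumPy sketch of a generic attention-pooling head under that description, not the authors' implementation. All names (`attention_pool`, `classify`, `query`, `w_k`, `w_v`, `w_out`) and the single-query, single-head layout are assumptions made for illustration, and random values stand in for weights that would be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(tokens, query, w_k, w_v):
    """Pool output tokens with one query vector (hypothetical layout).

    tokens: (B, N, D) output tokens from a frozen backbone
    query:  (D,)      pooling query (would be learned)
    w_k, w_v: (D, D)  key / value projections (would be learned)
    Returns pooled features (B, D) and attention weights (B, N).
    """
    keys = tokens @ w_k                                  # (B, N, D)
    values = tokens @ w_v                                # (B, N, D)
    scores = keys @ query / np.sqrt(tokens.shape[-1])    # (B, N)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    pooled = np.einsum("bn,bnd->bd", weights, values)    # weighted sum of values
    return pooled, weights

def classify(pooled, w_out, b_out):
    # Linear layer + sigmoid: probability that an image is AI-generated.
    logits = pooled @ w_out + b_out                      # (B,)
    return 1.0 / (1.0 + np.exp(-logits))

# Toy example: random arrays stand in for backbone tokens and head weights.
rng = np.random.default_rng(0)
B, N, D = 2, 16, 8
tokens = rng.normal(size=(B, N, D))
query = rng.normal(size=D)
w_k = rng.normal(size=(D, D)) / np.sqrt(D)
w_v = rng.normal(size=(D, D)) / np.sqrt(D)
w_out = rng.normal(size=D)
b_out = 0.0

pooled, weights = attention_pool(tokens, query, w_k, w_v)
probs = classify(pooled, w_out, b_out)
```

In this sketch the backbone stays fixed and only the head parameters would be trained. With an all-zero query the attention weights become uniform, so the head reduces to mean pooling over the projected tokens; a trained query can instead weight some tokens more heavily than others.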


Key Contributions

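  • Presents a comprehensive benchmark of vision foundation models (VFMs) for AI-generated image detection across multiple model families, pretraining objectives, input resolutions, and model scales; the best VFM outperforms the original CLIP by more than 12% in accuracy out of the box
  • Proposes a tunable attention pooling (TAP) classifier head that aggregates output tokens into a refined global representation for improved detection performance
  • Achieves a new state of the art on two in-the-wild benchmarks for detecting AI-generated and AI-inpainted images by integrating TAP with the latest VFMs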

🛡️ Threat Analysis

Output Integrity Attack

The core contribution is detecting AI-generated and AI-inpainted images to verify content authenticity and provenance, which directly addresses output integrity for synthetic visual content.


Details

Domains
vision, multimodal
Model Types
transformer, multimodal
Threat Tags
inference_time
Applications
ai-generated image detection, ai-inpainted image detection, synthetic media forensics