Attention to Detail: Global-Local Attention for High-Resolution AI-Generated Image Detection
Published on arXiv
2601.00141
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
GLASS outperforms standard transfer learning across ViT, ResNet, and ConvNeXt backbones on AI-generated image detection while remaining computationally feasible by sampling crops rather than exhaustively tiling the full image.
Novel technique introduced
GLASS (Global-Local Attention with Stratified Sampling)
The rapid development of generative AI has made AI-generated images increasingly realistic and high-resolution. Most AI-generated image detection architectures downsample images before feeding them to the model, risking the loss of fine-grained details. This paper presents GLASS (Global-Local Attention with Stratified Sampling), an architecture that combines a globally resized view with multiple randomly sampled local crops. These crops are original-resolution regions efficiently selected through spatially stratified sampling and aggregated using attention-based scoring. GLASS can be integrated into vision models to leverage both global and local information in images of any size. Vision Transformer, ResNet, and ConvNeXt models are used as backbones, and experiments show that GLASS outperforms standard transfer learning by achieving higher predictive performance within feasible computational constraints.
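To make the idea of spatially stratified crop sampling concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function name `stratified_crop_boxes`, the square grid of strata, and the clamping rule at the image border are all assumptions made for illustration. The sketch splits the image into a grid, picks distinct cells at random, and draws one original-resolution crop location per chosen cell, so crops are spread across the image instead of clustering.

```python
import random


def stratified_crop_boxes(width, height, crop, n_crops, grid=4, seed=None):
    """Return n_crops boxes (left, top, right, bottom) of size crop x crop.

    Hypothetical helper: each box's top-left corner is drawn from a distinct
    cell of a grid x grid partition, then clamped to keep the box in bounds.
    """
    if crop > width or crop > height:
        raise ValueError("crop must fit inside the image")
    if n_crops > grid * grid:
        raise ValueError("n_crops cannot exceed the number of grid cells")

    rng = random.Random(seed)
    all_cells = [(r, c) for r in range(grid) for c in range(grid)]
    chosen = rng.sample(all_cells, n_crops)  # distinct strata, no repeats

    boxes = []
    for r, c in chosen:
        # Pixel bounds of this stratum.
        x0, x1 = c * width / grid, (c + 1) * width / grid
        y0, y1 = r * height / grid, (r + 1) * height / grid
        # Draw the corner inside the stratum, then clamp so the crop
        # never extends past the right or bottom image edge.
        left = min(int(rng.uniform(x0, x1)), width - crop)
        top = min(int(rng.uniform(y0, y1)), height - crop)
        boxes.append((left, top, left + crop, top + crop))
    return boxes


boxes = stratified_crop_boxes(4096, 3072, crop=224, n_crops=8, grid=4, seed=0)
```

Each returned box can be passed to an image library's crop routine on the full-resolution image, while the global view is produced separately by resizing the whole image. Sampling a fixed number of crops keeps the cost independent of image size, which is the trade-off against exhaustive tiling that the summary describes.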
Key Contributions
- GLASS architecture: a two-stream global+local design using spatially stratified random crop sampling at original resolution to preserve fine-grained details lost by standard downsampling
- Attention-based aggregation mechanism (additive attention scoring) that weights local crops by informativeness before combining with the global feature stream
- Comprehensive evaluation of GLASS across three backbone families (ViT, ResNet, ConvNeXt) showing consistent improvement over standard transfer learning baselines
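The attention-based aggregation in the contributions above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it uses the common additive-attention form, score = v · tanh(W h + b), followed by a softmax over crops, and it assumes the pooled local feature is fused with the global feature by concatenation, which the summary does not specify. The names `additive_attention_pool` and `fuse`, and the parameters `W`, `b`, and `v`, are hypothetical.

```python
import math


def additive_attention_pool(local_feats, W, b, v):
    """Pool per-crop feature vectors with additive attention.

    local_feats: list of crop feature vectors (each of length d)
    W: list of rows, shape (hidden, d); b: length hidden; v: length hidden
    Returns (pooled feature of length d, attention weights over crops).
    """
    scores = []
    for h in local_feats:
        # Hidden projection of this crop's feature: tanh(W h + b).
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(W, b)
        ]
        # Scalar informativeness score for this crop: v . hidden.
        scores.append(sum(vj * uj for vj, uj in zip(v, hidden)))

    # Numerically stable softmax over the crop scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-weighted sum of the crop features.
    dim = len(local_feats[0])
    pooled = [sum(a * h[k] for a, h in zip(weights, local_feats)) for k in range(dim)]
    return pooled, weights


def fuse(global_feat, pooled_local):
    """Assumed fusion step: concatenate the global and pooled local features."""
    return list(global_feat) + list(pooled_local)


crops = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = additive_attention_pool(
    crops, W=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0], v=[1.0, 1.0]
)
fused = fuse([0.5, 0.5], pooled)
```

In a real model the crop features would come from the shared backbone (ViT, ResNet, or ConvNeXt), and `W`, `b`, and `v` would be learned jointly with the classifier head. The fused vector would then feed the real-versus-generated classifier.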
🛡️ Threat Analysis
The paper directly contributes a novel AI-generated image detection architecture (GLASS). Its entire contribution is determining whether images are AI-generated, which falls squarely under output integrity and synthetic content detection in ML09. This is a new detection architecture, not a domain application of an existing detector.