Benchmark · 2026

AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents

Jiaqi Wu 1, Yuchen Zhou 1, Muduo Xu 2, Zisheng Liang 1, Simiao Ren 3, Jiayu Xue 4, Meige Yang 5, Siying Chen 1, Jingheng Huan 1

1 citation · 41 references · arXiv (Cornell University)

Published on arXiv

2602.20569

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

All benchmarked detectors collapse on AI-forged documents: DocTamper drops from AUC=0.98 (in-distribution) to 0.563 with pixel IoU=0.020, and GPT-4o reaches only AUC=0.509 — essentially random chance.

AIForge-Doc

Novel technique introduced


We present AIForge-Doc, the first benchmark dedicated exclusively to diffusion-model-based inpainting forgery in financial and form documents, annotated at the pixel level. Existing document forgery datasets rely on traditional digital editing tools (e.g., Adobe Photoshop, GIMP), leaving a critical gap: state-of-the-art detectors are blind to the rapidly growing threat of AI-forged document fraud. AIForge-Doc addresses this gap by systematically forging numeric fields in real-world receipt and form images using two AI inpainting APIs -- Gemini 2.5 Flash Image and Ideogram v2 Edit -- yielding 4,061 forged images from four public document datasets (CORD, WildReceipt, SROIE, XFUND) across nine languages, annotated with pixel-precise tampered-region masks in DocTamper-compatible format. We benchmark three representative detectors -- TruFor, DocTamper, and a zero-shot GPT-4o judge -- and find that all degrade substantially: TruFor achieves AUC=0.751 (zero-shot, out-of-distribution) vs. AUC=0.96 on NIST16; DocTamper achieves AUC=0.563 vs. AUC=0.98 in-distribution, with pixel-level IoU=0.020; and GPT-4o achieves only AUC=0.509 -- essentially chance -- confirming that AI-forged values are indistinguishable to current automated detectors and VLMs. These results establish AIForge-Doc as a qualitatively new and unsolved challenge for document forensics.
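The two metrics the abstract reports, image-level ROC AUC and pixel-level IoU of the predicted tamper mask against the ground-truth mask, can be sketched in a few lines. This is an illustrative implementation with toy data, not the paper's evaluation code; function names are my own.

```python
# Sketch of the two metrics cited above: image-level AUC over detector
# scores, and pixel-level IoU between binary tamper masks.

def pixel_iou(pred_mask, gt_mask):
    """IoU between two binary masks given as flat lists of 0/1."""
    inter = sum(p & g for p, g in zip(pred_mask, gt_mask))
    union = sum(p | g for p, g in zip(pred_mask, gt_mask))
    return inter / union if union else 1.0

def auc(scores, labels):
    """ROC AUC via the Mann-Whitney U statistic (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: detector scores on 4 forged (label 1) and 4 pristine images.
scores = [0.9, 0.6, 0.55, 0.4, 0.5, 0.45, 0.3, 0.1]
labels = [1,   1,   1,    1,   0,   0,    0,   0]
print(auc(scores, labels))               # 0.875
print(pixel_iou([1, 1, 0, 0], [0, 1, 1, 0]))  # 1/3
```

An AUC of 0.5, as GPT-4o scores here, means the detector's scores rank forged and pristine images no better than a coin flip; an IoU of 0.020 means the predicted tamper region barely overlaps the true one.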


Key Contributions

  • First benchmark (AIForge-Doc) targeting diffusion-model-based inpainting forgery in financial/form documents, with 4,061 forged images across nine languages and pixel-level tampered-region annotations in DocTamper-compatible format
  • Systematic forgery pipeline using two production AI inpainting APIs (Gemini 2.5 Flash Image, Ideogram v2 Edit) applied to four real-world document datasets (CORD, WildReceipt, SROIE, XFUND)
  • Empirical evidence that all existing detectors (TruFor AUC=0.751, DocTamper AUC=0.563 with IoU=0.020, GPT-4o AUC=0.509) fail substantially on AI-forged documents, establishing a new unsolved challenge for document forensics
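The forgery pipeline in the second contribution can be sketched as: pick a numeric field's bounding box from the source dataset's OCR annotations, rasterize it into a binary mask, and ask an inpainting model to rewrite only the masked region. The real pipeline calls the Gemini 2.5 Flash Image and Ideogram v2 Edit APIs; here `inpaint` is a stand-in stub, the bbox format mirrors common OCR-dataset annotations (x0, y0, x1, y1), and all names and the prompt wording are illustrative assumptions.

```python
# Hedged sketch of the mask-based forgery pipeline; `fake_inpaint` stands in
# for a production inpainting API such as Gemini 2.5 Flash Image.

def field_mask(width, height, bbox):
    """Binary mask (row-major nested lists): 1 inside the field's bbox."""
    x0, y0, x1, y1 = bbox
    return [[1 if x0 <= x < x1 and y0 <= y < y1 else 0
             for x in range(width)] for y in range(height)]

def forge_field(image, mask, new_value, inpaint):
    """Ask an inpainting model to rewrite only the masked numeric field."""
    prompt = (f"Replace the highlighted text with '{new_value}', "
              "matching the surrounding font and background.")
    return inpaint(image=image, mask=mask, prompt=prompt)

def fake_inpaint(image, mask, prompt):
    """Stub: a real API would return an edited image."""
    return {"image": image, "edited_pixels": sum(map(sum, mask))}

mask = field_mask(8, 4, (2, 1, 6, 3))  # a 4x2 numeric field in an 8x4 image
out = forge_field("receipt.png", mask, "1,250.00", fake_inpaint)
print(out["edited_pixels"])  # 8 pixels inside the forged field
```

The same mask doubles as the ground-truth annotation, which is how pixel-precise tampered-region labels in DocTamper-compatible format come for free from this pipeline.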

🛡️ Threat Analysis

Output Integrity Attack

The paper is centrally about detecting AI-generated content — specifically diffusion-model-based inpainting used to forge numeric fields in financial documents. The benchmark evaluates whether state-of-the-art output integrity detectors (TruFor, DocTamper, GPT-4o) can identify AI-forged content, which is squarely within ML09's scope of AI-generated content detection and output authenticity verification. Existing forensic tools trained on traditional edits are effectively blind to diffusion-model outputs (AUC near chance), exposing a critical gap in output integrity tooling.
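The zero-shot GPT-4o judge baseline amounts to sending the document image to a VLM and asking for a forgery score. A minimal sketch of such a request follows the OpenAI chat-completions message format for image inputs; the exact prompt used in the paper is not reproduced here, so the wording below is an assumption.

```python
# Sketch of a zero-shot VLM judge request in OpenAI chat-completions shape.
# Prompt text is illustrative, not the paper's.
import base64
import json

def judge_request(image_bytes, model="gpt-4o"):
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Is any region of this document image AI-forged? "
                         "Answer with a probability in [0, 1]."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

req = judge_request(b"\x89PNG")  # placeholder bytes, illustration only
print(json.dumps(req)[:40])
```

Scoring the returned probabilities across forged and pristine images is what yields the AUC=0.509 figure: the judge's answers carry essentially no signal about diffusion-based inpainting.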


Details

Domains
vision
Model Types
diffusion · vlm · transformer
Threat Tags
inference_time · digital
Datasets
CORD · WildReceipt · SROIE · XFUND · NIST16
Applications
document forgery detection · financial document forensics · form integrity verification