benchmark 2026

Deepfake detectors are DUMB: A benchmark to assess adversarial training robustness under transferability constraints

Adrian Serrano, Erwan Umlil, Ronan Thomas

0 citations · 20 references · arXiv


Published on arXiv · 2601.05986

Input Manipulation Attack

OWASP ML Top 10 — ML01

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Adversarial training reliably improves in-distribution robustness of deepfake detectors but degrades cross-dataset performance under certain strategies, revealing a generalization-robustness tradeoff in realistic deployment.

DUMB/DUMBer (extended to deepfake detection)

Novel technique introduced


Deepfake detection systems deployed in real-world environments are subject to adversaries capable of crafting imperceptible perturbations that degrade model performance. While adversarial training is a widely adopted defense, its effectiveness under realistic conditions, where attackers operate with limited knowledge and mismatched data distributions, remains underexplored. In this work, we extend the DUMB (Dataset soUrces, Model architecture, and Balance) and DUMBer methodologies to deepfake detection. We evaluate detectors' robustness against adversarial attacks under transferability constraints and cross-dataset configurations to extract real-world insights. Our study spans five state-of-the-art detectors (RECCE, SRM, XCeption, UCF, SPSL), three attacks (PGD, FGSM, FPBA), and two datasets (FaceForensics++ and Celeb-DF-V2). We analyze both attacker and defender perspectives, mapping results to mismatch scenarios. Experiments show that adversarial training strategies reinforce robustness in in-distribution cases but can also degrade it under cross-dataset configurations, depending on the strategy adopted. These findings highlight the need for case-aware defense strategies in real-world applications exposed to adversarial attacks.


Key Contributions

  • Extends the DUMB/DUMBer evaluation methodology to deepfake detection, enabling systematic robustness assessment under transferability and cross-dataset constraints.
  • Comprehensive evaluation of five state-of-the-art deepfake detectors (RECCE, SRM, XCeption, UCF, SPSL) against three adversarial attacks (PGD, FGSM, FPBA) across two datasets (FaceForensics++, Celeb-DF-V2).
  • Finds that adversarial training improves in-distribution robustness but can degrade cross-dataset performance depending on strategy, motivating case-aware defenses.
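The adversarial training evaluated here follows the standard recipe: at each step, attack the current model and train on a mix of clean and perturbed examples. The following is a minimal NumPy sketch of that loop on a toy linear "detector" with an FGSM inner attack; the model, data shapes, and the names `fgsm` and `adversarial_train` are illustrative, not the paper's actual detectors or training setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, w, b, y, eps):
    # Gradient of the binary cross-entropy loss w.r.t. the input x
    # for a linear scorer sigmoid(x @ w + b).
    grad_x = (sigmoid(x @ w + b) - y)[:, None] * w[None, :]
    # Step eps in the sign of the gradient; keep features in [0, 1].
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

def adversarial_train(x, y, eps=0.05, lr=0.5, epochs=300):
    """Gradient descent on a 50/50 mix of clean and FGSM-perturbed examples."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(epochs):
        x_adv = fgsm(x, w, b, y, eps)          # attack the current model
        x_mix = np.vstack([x, x_adv])
        y_mix = np.concatenate([y, y])
        err = sigmoid(x_mix @ w + b) - y_mix   # logistic-loss residual
        w -= lr * x_mix.T @ err / len(y_mix)
        b -= lr * err.mean()
    return w, b
```

The tradeoff the paper reports arises because the inner attack (and the clean data) are drawn from the training distribution, so robustness learned this way need not transfer to a different dataset such as Celeb-DF-V2 when training on FaceForensics++.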

🛡️ Threat Analysis

Input Manipulation Attack

The core contribution evaluates imperceptible adversarial perturbations (PGD, FGSM, FPBA) against deepfake detection models at inference time, and assesses adversarial training as a defense under transferability (black-box) and cross-dataset mismatch scenarios.
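Two of the three attacks are standard gradient-based input manipulations. A minimal NumPy sketch of both, assuming a caller-supplied `grad_fn` that returns the loss gradient w.r.t. the input (the function names and parameters are illustrative; FPBA, a frequency-domain attack, is omitted):

```python
import numpy as np

def fgsm_step(x, grad, eps):
    """One FGSM step: move eps in the sign of the loss gradient, stay in [0, 1]."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

def pgd_attack(x, grad_fn, eps, alpha, steps):
    """PGD: iterated small FGSM steps, projected back into the L-inf eps-ball."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = fgsm_step(x_adv, grad_fn(x_adv), alpha)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection keeps the change imperceptible
    return x_adv
```

Under the paper's transferability constraint, `grad_fn` would come from a surrogate model, not the deployed detector, and the resulting `x_adv` is then transferred to the black-box target.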

Output Integrity Attack

The targeted systems are deepfake detection models — AI-generated content detectors — placing the benchmark squarely within the output integrity and content authenticity domain that ML09 covers.


Details

Domains
vision
Model Types
cnn, transformer
Threat Tags
white_box, black_box, inference_time, digital, untargeted
Datasets
FaceForensics++, Celeb-DF-V2
Applications
deepfake detection, facial forgery detection