
DeepForgeSeal: Latent Space-Driven Semi-Fragile Watermarking for Deepfake Detection Using Multi-Agent Adversarial Reinforcement Learning

Tharindu Fernando, Clinton Fookes, Sridha Sridharan

0 citations · 51 references · arXiv

Published on arXiv: 2511.04949

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves improvements of over 4.5% on CelebA and more than 5.3% on CelebA-HQ compared to state-of-the-art proactive deepfake detection methods under challenging manipulation scenarios.

DeepForgeSeal

Novel technique introduced


Rapid advances in generative AI have led to increasingly realistic deepfakes, posing growing challenges for law enforcement and public trust. Existing passive deepfake detectors struggle to keep pace, largely due to their dependence on specific forgery artifacts, which limits their ability to generalize to new deepfake types. Proactive deepfake detection using watermarks has emerged to address the challenge of identifying high-quality synthetic media. However, these methods often struggle to balance robustness against benign distortions with sensitivity to malicious tampering. This paper introduces a novel deep learning framework that harnesses high-dimensional latent space representations and the Multi-Agent Adversarial Reinforcement Learning (MAARL) paradigm to develop a robust and adaptive watermarking approach. Specifically, we develop a learnable watermark embedder that operates in the latent space, capturing high-level image semantics, while offering precise control over message encoding and extraction. The MAARL paradigm empowers the learnable watermarking agent to pursue an optimal balance between robustness and fragility by interacting with a dynamic curriculum of benign and malicious image manipulations simulated by an adversarial attacker agent. Comprehensive evaluations on the CelebA and CelebA-HQ benchmarks reveal that our method consistently outperforms state-of-the-art approaches, achieving improvements of over 4.5% on CelebA and more than 5.3% on CelebA-HQ under challenging manipulation scenarios.
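The abstract's core idea of embedding a message in a latent representation, with precise control over encoding and extraction, can be illustrated with a toy sketch. Everything below is an illustrative assumption: the "encoder" is a stand-in (quadrant mean pooling), and the carrier-direction embedding is a minimal scheme, not the paper's learned embedder.

```python
# Toy sketch of latent-space watermarking: map the image to a latent
# vector, shift each latent dimension by one message bit, and extract
# bits by comparing against the clean latent. Illustrative only; the
# paper's embedder and extractor are learned deep networks.

def encode_latent(image):
    """Stub encoder: 4-dim 'latent' = means of four image quadrants."""
    n = len(image) // 4
    return [sum(image[i * n:(i + 1) * n]) / n for i in range(4)]

def embed(latent, bits, strength=0.1):
    """Shift each latent dimension up/down according to one message bit."""
    return [z + strength * (1 if b else -1) for z, b in zip(latent, bits)]

def extract(marked, reference):
    """Recover bits by comparing marked latent against the clean latent."""
    return [1 if m > r else 0 for m, r in zip(marked, reference)]

image = [0.2, 0.4, 0.6, 0.8, 0.1, 0.3, 0.5, 0.7]
bits = [1, 0, 1, 1]
z = encode_latent(image)
z_marked = embed(z, bits)
print(extract(z_marked, z))  # [1, 0, 1, 1]
```

Operating on a semantic latent rather than raw pixels is what lets the watermark survive low-level benign distortions while reacting to semantic edits.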


Key Contributions

  • Learnable watermark embedder operating in high-dimensional latent space for precise control over message encoding and extraction
  • Multi-Agent Adversarial Reinforcement Learning (MAARL) paradigm that trains the watermarking agent via a dynamic curriculum of benign and malicious manipulations simulated by an adversarial attacker agent
  • Outperforms state-of-the-art proactive deepfake detection methods by over 4.5% on CelebA and 5.3% on CelebA-HQ under challenging manipulation scenarios
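The adversarial curriculum in the second contribution can be sketched as a simple loop: an attacker agent samples a manipulation (benign or malicious, with harder stages favouring malicious edits), and the watermarking agent is rewarded for matching the semi-fragile goal. All names, attack pools, and numbers below are illustrative assumptions, not the paper's actual MAARL algorithm.

```python
import random

# Toy MAARL-style curriculum: the attacker agent picks a manipulation,
# and the watermark agent's reward is +1 when its behaviour matches the
# semi-fragile objective (survive benign, break under malicious).
BENIGN = ["jpeg", "resize", "blur"]
MALICIOUS = ["face_swap", "attribute_edit"]

def attacker_pick(difficulty):
    """Dynamic curriculum: higher difficulty favours malicious edits."""
    pool = MALICIOUS if random.random() < difficulty else BENIGN
    return random.choice(pool), pool is MALICIOUS

def watermark_reward(survived, is_malicious):
    """+1 when behaviour matches the semi-fragile goal, -1 otherwise."""
    desired_survival = not is_malicious
    return 1.0 if survived == desired_survival else -1.0

random.seed(0)
total = 0.0
for step in range(100):
    difficulty = step / 100           # curriculum ramps up over training
    attack, is_malicious = attacker_pick(difficulty)
    survived = not is_malicious       # an ideal semi-fragile watermark
    total += watermark_reward(survived, is_malicious)
print(total)  # 100.0 for the ideal watermark
```

The opposing rewards are what force a single watermark to sit at the robustness/fragility boundary instead of maximising only one side.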

🛡️ Threat Analysis

Output Integrity Attack

Embeds watermarks in image content (not model weights) to proactively detect deepfake tampering — directly addresses output integrity and content provenance authentication. The semi-fragile design survives benign distortions but breaks under malicious deepfake manipulation, which is a classic output integrity use case.
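The semi-fragile decision described above reduces, at verification time, to comparing the recovered watermark against the embedded message. A minimal sketch, assuming a bit-error-rate (BER) threshold; the threshold value and decision rule here are illustrative assumptions, not the paper's.

```python
# Hypothetical semi-fragile verification rule: low BER after benign
# distortion -> authentic; high BER after malicious tampering -> forged.

def bit_error_rate(embedded, extracted):
    """Fraction of mismatched bits between two equal-length bit lists."""
    assert len(embedded) == len(extracted)
    return sum(a != b for a, b in zip(embedded, extracted)) / len(embedded)

def verify(embedded, extracted, tamper_threshold=0.25):
    """Classify an image from its recovered watermark. A semi-fragile
    watermark yields a low BER after benign distortions (e.g. JPEG
    compression) and a high BER after malicious edits (e.g. face swaps),
    so a single threshold separates the two regimes."""
    ber = bit_error_rate(embedded, extracted)
    return "authentic" if ber < tamper_threshold else "tampered"

msg = [1, 0, 1, 1, 0, 0, 1, 0]
benign = [1, 0, 1, 1, 0, 0, 1, 1]   # 1 bit flipped by compression
forged = [0, 1, 0, 0, 1, 0, 1, 1]   # heavily corrupted by manipulation
print(verify(msg, benign))  # authentic
print(verify(msg, forged))  # tampered
```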


Details

Domains
vision, generative
Model Types
rl, generative
Threat Tags
digital, training_time
Datasets
CelebA, CelebA-HQ
Applications
deepfake detection, image authentication, digital forensics