
Evaluating Concept Filtering Defenses against Child Sexual Abuse Material Generation by Text-to-Image Models

Ana-Maria Cretu 1, Klim Kireev 1,2, Amro Abdalla 3, Wisdom Obinna 3, Raphael Meier 4, Sarah Adel Bargal 3, Elissa M. Redmiles 3, Carmela Troncoso 1,2



Published on arXiv: 2512.05707

Data Poisoning Attack

OWASP ML Top 10 — ML02

Transfer Learning Attack

OWASP ML Top 10 — ML07

Key Finding

Concept filtering offers limited protection for closed-weight T2I models and no protection for open-weight models: prompting bypasses require few extra queries and fine-tuning on child images fully recovers filtered concepts.


We evaluate the effectiveness of child filtering to prevent the misuse of text-to-image (T2I) models to create child sexual abuse material (CSAM). First, we capture the complexity of preventing CSAM generation using a game-based security definition. Second, we show that current detection methods cannot remove all children from a dataset. Third, using an ethical proxy for CSAM (a child wearing glasses, hereafter, CWG), we show that even when only a small percentage of child images are left in the training dataset, there exist prompting strategies that generate CWG from a child-filtered T2I model using only a few more queries than when the model is trained on the unfiltered data. Fine-tuning the filtered model on child images further reduces the additional query overhead. We also show that reintroducing a concept is possible via fine-tuning even if filtering is perfect. Our results demonstrate that current filtering methods offer limited protection to closed-weight models and no protection to open-weight models, while reducing the generality of the model by hindering the generation of child-related concepts or changing their representation. We conclude by outlining challenges in conducting evaluations that establish robust evidence on the impact of AI safety mitigations for CSAM.


Key Contributions

  • Game-based formal security definition capturing the adversarial complexity of preventing CSAM generation from T2I models
  • Empirical benchmark showing current automated child detection methods cannot remove all children from large training datasets
  • Demonstration that prompting strategies can elicit filtered concepts with only marginally more queries than unfiltered models, and fine-tuning can fully reintroduce filtered concepts even under perfect upstream filtering

🛡️ Threat Analysis

Data Poisoning Attack

The paper's primary defense under evaluation is concept filtering — removing child images from training datasets (data sanitization). This sits squarely in the ML02 (data poisoning) defense space. The paper evaluates whether this data-level sanitization is sufficient to prevent harmful concept generation, showing that current methods fail to fully cleanse the training data.
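The failure mode here is simple to illustrate: any detector with less than perfect recall leaves residual concept images in the filtered set. A toy sketch with a hypothetical scorer (the 10% prevalence, 95% recall, and score ranges are illustrative assumptions, not figures from the paper):

```python
import random

def filter_dataset(dataset, detector, threshold=0.5):
    """Keep only images the detector scores below the threshold.
    Any image the detector misses (a false negative) survives filtering."""
    return [img for img in dataset if detector(img) < threshold]

# Toy corpus: each item is (contains_child, detector_score). A detector
# with 95% recall scores 5% of positives like negatives.
rng = random.Random(1)
dataset = []
for _ in range(10_000):
    contains_child = rng.random() < 0.10          # assume 10% prevalence
    detected = contains_child and rng.random() < 0.95
    score = rng.uniform(0.6, 1.0) if detected else rng.uniform(0.0, 0.4)
    dataset.append((contains_child, score))

kept = filter_dataset(dataset, detector=lambda item: item[1])
residual = sum(contains for contains, _ in kept)  # ~50 child images remain
```

Even this generous toy detector leaves dozens of positives in a 10k-image corpus; at LAION scale the residue is large enough for the prompting attacks the paper describes.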

Transfer Learning Attack

A significant contribution is demonstrating that fine-tuning a filtered open-weight T2I model on a small set of child images reintroduces the filtered concept — and that this works even when upstream filtering is perfect. This directly exploits the transfer learning / fine-tuning process as an attack vector, fitting ML07's scope.
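The mechanism can be caricatured with a one-parameter "concept head": filtering drives the parameter strongly negative, and a few gradient steps of fine-tuning on concept-positive examples push it back. This is a deliberately minimal logistic-regression stand-in, not the paper's diffusion fine-tuning pipeline; all names and constants are illustrative:

```python
import math

def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))

def finetune(weight, examples, lr=0.5, epochs=200):
    """Gradient ascent on log-likelihood for a one-parameter concept head:
    sigmoid(weight) models the chance the model renders the concept."""
    for _ in range(epochs):
        for label in examples:                     # label = 1: image shows concept
            weight += lr * (label - sigmoid(weight))
    return weight

# A 'perfectly filtered' model: the concept head is strongly suppressed.
filtered_weight = -6.0                             # sigmoid(-6) ~ 0.25% generation rate
recovered_weight = finetune(filtered_weight, examples=[1] * 10)
p_after = sigmoid(recovered_weight)                # concept recovered
```

The point of the toy: suppression via training-data removal leaves the parameter reachable, so anyone with the open weights and a handful of positive examples can undo the filter, which is why the paper concludes filtering offers no protection for open-weight models.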


Details

Domains
vision · generative
Model Types
diffusion
Threat Tags
training_time · inference_time · black_box · white_box
Datasets
LAION
Applications
text-to-image generation · content safety filtering