defense 2025

Robust Detection of Synthetic Tabular Data under Schema Variability

G. Charbel N. Kindji 1,2, Elisa Fromont 2, Lina Maria Rojas-Barahona 1, Tanguy Urvoy 1


Published on arXiv: 2509.00092

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

The proposed datum-wise transformer improves AUC and accuracy by 7 points over the only prior baseline, with the table-adaptation component yielding an additional 7 accuracy points under cross-table schema shift.

Datum-wise Transformer

Novel technique introduced


The rise of powerful generative models has sparked concerns over data authenticity. While detection methods have been extensively developed for images and text, the case of tabular data, despite its ubiquity, has been largely overlooked. Yet, detecting synthetic tabular data is especially challenging due to its heterogeneous structure and unseen formats at test time. We address the underexplored task of detecting synthetic tabular data "in the wild", i.e., when the detector is deployed on tables with variable and previously unseen schemas. We introduce a novel datum-wise transformer architecture that significantly outperforms the only previously published baseline, improving both AUC and accuracy by 7 points. By incorporating a table-adaptation component, our model gains an additional 7 accuracy points, demonstrating enhanced robustness. This work provides the first strong evidence that detecting synthetic tabular data in real-world conditions is feasible, and demonstrates substantial improvements over previous approaches. Following acceptance of the paper, we are finalizing the administrative and licensing procedures necessary for releasing the source code. This extended version will be updated as soon as the release is complete.
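For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how accuracy and ROC AUC are typically computed for a binary real-vs-synthetic detector. The labels and scores are made-up toy values, not results from the paper, and the function names are hypothetical. Both metrics lie in [0, 1], so a gain of "7 points" presumably corresponds to a difference of 0.07 on this scale.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of rows whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def roc_auc(labels, scores):
    """Probability that a random synthetic row outscores a random real row.

    Computed by exhaustive pairwise comparison; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (illustrative only): 1 = synthetic row, 0 = real row.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]

toy_accuracy = accuracy(toy_labels, toy_scores)  # 4 of 6 rows correct
toy_auc = roc_auc(toy_labels, toy_scores)        # 7 of 9 pairs ranked correctly
```

Accuracy depends on the chosen threshold, whereas AUC only depends on how the detector ranks rows, which is why the two are usually reported together.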


Key Contributions

  • Novel datum-wise transformer architecture that is table-agnostic and invariant to column permutations, enabling deployment on tables with previously unseen schemas
  • Table-adaptation component that provides an additional 7 accuracy points of robustness under cross-table schema shift
  • First strong evidence that detecting synthetic tabular data in real-world variable-schema conditions is feasible, outperforming the prior baseline by 7 points in both AUC and accuracy
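To make the "table-agnostic and invariant to column permutations" property concrete, here is a minimal sketch of one way such invariance can be obtained: embed each (column name, value) cell independently, then pool with an order-independent mean. This is an illustration only and not the paper's datum-wise transformer; the hashing-based `embed_cell`, the width `DIM`, and the example rows are all assumptions introduced for this sketch.

```python
import hashlib

DIM = 16  # hypothetical embedding width, chosen only for this sketch


def embed_cell(column, value):
    """Map one (column name, value) pair to a fixed-size pseudo-embedding.

    A hash stands in for a learned encoder (an assumption of this sketch).
    """
    token = f"{column}={value}".encode("utf-8")
    digest = hashlib.sha256(token).digest()
    # Rescale the first DIM bytes to floats in [-1, 1).
    return [(b - 128) / 128 for b in digest[:DIM]]


def encode_row(row):
    """Mean-pool the cell embeddings of one row.

    The mean does not depend on the order of the cells, so the result is
    invariant to column permutations, and its size is DIM for any schema.
    """
    cells = [embed_cell(col, val) for col, val in row.items()]
    return [sum(dims) / len(cells) for dims in zip(*cells)]


# Made-up rows with different schemas (illustrative values only).
row_a = {"age": 39, "education": "Bachelors", "hours-per-week": 40}
row_a_shuffled = {"hours-per-week": 40, "age": 39, "education": "Bachelors"}
row_b = {"Sex": "M", "Length": 0.455, "Diameter": 0.365, "Rings": 15}

same_after_shuffle = encode_row(row_a) == encode_row(row_a_shuffled)
same_width = len(encode_row(row_a)) == len(encode_row(row_b)) == DIM
```

Because every row maps to a vector of the same width regardless of how many columns it has, a downstream classifier can be applied to tables whose schemas were never seen during training. A transformer without positional encodings over the cell tokens achieves the same kind of invariance with a learned, more expressive pooling.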

🛡️ Threat Analysis

Output Integrity Attack

The paper's primary contribution is a novel architecture for detecting AI-generated (synthetic) tabular data — a form of AI-generated content detection. It introduces a new detection methodology (datum-wise transformer + table-adaptation) rather than merely applying existing detectors to a new domain, analogous to novel deepfake or AI-text detection architectures that qualify under ML09.


Details

Domains
tabular
Model Types
transformer
Threat Tags
inference_time
Datasets
Adult, Insurance, Higgs, Abalone
Applications
synthetic tabular data detection, data authenticity verification