Robust Detection of Synthetic Tabular Data under Schema Variability
G. Charbel N. Kindji 1,2, Elisa Fromont 2, Lina Maria Rojas-Barahona 1, Tanguy Urvoy 1
Published on arXiv: 2509.00092
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
The proposed datum-wise transformer improves AUC and accuracy by 7 points over the only prior baseline, with the table-adaptation component yielding an additional 7 accuracy points under cross-table schema shift.
Datum-wise Transformer
Novel technique introduced
The rise of powerful generative models has sparked concerns over data authenticity. While detection methods have been extensively developed for images and text, the case of tabular data, despite its ubiquity, has been largely overlooked. Yet, detecting synthetic tabular data is especially challenging due to its heterogeneous structure and unseen formats at test time. We address the underexplored task of detecting synthetic tabular data "in the wild", i.e., when the detector is deployed on tables with variable and previously unseen schemas. We introduce a novel datum-wise transformer architecture that significantly outperforms the only previously published baseline, improving both AUC and accuracy by 7 points. By incorporating a table-adaptation component, our model gains an additional 7 accuracy points, demonstrating enhanced robustness. This work provides the first strong evidence that detecting synthetic tabular data in real-world conditions is feasible, and demonstrates substantial improvements over previous approaches. Following acceptance of the paper, we are finalizing the administrative and licensing procedures necessary for releasing the source code. This extended version will be updated as soon as the release is complete.
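For readers less familiar with the two metrics quoted above, the following is a minimal sketch of how AUC and accuracy are typically computed for a binary real-vs-synthetic detector. It is not the authors' evaluation code. The label convention (1 = synthetic, 0 = real), the 0.5 decision threshold, the function names, and the toy scores are all assumptions made for illustration. It also assumes that "7 points" means 0.07 on the 0-to-1 scale used here.

```python
import numpy as np

def accuracy(labels, scores, threshold=0.5):
    """Fraction of rows whose thresholded score matches the label.

    Assumed convention: label 1 = synthetic, label 0 = real.
    """
    labels = np.asarray(labels)
    preds = (np.asarray(scores, dtype=float) >= threshold).astype(int)
    return float((preds == labels).mean())

def auc(labels, scores):
    """Area under the ROC curve via its pairwise (Mann-Whitney) form.

    Equals the probability that a randomly chosen synthetic row receives
    a higher score than a randomly chosen real row; ties count as half.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]  # all (synthetic, real) score pairs
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Toy detector outputs (illustrative values, not results from the paper).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores), accuracy(labels, scores))
```

AUC is threshold-free and depends only on how the scores rank the rows, while accuracy depends on the chosen threshold, which is one reason the two can move independently.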
Key Contributions
- Novel datum-wise transformer architecture that is table-agnostic and invariant to column permutations, enabling deployment on tables with previously unseen schemas
- Table-adaptation component that provides an additional 7 accuracy points of robustness under cross-table schema shift
- First strong evidence that detecting synthetic tabular data in real-world variable-schema conditions is feasible, outperforming the prior baseline by 7 points in both AUC and accuracy
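The column-permutation invariance named in the first contribution can be illustrated with a small sketch. This is not the paper's architecture: it only shows the general mechanism by which a self-attention layer without positional encodings, followed by mean pooling, produces the same row representation regardless of column order. The hashed datum embedding, the single attention head, the dimension `DIM`, the random weights, and the example row are all hypothetical choices made for illustration.

```python
import hashlib
import numpy as np

DIM = 16  # hypothetical embedding size
_rng = np.random.default_rng(0)
# Fixed random projections standing in for learned attention weights.
W_Q, W_K, W_V = (_rng.standard_normal((DIM, DIM)) / np.sqrt(DIM) for _ in range(3))

def embed_datum(column, value):
    """Hypothetical datum embedding: a fixed pseudo-random vector derived
    from the (column name, value) pair. No column position is used."""
    key = f"{column}={value}".encode("utf-8")
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return np.random.default_rng(seed).standard_normal(DIM)

def encode_row(row):
    """Encode a row (dict of column -> value) as one vector.

    Self-attention without positional encodings is permutation-equivariant
    over its tokens, and mean pooling removes the remaining order dependence.
    """
    tokens = np.stack([embed_datum(c, v) for c, v in row.items()])  # (n, DIM)
    q, k, v = tokens @ W_Q, tokens @ W_K, tokens @ W_V
    scores = q @ k.T / np.sqrt(DIM)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return (weights @ v).mean(axis=0)

row = {"age": 42, "city": "Rennes", "income": 31000.5}
shuffled = {"income": 31000.5, "age": 42, "city": "Rennes"}
print(np.allclose(encode_row(row), encode_row(shuffled)))
```

Because each token carries its column name and value rather than a column index, the same encoder can be applied to a table with a different number of columns or an unseen schema, which is the property the contribution list describes as table-agnostic.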
🛡️ Threat Analysis
The paper's primary contribution is a novel architecture for detecting AI-generated (synthetic) tabular data — a form of AI-generated content detection. It introduces a new detection methodology (datum-wise transformer + table-adaptation) rather than merely applying existing detectors to a new domain, analogous to novel deepfake or AI-text detection architectures that qualify under ML09.