Robust Detection of Synthetic Tabular Data under Schema Variability

The rise of powerful generative models has sparked concerns over data authenticity. While detection methods have been extensively developed for images and text, the case of tabular data, despite its ubiquity, has been largely overlooked. Yet, detecting synthetic tabular data is especially challenging due to its heterogeneous structure and unseen formats at test time. We address the underexplored task of detecting synthetic tabular data "in the wild", i.e., when the detector is deployed on tables with variable and previously unseen schemas. We introduce a novel datum-wise transformer architecture that significantly outperforms the only previously published baseline, improving both AUC and accuracy by 7 points. By incorporating a table-adaptation component, our model gains an additional 7 accuracy points, demonstrating enhanced robustness. This work provides the first strong evidence that detecting synthetic tabular data in real-world conditions is feasible, and demonstrates substantial improvements over previous approaches. Following acceptance of the paper, we are finalizing the administrative and licensing procedures necessary for releasing the source code. This extended version will be updated as soon as the release is complete.
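
The abstract reports gains in AUC and accuracy. As a reminder of what these two metrics measure for a real-versus-synthetic detector, here is a minimal sketch in plain Python. The function names, the 0.5 threshold, and the toy labels and scores are illustrative assumptions; none of them come from the paper, and its evaluation code is not yet released.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a synthetic row outscores a real one.

    labels: 1 = synthetic, 0 = real (an assumed convention).
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of rows classified correctly at a fixed score threshold."""
    correct = sum((s >= threshold) == (y == 1) for y, s in zip(labels, scores))
    return correct / len(labels)


# Toy example with made-up detector scores (not results from the paper).
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_acc = accuracy(toy_labels, toy_scores)
```

AUC is threshold-free, whereas accuracy depends on the chosen cut-off, which is one reason the two metrics are usually reported together.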

Key Contributions

Novel datum-wise transformer architecture that is table-agnostic and invariant to column permutations, enabling deployment on tables with previously unseen schemas
Table-adaptation component that provides an additional 7 accuracy points of robustness under cross-table schema shift
First strong evidence that detecting synthetic tabular data in real-world variable-schema conditions is feasible, outperforming the prior baseline by 7 points in both AUC and accuracy
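
The table-agnostic, permutation-invariant property named in the first contribution can be illustrated with a toy encoder. The sketch below is not the paper's architecture: it assumes each row is treated as an unordered set of (column name, value) tokens, replaces learned embeddings with a hash-based stand-in, and replaces the transformer with simple mean pooling. All names (`token_embedding`, `encode_row`, `DIM`) are hypothetical.

```python
import hashlib
import math

DIM = 8  # hypothetical embedding width, chosen only for illustration


def token_embedding(column, value, dim=DIM):
    """Deterministic stand-in for a learned embedding of one (column, value) pair."""
    digest = hashlib.sha256(f"{column}={value}".encode("utf-8")).digest()
    # Map the first `dim` bytes of the hash to floats in [-1, 1].
    return [b / 127.5 - 1.0 for b in digest[:dim]]


def encode_row(row, dim=DIM):
    """Encode one row (a dict) by mean-pooling its token embeddings.

    No positional information is used, so column order cannot matter.
    math.fsum returns a correctly rounded sum, which keeps the result
    bit-identical regardless of the order in which tokens are visited.
    """
    tokens = [token_embedding(c, v, dim) for c, v in row.items()]
    return [math.fsum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


# Same row, two column orders: the encodings are identical.
row = {"age": 39, "workclass": "State-gov", "hours": 40}
shuffled = dict(reversed(list(row.items())))
same_encoding = encode_row(row) == encode_row(shuffled)

# A row from a table with a completely different schema still encodes
# to a vector of the same width, with no schema-specific parameters.
other = encode_row({"length": 0.455, "rings": 15})
```

Because the encoder never indexes columns by position and holds no per-table parameters, the same function applies to a table whose schema it has never seen, which is the property the contribution list describes.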

🛡️ Threat Analysis

Output Integrity Attack

The paper's primary contribution is a novel architecture for detecting AI-generated (synthetic) tabular data, which is a form of AI-generated content detection. It introduces a new detection methodology (a datum-wise transformer plus table adaptation) rather than merely applying existing detectors to a new domain. This makes it analogous to the novel deepfake and AI-text detection architectures that qualify under ML09.

Details

Domains

tabular

Model Types

transformer

Threat Tags

inference_time

Datasets

Adult, Insurance, Higgs, Abalone

Applications

2025 0 cit.

Output Integrity Attack

60%
