BAID: A Benchmark for Bias Assessment of AI Detectors
Priyam Basu, Yunfeng Zhang, Vipul Raheja
Published on arXiv (arXiv:2512.11505)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
All four evaluated detectors show consistently low recall for texts from underrepresented demographic groups, indicating systematic sociolinguistic bias in deployed AI text detectors.
BAID
Novel technique introduced
AI-generated text detectors have recently gained adoption in educational and professional contexts. Prior research has uncovered isolated cases of bias, particularly against English Language Learners (ELLs); however, there is a lack of systematic evaluation of such systems across broader sociolinguistic factors. In this work, we propose BAID, a comprehensive evaluation framework for AI detectors across various types of biases. As part of the framework, we introduce over 200k samples spanning 7 major categories: demographics, age, educational grade level, dialect, formality, political leaning, and topic. We also generate synthetic versions of each sample with carefully crafted prompts that preserve the original content while reflecting subgroup-specific writing styles. Using this, we evaluate four open-source state-of-the-art AI text detectors and find consistent disparities in detection performance, particularly low recall rates for texts from underrepresented groups. Our contributions provide a scalable, transparent approach for auditing AI detectors and emphasize the need for bias-aware evaluation before these tools are deployed for public use.
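The core of the audit described above is comparing detection recall across subgroups: since every sample in a given slice is AI-generated, recall is simply the fraction flagged. A minimal sketch of this per-subgroup computation (the `detector` callable and toy data are hypothetical stand-ins, not BAID's actual detectors or corpus):

```python
from collections import defaultdict

def recall_by_subgroup(samples, detector):
    """Compute detection recall separately for each subgroup.

    `samples` is a list of (text, subgroup) pairs, all assumed to be
    AI-generated, so per-group recall is the fraction the detector flags.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for text, subgroup in samples:
        totals[subgroup] += 1
        if detector(text):  # detector returns True if text is flagged as AI
            hits[subgroup] += 1
    return {g: hits[g] / totals[g] for g in totals}

# Toy detector that only flags texts containing the word "delve"
toy_detector = lambda text: "delve" in text.lower()

samples = [
    ("We delve into the topic.", "formal"),
    ("Let us delve deeper here.", "formal"),
    ("gonna check this out rn", "informal"),
    ("that was wild lol", "informal"),
]
print(recall_by_subgroup(samples, toy_detector))
# formal recall is 1.0, informal recall is 0.0 -> a detection disparity
```

A real audit would report these per-group recalls side by side (and their gaps), which is exactly the kind of disparity the paper observes for underrepresented groups.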
Key Contributions
- BAID evaluation framework with 200k+ samples spanning 7 sociolinguistic bias categories (demographics, age, grade level, dialect, formality, political leaning, topic)
- Synthetic counterpart generation using LLM prompting to preserve content while reflecting subgroup-specific writing styles
- Systematic evaluation of four open-source AI text detectors revealing consistent performance disparities against underrepresented groups
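The synthetic-counterpart contribution hinges on prompts that hold content fixed while varying style. A hedged sketch of how such a prompt might be templated (this wording is illustrative; BAID's actual prompts are not reproduced here):

```python
def counterpart_prompt(original: str, subgroup: str) -> str:
    """Build a hypothetical rewrite prompt asking an LLM to preserve the
    content of `original` while adopting the writing style of `subgroup`."""
    return (
        "Rewrite the following text so that it keeps the same meaning and "
        "factual content, but reflects the typical writing style of "
        f"a {subgroup} author. Do not add or remove information.\n\n"
        f"Text: {original}"
    )

prompt = counterpart_prompt("The results were significant.", "middle-school student")
print(prompt)
```

Pairing each original with its styled counterpart lets the benchmark attribute any change in detector output to style alone, since the content is held constant by construction.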
🛡️ Threat Analysis
Proposes BAID, an evaluation framework targeting AI-generated text detection systems, and reveals consistent reliability failures (low recall) for underrepresented demographic groups, directly assessing the output integrity of deployed AI content detectors.