benchmark 2025

BAID: A Benchmark for Bias Assessment of AI Detectors

Priyam Basu, Yunfeng Zhang, Vipul Raheja

0 citations · 26 references · arXiv


Published on arXiv

2512.11505

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

All four evaluated detectors show consistently low recall for texts from underrepresented demographic groups, indicating systematic sociolinguistic bias in deployed AI text detectors

BAID

Novel technique introduced


AI-generated text detectors have recently gained adoption in educational and professional contexts. Prior research has uncovered isolated cases of bias, particularly against English Language Learners (ELLs); however, there is a lack of systematic evaluation of such systems across broader sociolinguistic factors. In this work, we propose BAID, a comprehensive evaluation framework for AI detectors across various types of biases. As part of the framework, we introduce over 200k samples spanning 7 major categories: demographics, age, educational grade level, dialect, formality, political leaning, and topic. We also generate synthetic versions of each sample with carefully crafted prompts that preserve the original content while reflecting subgroup-specific writing styles. Using this benchmark, we evaluate four open-source state-of-the-art AI text detectors and find consistent disparities in detection performance, particularly low recall rates for texts from underrepresented groups. Our contributions provide a scalable, transparent approach for auditing AI detectors and emphasize the need for bias-aware evaluation before these tools are deployed for public use.
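
To make the counterpart-generation step concrete, here is a minimal sketch of how subgroup-styled synthetic versions of each sample could be produced. It is illustrative only: the paper's actual prompts are not reproduced in this summary, and `build_rewrite_prompt`, `synthesize_counterparts`, `generate`, and the record fields are hypothetical names standing in for whatever LLM and data schema the authors use.

```python
# Hypothetical sketch of the synthetic-counterpart step described above.
# `generate` stands in for any text-generation call (hosted LLM API or local model).

def build_rewrite_prompt(original_text: str, subgroup: str) -> str:
    """Ask the LLM to preserve content while adopting a subgroup's writing style."""
    return (
        f"Rewrite the following text so that it keeps the same meaning and facts, "
        f"but reflects the typical writing style of the '{subgroup}' subgroup "
        f"(e.g., dialect, formality, grade level). Return only the rewritten text.\n\n"
        f"Text:\n{original_text}"
    )

def synthesize_counterparts(samples, subgroups, generate):
    """Produce one machine-generated counterpart per (sample, subgroup) pair."""
    counterparts = []
    for sample in samples:
        for subgroup in subgroups:
            prompt = build_rewrite_prompt(sample["text"], subgroup)
            counterparts.append({
                "source_id": sample["id"],
                "subgroup": subgroup,
                "text": generate(prompt),   # LLM call supplied by the caller
                "label": "ai-generated",
            })
    return counterparts
```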


Key Contributions

  • BAID evaluation framework with 200k+ samples spanning 7 sociolinguistic bias categories (demographics, age, grade level, dialect, formality, political leaning, topic)
  • Synthetic counterpart generation using LLM prompting to preserve content while reflecting subgroup-specific writing styles
  • Systematic evaluation of four open-source AI text detectors revealing consistent performance disparities against underrepresented groups
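
The core audit behind the third contribution, per-subgroup detection recall on AI-generated text, can be sketched as follows. This is not the authors' evaluation code; `detector`, the record fields, and `recall_gap` are assumed names used only to illustrate how such a bias audit is typically computed.

```python
# Illustrative sketch of a per-subgroup audit: recall on AI-generated text,
# broken down by sociolinguistic subgroup, so gaps against
# underrepresented groups become visible.

from collections import defaultdict

def recall_by_subgroup(samples, detector):
    """samples: dicts with 'text', 'subgroup', and gold 'label' in {'human', 'ai-generated'}.
    detector: callable returning True when it flags a text as AI-generated."""
    hits = defaultdict(int)     # correctly flagged AI texts per subgroup
    totals = defaultdict(int)   # all AI texts per subgroup
    for s in samples:
        if s["label"] != "ai-generated":
            continue            # recall is computed over AI-generated texts only
        totals[s["subgroup"]] += 1
        if detector(s["text"]):
            hits[s["subgroup"]] += 1
    return {g: hits[g] / totals[g] for g in totals if totals[g] > 0}

def recall_gap(per_group_recall):
    """Largest disparity between the best- and worst-served subgroups."""
    values = list(per_group_recall.values())
    return max(values) - min(values)
```

A large gap across, say, dialect or grade-level subgroups is the kind of disparity the paper reports.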

🛡️ Threat Analysis

Output Integrity Attack

Proposes an evaluation framework (BAID) for AI-generated text detection systems, revealing consistent reliability failures (low recall) on texts from underrepresented demographic groups, a direct assessment of the output integrity of AI content detectors.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
BAID (200k+ samples, introduced by authors)
Applications
ai-generated text detection, academic integrity tools, content moderation