
AICD Bench: A Challenging Benchmark for AI-Generated Code Detection

Daniil Orel 1, Dilshod Azizov 1, Indraneil Paul 2, Yuxia Wang 3, Iryna Gurevych 1,2, Preslav Nakov 1

0 citations · 63 references · arXiv (Cornell University)


Published on arXiv

2602.02079

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Extensive evaluation across 77 generator models and 9 programming languages shows that detection performance remains far below practical usability, particularly under distribution shift and for hybrid or adversarial code.

AICD Bench

Novel technique introduced


Large language models (LLMs) are increasingly capable of generating functional source code, raising concerns about authorship, accountability, and security. While detecting AI-generated code is critical, existing datasets and benchmarks are narrow, typically limited to binary human-machine classification under in-distribution settings. To bridge this gap, we introduce AICD Bench, the most comprehensive benchmark for AI-generated code detection. It spans 2M examples, 77 models across 11 families, and 9 programming languages, including recent reasoning models. Beyond scale, AICD Bench introduces three realistic detection tasks: (i) robust binary classification under distribution shifts in language and domain, (ii) model family attribution, grouping generators by architectural lineage, and (iii) fine-grained human-machine classification across human, machine, hybrid, and adversarial code. Extensive evaluation of neural and classical detectors shows that performance remains far below practical usability, particularly under distribution shift and for hybrid or adversarial code. We release AICD Bench as a unified, challenging evaluation suite to drive the next generation of robust approaches for AI-generated code detection. The data and the code are available at https://huggingface.co/AICD-bench.


Key Contributions

  • AICD Bench: a 2M-sample benchmark spanning 77 LLMs across 11 model families and 9 programming languages for AI-generated code detection
  • Three novel evaluation tasks: robust binary classification under distribution shift, model family attribution, and fine-grained classification across human/machine/hybrid/adversarial code
  • Empirical evaluation showing current classical and neural detectors generalize poorly under OOD settings, especially on hybrid and adversarial code
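The fine-grained task is a four-way classification over human, machine, hybrid, and adversarial code, where the paper reports that detectors degrade sharply on the latter two classes. A standard way to score such a task is macro-averaged F1, which weights each class equally regardless of its frequency. The sketch below is illustrative only (the label names and toy predictions are assumptions, not taken from the benchmark's released evaluation code); it shows how a detector that collapses hybrid and adversarial code into "machine" is penalized by the macro average even while its binary human-vs-machine accuracy looks perfect.

```python
LABELS = ["human", "machine", "hybrid", "adversarial"]

def macro_f1(y_true, y_pred, labels=LABELS):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(labels)

# Toy illustration: a detector that labels all non-human code as "machine".
y_true = ["human", "machine", "hybrid", "adversarial", "human", "machine"]
y_pred = ["human", "machine", "machine", "machine", "human", "machine"]

print(round(macro_f1(y_true, y_pred), 3))  # 0.417: hybrid/adversarial score 0
```

Because the hybrid and adversarial classes each contribute an F1 of zero here, the macro score drops to 0.417 even though every example was correctly split into human vs. non-human, mirroring the kind of gap the benchmark is designed to expose.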

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses AI-generated content detection — specifically source code — which falls under output integrity and content provenance. The benchmark evaluates detector robustness across distribution shifts, model family attribution, and adversarial/hybrid code, advancing the field of AI-generated content detection rather than merely applying existing methods to a narrow domain.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Datasets
AICD Bench
Applications
ai-generated code detection, academic integrity, software security, authorship attribution