Detecting Stealthy Data Poisoning Attacks in AI Code Generators
Published on arXiv
2508.21636
Data Poisoning Attack
OWASP ML Top 10 — ML02
Key Finding
All three evaluated detection methods (spectral signatures, activation clustering, static analysis) fail to reliably detect triggerless data poisoning in code generation models, with representation-based approaches unable to isolate poisoned samples and static analysis producing high false positives and negatives.
Deep learning (DL) models for natural language-to-code generation have become integral to modern software development pipelines. However, their heavy reliance on large amounts of data, often collected from unsanitized online sources, exposes them to data poisoning attacks, where adversaries inject malicious samples to subtly bias model behavior. Recent targeted attacks silently replace secure code with semantically equivalent but vulnerable implementations without relying on explicit triggers to launch the attack, making it especially hard for detection methods to distinguish clean from poisoned samples. We present a systematic study on the effectiveness of existing poisoning detection methods under this stealthy threat model. Specifically, we perform targeted poisoning on three DL models (CodeBERT, CodeT5+, AST-T5), and evaluate spectral signatures analysis, activation clustering, and static analysis as defenses. Our results show that all methods struggle to detect triggerless poisoning, with representation-based approaches failing to isolate poisoned samples and static analysis suffering false positives and false negatives, highlighting the need for more robust, trigger-independent defenses for AI-assisted code generation.
Key Contributions
- Systematic evaluation of stealthy, triggerless data poisoning attacks on three code generation transformer models (CodeBERT, CodeT5+, AST-T5)
- Assessment of three defense categories — spectral signatures analysis, activation clustering, and static analysis — against triggerless poisoning
- Empirical finding that all evaluated detection methods fail against triggerless poisoning, motivating the need for trigger-independent defenses
🛡️ Threat Analysis
Paper studies targeted, triggerless data poisoning attacks that inject malicious training samples into code generation models (CodeBERT, CodeT5+, AST-T5) to bias them toward generating vulnerable code — the core threat is training data corruption without explicit triggers, which is ML02 (not ML10, which requires specific trigger-activated hidden behavior).