Survey · 2025

SoK: Understanding (New) Security Issues Across AI4Code Use Cases

Qilong Wu, Taoran Li, Tianyang Zhou, Varun Chandrasekaran

1 citation · 142 references · arXiv


Published on arXiv: 2512.18456

Input Manipulation Attack · OWASP ML Top 10 (ML01)

Prompt Injection · OWASP LLM Top 10 (LLM01)

Key Finding

Across six SOTA AI4Code models, insecure code patterns persist in generation, vulnerability detectors fail under semantic-preserving adversarial transformations, and fine-tuning frequently degrades rather than improves security alignment.
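As a hypothetical illustration of the first point (not drawn from the paper's evaluation, and using an assumed `users` schema), the classic injection pattern below is exactly the kind of insecure output that secure-by-default generation aims to eliminate:

```python
import sqlite3

# Hypothetical example: an insecure pattern of the kind the survey
# reports persisting in generated code, next to the safe alternative.

def find_user_insecure(conn, username):
    # CWE-89: string-formatted SQL lets attacker input alter the query
    return conn.execute(
        f"SELECT id FROM users WHERE name = '{username}'"
    ).fetchall()

def find_user_secure(conn, username):
    # Parameterized query: the secure-by-default alternative
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (username,)
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

payload = "x' OR '1'='1"
print(len(find_user_insecure(conn, payload)))  # 1 — injection leaks a row
print(len(find_user_secure(conn, payload)))    # 0 — payload treated as data
```

Both functions are semantically "correct" on benign input, which is why functional benchmarks alone miss the difference.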


AI-for-Code (AI4Code) systems are reshaping software engineering, with tools like GitHub Copilot accelerating code generation, translation, and vulnerability detection. Alongside these advances, however, security risks remain pervasive: insecure outputs, biased benchmarks, and susceptibility to adversarial manipulation undermine their reliability. This SoK surveys the landscape of AI4Code security across three core applications, identifying recurring gaps: benchmark dominance by Python and toy problems, lack of standardized security datasets, data leakage in evaluation, and fragile adversarial robustness. A comparative study of six state-of-the-art models illustrates these challenges: insecure patterns persist in code generation, vulnerability detection is brittle to semantic-preserving attacks, fine-tuning often misaligns security objectives, and code translation yields uneven security benefits. From this analysis, we distill three forward paths: embedding secure-by-default practices in code generation, building robust and comprehensive detection benchmarks, and leveraging translation as a route to security-enhanced languages. We call for a shift toward security-first AI4Code, where vulnerability mitigation and robustness are embedded throughout the development life cycle.


Key Contributions

  • Systematic survey of security issues across three AI4Code applications (code generation, vulnerability detection, code translation) with comparative evaluation of six SOTA models
  • Identification of recurring structural gaps: benchmark bias toward Python/toy problems, absence of standardized security datasets, and data leakage in evaluation pipelines
  • Evidence that vulnerability detection is brittle to semantic-preserving adversarial attacks, fine-tuning misaligns security objectives, and code translation yields inconsistent security benefits

🛡️ Threat Analysis

Input Manipulation Attack

The paper explicitly studies and demonstrates that vulnerability detection models are brittle to semantic-preserving adversarial code transformations at inference time, a direct adversarial evasion attack on ML classifiers.
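The attack surface can be sketched concretely. A minimal semantic-preserving transformation is identifier renaming: program behavior is unchanged, but a token-sensitive detector sees different input. The snippet and variable mapping below are illustrative assumptions, not the paper's actual transformation suite:

```python
import ast

# Sketch of a semantic-preserving adversarial transformation:
# rename identifiers in a snippet (here with an intentional
# truncation bug standing in for a "vulnerability"). Behavior is
# identical, but surface-level detectors see different tokens.

SOURCE = """
def unsafe_copy(data):
    buffer = data[:16]
    return buffer
"""

class RenameVars(ast.NodeTransformer):
    MAPPING = {"data": "v0", "buffer": "v1"}  # assumed mapping

    def visit_Name(self, node):
        node.id = self.MAPPING.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.MAPPING.get(node.arg, node.arg)
        return node

tree = RenameVars().visit(ast.parse(SOURCE))
transformed = ast.unparse(tree)  # Python 3.9+

# Verify semantics are preserved: both versions compute the same result
ns_a, ns_b = {}, {}
exec(SOURCE, ns_a)
exec(transformed, ns_b)
assert ns_a["unsafe_copy"](b"A" * 32) == ns_b["unsafe_copy"](b"A" * 32)
print(transformed)
```

Real evasion attacks layer many such rewrites (dead-code insertion, statement reordering, loop transformations); identifier renaming alone already illustrates why pattern-matching detectors are fragile.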


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, training_time, grey_box
Applications
code generation, vulnerability detection, code translation