attack 2026

Semantics-Preserving Evasion of LLM Vulnerability Detectors

Luze Sun 1,2, Alina Oprea 1, Eric Wong 2

0 citations · 28 references · arXiv


Published on arXiv · 2602.00305

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

State-of-the-art LLM vulnerability detectors with 70%+ clean accuracy fail systematically under semantics-preserving adversarial code transformations, with universal adversarial strings transferring successfully to black-box proprietary APIs.

Carrier-Constrained GCG

Novel technique introduced


LLM-based vulnerability detectors are increasingly deployed in security-critical code review, yet their resilience to evasion under behavior-preserving edits remains poorly understood. We evaluate detection-time integrity under a semantics-preserving threat model by instantiating diverse behavior-preserving code transformations on a unified C/C++ benchmark (N=5000), and introduce a metric of joint robustness across attack methods and carriers. Across models, we observe a systematic failure under semantics-preserving adversarial transformations: even state-of-the-art vulnerability detectors that perform well on clean inputs flip their predictions under behavior-equivalent edits. Universal adversarial strings optimized on a single surrogate model remain effective when transferred to black-box APIs, and gradient access further amplifies evasion success. These results show that even high-performing detectors are vulnerable to low-cost, semantics-preserving evasion. Our carrier-based metrics provide practical diagnostics for evaluating LLM-based code detectors.


Key Contributions

  • Carrier-constrained GCG: gradient-based adversarial optimization restricted to compile-valid code carriers (identifier substitution, preprocessor macros) that preserves program semantics while evading LLM vulnerability detectors
  • Complete Resistance (CR) metric measuring the fraction of vulnerabilities that withstand all semantics-preserving transformations simultaneously, enabling joint robustness evaluation across attack carriers
  • Empirical demonstration that high-performing detectors (70%+ clean accuracy) collapse under semantics-preserving edits and that universal adversarial strings transfer effectively from surrogate models to black-box APIs
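The Complete Resistance (CR) metric above can be sketched as follows. This is an illustrative reimplementation, not the paper's code; the data layout (a per-sample dict of per-carrier detection outcomes) and the sample/carrier names are assumptions made for the example.

```python
# Sketch of the Complete Resistance (CR) metric: the fraction of vulnerable
# samples whose detection survives *every* semantics-preserving carrier
# simultaneously, giving a joint robustness score across attack carriers.
def complete_resistance(predictions: dict) -> float:
    """predictions[sample_id][carrier] is True when the detector still
    flags the transformed (but behavior-equivalent) sample as vulnerable."""
    if not predictions:
        return 0.0
    resistant = sum(
        1 for per_carrier in predictions.values() if all(per_carrier.values())
    )
    return resistant / len(predictions)

# Toy example: three vulnerable samples, two carriers each (names hypothetical).
preds = {
    "cwe787_a": {"identifier_sub": True,  "macro_insert": True},   # survives all
    "cwe125_b": {"identifier_sub": True,  "macro_insert": False},  # flips once
    "cwe416_c": {"identifier_sub": False, "macro_insert": False},  # flips on both
}
print(complete_resistance(preds))  # → 0.333... (only one of three resists all)
```

Because a sample counts as resistant only if it withstands all carriers at once, CR is strictly harder to score well on than per-carrier accuracy, which is what makes it a joint robustness diagnostic.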

🛡️ Threat Analysis

Input Manipulation Attack

The paper crafts adversarial inputs (semantics-preserving code transformations including carrier-constrained GCG) that flip LLM-based vulnerability detector predictions at inference time without altering program behavior — a direct adversarial evasion attack on a classifier.
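A minimal sketch of one such carrier, identifier substitution: renaming a variable leaves the compiled program's behavior unchanged while altering the token sequence the detector reads. This is an assumption-laden illustration, not the paper's implementation; a real tool would use a C parser rather than a regex to avoid renaming inside strings, comments, or unrelated scopes, and in carrier-constrained GCG the replacement name would be gradient-optimized rather than hand-picked.

```python
import re

def rename_identifier(c_source: str, old: str, new: str) -> str:
    """Replace whole-word occurrences of identifier `old` with `new`.
    Word boundaries (\\b) keep `buf` from matching inside `buffer`."""
    return re.sub(rf"\b{re.escape(old)}\b", new, c_source)

vulnerable = """
void copy(char *src) {
    char buf[8];
    strcpy(buf, src);  /* stack overflow if src exceeds 7 bytes */
}
"""
# The semantics (and the overflow) are untouched; only surface tokens change,
# which is the attack surface the detector is evaluated against.
print(rename_identifier(vulnerable, "buf", "validated_output"))
```

Preprocessor-macro insertion works analogously: the added macro expands to nothing (or to an identity), so compiled behavior is preserved while the source the detector classifies is perturbed.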


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, black_box, inference_time, targeted, digital
Datasets
C/C++ vulnerability benchmark (N=5000)
Applications
llm-based code vulnerability detection, automated code review