Semantics-Preserving Evasion of LLM Vulnerability Detectors
Luze Sun, Alina Oprea, Eric Wong
Published on arXiv
2602.00305
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
State-of-the-art LLM vulnerability detectors with 70%+ clean accuracy fail systematically under semantics-preserving adversarial code transformations, with universal adversarial strings transferring successfully to black-box proprietary APIs.
Carrier-Constrained GCG
Novel technique introduced
LLM-based vulnerability detectors are increasingly deployed in security-critical code review, yet their resilience to evasion under behavior-preserving edits remains poorly understood. We evaluate detection-time integrity under a semantics-preserving threat model by instantiating diverse behavior-preserving code transformations on a unified C/C++ benchmark (N=5000), and introduce a metric of joint robustness across different attack methods/carriers. Across models, we observe a systemic failure under semantics-preserving adversarial transformations: even state-of-the-art vulnerability detectors that perform well on clean inputs see their predictions flip under behavior-equivalent edits. Universal adversarial strings optimized on a single surrogate model remain effective when transferred to black-box APIs, and gradient access can further amplify evasion success. These results show that even high-performing detectors are vulnerable to low-cost, semantics-preserving evasion. Our carrier-based metrics provide practical diagnostics for evaluating LLM-based code detectors.
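To make the threat model concrete, the sketch below applies one behavior-preserving transformation, whole-word identifier substitution (one of the carriers the paper names), to a C snippet at the source-text level. The function and variable names are hypothetical, and the regex-based rename is a minimal sketch; real tooling would parse the code to avoid touching strings, comments, or unrelated tokens.

```python
import re

def rename_identifiers(c_source: str, mapping: dict[str, str]) -> str:
    """Rename identifiers in C source text, matching whole words only.

    A purely lexical sketch of the identifier-substitution carrier:
    only names change, so program behavior is preserved.
    """
    for old, new in mapping.items():
        c_source = re.sub(rf"\b{re.escape(old)}\b", new, c_source)
    return c_source

# Hypothetical vulnerable snippet (classic unbounded strcpy).
vulnerable = """
void copy_input(char *user_input) {
    char buf[16];
    strcpy(buf, user_input);  /* no bounds check */
}
"""

# Semantics-preserving edit: the overflow is unchanged, only names differ.
evasive = rename_identifiers(vulnerable, {
    "copy_input": "render_widget",
    "user_input": "style_token",
    "buf": "palette",
})
print(evasive)
```

The transformed snippet compiles to the same behavior as the original, which is exactly why a detector whose prediction flips under such an edit has failed on a semantically identical input.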
Key Contributions
- Carrier-constrained GCG: gradient-based adversarial optimization restricted to compile-valid code carriers (identifier substitution, preprocessor macros) that preserves program semantics while evading LLM vulnerability detectors
- Complete Resistance (CR) metric measuring the fraction of vulnerabilities that withstand all semantics-preserving transformations simultaneously, enabling joint robustness evaluation across attack carriers
- Empirical demonstration that high-performing detectors (70%+ clean accuracy) collapse under semantics-preserving edits and that universal adversarial strings transfer effectively from surrogate models to black-box APIs
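The Complete Resistance metric from the contributions above can be sketched as follows. The dictionary-of-carriers representation and all names here are illustrative assumptions, not the paper's implementation; the metric itself is simply the fraction of samples whose detection survives every carrier simultaneously.

```python
def complete_resistance(per_carrier_detected: dict[str, list[bool]]) -> float:
    """Fraction of vulnerable samples still detected under EVERY carrier.

    per_carrier_detected maps each attack carrier (e.g. "identifier_sub",
    "macro") to a per-sample list: True if the detector still flags the
    vulnerability after that transformation is applied.
    """
    carriers = list(per_carrier_detected.values())
    n = len(carriers[0])
    assert all(len(c) == n for c in carriers), "carriers must cover the same samples"
    # A sample counts only if it is flagged under all carriers at once.
    survived_all = sum(all(c[i] for c in carriers) for i in range(n))
    return survived_all / n

# Toy example: 4 samples, two carriers; samples 0 and 3 survive both.
detections = {
    "identifier_sub": [True, False, True, True],
    "macro":          [True, True, False, True],
}
print(complete_resistance(detections))  # → 0.5
```

Joint evaluation is the point: a detector can look robust against each carrier in isolation while its Complete Resistance is much lower, because different samples fail under different carriers.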
🛡️ Threat Analysis
The paper crafts adversarial inputs (semantics-preserving code transformations, including carrier-constrained GCG) that flip the predictions of LLM-based vulnerability detectors at inference time without altering program behavior, making this a direct adversarial evasion attack on a classifier.
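The carrier-constrained optimization can be illustrated with a toy greedy search. In place of GCG's gradient-guided token swaps, this sketch scores candidate identifier renamings against a stand-in keyword-based "detector" and keeps any swap that lowers its vulnerability score. The scorer, candidate pool, and greedy loop are all illustrative stand-ins for the paper's gradient-based method; only the constraint is the same, namely that every edit stays inside a semantics-preserving carrier.

```python
import re

def toy_detector_score(code: str) -> float:
    """Stand-in surrogate: counts suspicious-looking substrings.

    A real attack would use gradients or logits from an LLM detector.
    """
    suspicious = ["strcpy", "overflow", "buf", "exploit"]
    return sum(code.count(tok) for tok in suspicious)

def carrier_constrained_search(code: str, renameable: list[str],
                               candidates: list[str]) -> str:
    """Greedy search restricted to the identifier-substitution carrier.

    Semantics are preserved because every edit is a whole-word rename of
    an attacker-controlled identifier; keywords and library calls such as
    strcpy stay untouched.
    """
    best = code
    for ident in renameable:
        for cand in candidates:
            # Skip candidates already present, to avoid name collisions
            # that would change program semantics.
            if re.search(rf"\b{re.escape(cand)}\b", best):
                continue
            trial = re.sub(rf"\b{re.escape(ident)}\b", cand, best)
            if toy_detector_score(trial) < toy_detector_score(best):
                best = trial
                break  # keep the first improving rename for this identifier
    return best

source = "char buf[16]; strcpy(buf, overflow_data);"
evaded = carrier_constrained_search(
    source,
    renameable=["buf", "overflow_data"],  # identifiers the attacker controls
    candidates=["arr", "style_cache"],
)
print(evaded)
```

Even this crude search lowers the surrogate's score while leaving the actual unbounded `strcpy` in place, which mirrors the paper's finding that low-cost, behavior-preserving edits suffice to evade high-accuracy detectors.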