Semantics-Preserving Evasion of LLM Vulnerability Detectors
Luze Sun, Alina Oprea, Eric Wong
Published on arXiv
2602.00305
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
State-of-the-art LLM vulnerability detectors with 70%+ clean accuracy fail systematically under semantics-preserving adversarial code transformations, with universal adversarial strings transferring successfully to black-box proprietary APIs.
Carrier-Constrained GCG
Novel technique introduced
LLM-based vulnerability detectors are increasingly deployed in security-critical code review, yet their resilience to evasion under behavior-preserving edits remains poorly understood. We evaluate detection-time integrity under a semantics-preserving threat model by instantiating diverse behavior-preserving code transformations on a unified C/C++ benchmark (N=5000), and introduce a metric of joint robustness across different attack methods/carriers. Across models, we observe a systemic failure under semantics-preserving adversarial transformations: even state-of-the-art vulnerability detectors that perform well on clean inputs see their predictions flip under behavior-equivalent edits. Universal adversarial strings optimized on a single surrogate model remain effective when transferred to black-box APIs, and gradient access can further amplify evasion success. These results show that even high-performing detectors are vulnerable to low-cost, semantics-preserving evasion. Our carrier-based metrics provide practical diagnostics for evaluating LLM-based code detectors.
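To make the threat model concrete, the sketch below applies one behavior-preserving transformation, whole-word identifier substitution (one of the carriers the paper names), to a C snippet at the source-text level. The function and variable names are hypothetical, and the regex-based rename is a minimal sketch; real tooling would parse the code to avoid touching strings, comments, or unrelated tokens.

```python
import re

def rename_identifiers(c_source: str, mapping: dict[str, str]) -> str:
    """Rename identifiers in C source text, matching whole words only.

    A purely lexical sketch of the identifier-substitution carrier:
    only names change, so program behavior is preserved.
    """
    for old, new in mapping.items():
        c_source = re.sub(rf"\b{re.escape(old)}\b", new, c_source)
    return c_source

# Hypothetical vulnerable snippet (classic unbounded strcpy).
vulnerable = """
void copy_input(char *user_input) {
    char buf[16];
    strcpy(buf, user_input);  /* no bounds check */
}
"""

# Semantics-preserving edit: the overflow is unchanged, only names differ.
evasive = rename_identifiers(vulnerable, {
    "copy_input": "render_widget",
    "user_input": "style_token",
    "buf": "palette",
})
print(evasive)
```

The transformed snippet compiles to the same behavior as the original, which is exactly why a detector whose prediction flips under such an edit has failed on a semantically identical input.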
Key Contributions
- Carrier-constrained GCG: gradient-based adversarial optimization restricted to compile-valid code carriers (identifier substitution, preprocessor macros) that preserves program semantics while evading LLM vulnerability detectors
- Complete Resistance (CR) metric measuring the fraction of vulnerabilities that withstand all semantics-preserving transformations simultaneously, enabling joint robustness evaluation across attack carriers
- Empirical demonstration that high-performing detectors (70%+ clean accuracy) collapse under semantics-preserving edits and that universal adversarial strings transfer effectively from surrogate models to black-box APIs
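The Complete Resistance metric from the contributions above can be sketched as follows. The dictionary-of-carriers representation and all names here are illustrative assumptions, not the paper's implementation; the metric itself is simply the fraction of samples whose detection survives every carrier simultaneously.

```python
def complete_resistance(per_carrier_detected: dict[str, list[bool]]) -> float:
    """Fraction of vulnerable samples still detected under EVERY carrier.

    per_carrier_detected maps each attack carrier (e.g. "identifier_sub",
    "macro") to a per-sample list: True if the detector still flags the
    vulnerability after that transformation is applied.
    """
    carriers = list(per_carrier_detected.values())
    n = len(carriers[0])
    assert all(len(c) == n for c in carriers), "carriers must cover the same samples"
    # A sample counts only if it is flagged under all carriers at once.
    survived_all = sum(all(c[i] for c in carriers) for i in range(n))
    return survived_all / n

# Toy example: 4 samples, two carriers; samples 0 and 3 survive both.
detections = {
    "identifier_sub": [True, False, True, True],
    "macro":          [True, True, False, True],
}
print(complete_resistance(detections))  # → 0.5
```

Joint evaluation is the point: a detector can look robust against each carrier in isolation while its Complete Resistance is much lower, because different samples fail under different carriers.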
🛡️ Threat Analysis
The paper crafts adversarial inputs (semantics-preserving code transformations, including carrier-constrained GCG) that flip the predictions of LLM-based vulnerability detectors at inference time without altering program behavior, making this a direct adversarial evasion attack on a classifier.
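The carrier-constrained optimization can be illustrated with a toy greedy search. In place of GCG's gradient-guided token swaps, this sketch scores candidate identifier renamings against a stand-in keyword-based "detector" and keeps any swap that lowers its vulnerability score. The scorer, candidate pool, and greedy loop are all illustrative stand-ins for the paper's gradient-based method; only the constraint is the same, namely that every edit stays inside a semantics-preserving carrier.

```python
import re

def toy_detector_score(code: str) -> float:
    """Stand-in surrogate: counts suspicious-looking substrings.

    A real attack would use gradients or logits from an LLM detector.
    """
    suspicious = ["strcpy", "overflow", "buf", "exploit"]
    return sum(code.count(tok) for tok in suspicious)

def carrier_constrained_search(code: str, renameable: list[str],
                               candidates: list[str]) -> str:
    """Greedy search restricted to the identifier-substitution carrier.

    Semantics are preserved because every edit is a whole-word rename of
    an attacker-controlled identifier; keywords and library calls such as
    strcpy stay untouched.
    """
    best = code
    for ident in renameable:
        for cand in candidates:
            # Skip candidates already present, to avoid name collisions
            # that would change program semantics.
            if re.search(rf"\b{re.escape(cand)}\b", best):
                continue
            trial = re.sub(rf"\b{re.escape(ident)}\b", cand, best)
            if toy_detector_score(trial) < toy_detector_score(best):
                best = trial
                break  # keep the first improving rename for this identifier
    return best

source = "char buf[16]; strcpy(buf, overflow_data);"
evaded = carrier_constrained_search(
    source,
    renameable=["buf", "overflow_data"],  # identifiers the attacker controls
    candidates=["arr", "style_cache"],
)
print(evaded)
```

Even this crude search lowers the surrogate's score while leaving the actual unbounded `strcpy` in place, which mirrors the paper's finding that low-cost, behavior-preserving edits suffice to evade high-accuracy detectors.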