Published on arXiv

2602.11213

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

STAB achieves 80.1% average cross-dataset attack success rate, outperforming the best dynamic trigger-based attack by 12.4%, and maintains 73.2% ASR after code-specific defenses.

STAB (Sharpness-aware Transferable Adversarial Backdoor)

Novel technique introduced


Code models are increasingly adopted in software development but remain vulnerable to backdoor attacks via poisoned training data. Existing backdoor attacks on code models face a fundamental trade-off between transferability and stealthiness. Static trigger-based attacks insert fixed dead code patterns that transfer well across models and datasets but are easily detected by code-specific defenses. In contrast, dynamic trigger-based attacks adaptively generate context-aware triggers to evade detection but suffer from poor cross-dataset transferability. Moreover, they rely on the unrealistic assumption that poisoned and victim training data share an identical distribution, limiting their practicality. To overcome these limitations, we propose Sharpness-aware Transferable Adversarial Backdoor (STAB), a novel attack that achieves both transferability and stealthiness without requiring access to the victim's complete training data. STAB is motivated by the observation that adversarial perturbations in flat regions of the loss landscape transfer more effectively across datasets than those in sharp minima. To this end, we train a surrogate model using Sharpness-Aware Minimization to guide model parameters toward flat loss regions, and employ Gumbel-Softmax optimization to enable differentiable search over discrete trigger tokens for generating context-aware adversarial triggers. Experiments across three datasets and two code models show that STAB outperforms prior attacks in terms of transferability and stealthiness. It achieves a 73.2% average attack success rate after defense, outperforming static trigger-based attacks that fail under defense. STAB also surpasses the best dynamic trigger-based attack by 12.4% in cross-dataset attack success rate and maintains performance on clean inputs.
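The surrogate-training idea in the abstract can be sketched in a few lines. This is a generic Sharpness-Aware Minimization step (ascend to the worst-case point in a small L2 ball, then descend using the gradient taken there), not the paper's code; the toy quadratic loss, analytic gradient, and hyperparameters below are illustrative assumptions.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.02, rho=0.05):
    """One Sharpness-Aware Minimization step:
    1. take the gradient at w and ascend to the worst-case point
       within an L2 ball of radius rho;
    2. descend from w using the gradient computed at that point.
    This biases the iterates toward flat regions of the loss."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # worst-case perturbation
    g_sharp = grad_fn(w + eps)                   # gradient at perturbed weights
    return w - lr * g_sharp

# Toy loss L(w) = w0^2 + 10*w1^2 with an analytic gradient,
# standing in for the surrogate model's training loss.
loss = lambda w: w[0] ** 2 + 10 * w[1] ** 2
grad = lambda w: np.array([2 * w[0], 20 * w[1]])

w = np.array([1.0, 1.0])
for _ in range(100):
    w = sam_step(w, grad)
print(loss(w))  # small: the iterates settle near the minimum at the origin
```

In the attack setting, the same two-gradient step would be applied to the surrogate code model's loss over poisoned data; the intuition from the abstract is that triggers optimized against a flat-minimum surrogate transfer better across datasets.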


Key Contributions

  • STAB attack that achieves cross-dataset transferability without requiring access to the victim's training distribution by training a surrogate model on flat loss regions via Sharpness-Aware Minimization (SAM)
  • Gumbel-Softmax trigger optimization with Maximum Mean Discrepancy (MMD) constraints that enables differentiable joint search over discrete trigger tokens for syntactically valid, context-aware, stealthy code triggers
  • Demonstrated a 73.2% average attack success rate after defense and a 12.4% improvement over the best dynamic trigger-based attack in cross-dataset attack success rate, evaluated across three datasets and two code models
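The Gumbel-Softmax relaxation named in the second bullet can be illustrated compactly: Gumbel noise is added to learnable logits over a trigger-token vocabulary, and a temperature-scaled softmax yields a differentiable, nearly one-hot sample. This is a generic sketch, not the paper's implementation; the tiny identifier vocabulary and logit values are hypothetical, and the MMD constraint is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=0.5):
    """Relaxed categorical sample: perturb the logits with Gumbel(0,1)
    noise, then apply a temperature-scaled softmax. As tau -> 0 the
    output approaches one-hot, while remaining differentiable in the
    logits (so trigger tokens can be searched by gradient descent)."""
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())  # numerically stable softmax
    return y / y.sum()

# Hypothetical trigger-token vocabulary for identifier renaming.
vocab = ["tmp", "buf", "idx", "ctx"]
logits = np.array([0.1, 2.0, 0.2, 0.3])  # learnable trigger logits

probs = gumbel_softmax(logits, tau=0.5)   # soft sample used during optimization
token = vocab[int(np.argmax(probs))]      # hard token used at poisoning time
```

During optimization the soft `probs` vector would weight token embeddings so gradients flow back to `logits`; the hard `argmax` choice is only taken when emitting the final trigger.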

🛡️ Threat Analysis

Model Poisoning

STAB embeds hidden, targeted malicious behavior in code models that activates only when specific adversarially crafted identifier-renaming triggers are present, while maintaining normal behavior on clean inputs: the canonical backdoor/trojan threat. The Sharpness-Aware Minimization and Gumbel-Softmax optimizations are novel contributions to the backdoor-injection technique itself, not a form of general data degradation.


Details

Domains
nlp
Model Types
transformer
Threat Tags
training_time, targeted, black_box
Datasets
CodeSearchNet, BigCloneBench
Applications
code summarization, code vulnerability detection, code model fine-tuning