Delving into Adversarial Transferability on Image Classification: Review, Benchmark, and Evaluation
Xiaosen Wang 1,2, Zhijin Ge 1, Bohan Liu 1, Zheng Fang 3, Fengfan Zhou 1,2, Ruixuan Zhang 4, Shaokang Wang 5, Yuyang Luo 1,2
Published on arXiv
2602.23117
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Proposes the most comprehensive overview and standardized benchmark of transfer-based adversarial attacks to date, identifying cases where prior methods fail to outperform baselines under fair evaluation conditions.
TransferAttack
Novel technique introduced
Adversarial transferability refers to the capacity of adversarial examples crafted on a surrogate model to deceive other, unseen victim models. This property eliminates the need for direct access to the victim model during an attack, raising considerable security concerns in practical applications and attracting substantial research attention in recent years. In this work, we identify the lack of a standardized framework and criteria for evaluating transfer-based attacks, which can lead to biased assessments of existing approaches. To close this gap, we conduct an exhaustive review of hundreds of related works, organizing the various transfer-based attacks into six distinct categories. We then propose a comprehensive framework designed to serve as a benchmark for evaluating these attacks. In addition, we delineate common strategies that enhance adversarial transferability and highlight prevalent issues that can lead to unfair comparisons. Finally, we provide a brief review of transfer-based attacks beyond image classification.
Key Contributions
- Systematic categorization of 100+ transfer-based adversarial attacks into six classes: gradient-based, input transformation, advanced objective function, generation-based, model-related, and ensemble-based
- Unified evaluation framework (TransferAttack) for standardized and fair comparison of transfer-based attacks across both untargeted and targeted settings
- Identification of prevalent methodological issues causing unfair comparisons in existing transfer-based attack literature
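To make the first category concrete, the sketch below illustrates a gradient-based transfer attack in the style of MI-FGSM (momentum-accumulated iterative FGSM) on a toy linear surrogate, where the gradient of the score is analytic. Function and variable names (`mi_fgsm`, `w_surrogate`) are illustrative, not from the paper's benchmark code.

```python
import numpy as np

def mi_fgsm(x, w_surrogate, eps=0.1, steps=10, mu=1.0):
    """MI-FGSM-style sketch against a toy linear surrogate scoring w.x.

    Lowering the score (toward misclassification) means stepping
    against the gradient, which for a linear score is simply w.
    """
    alpha = eps / steps              # per-step budget
    g = np.zeros_like(x)             # accumulated momentum
    x_adv = x.copy()
    for _ in range(steps):
        grad = w_surrogate           # d(w.x)/dx for a linear score
        # momentum update with L1-normalized gradient (MI-FGSM style)
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)
        x_adv = x_adv - alpha * np.sign(g)        # descend the score
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay in the L-inf ball
    return x_adv
```

Transferability shows up when the same perturbation, built only from the surrogate's weights, also lowers the score of a distinct victim model whose weights the attacker never saw.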
🛡️ Threat Analysis
The entire paper focuses on transfer-based adversarial examples — crafting perturbations on surrogate models that transfer to fool black-box victim models at inference time. All six attack categories reviewed (gradient-based, input transformation, advanced objectives, generation-based, model-related, ensemble-based) are subtypes of adversarial input manipulation attacks.