Benchmarking Adversarial Robustness and Adversarial Training Strategies for Object Detection
Alexis Winter, Jean-Vincent Martini, Romaric Audigier, Angelique Loesch, Bertrand Luvison
Published on arXiv
2602.16494
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Adversarial attacks transfer poorly to transformer-based detectors, and adversarial training on a mix of spatially and semantically targeted high-perturbation attacks yields the most robust defense.
Object detection models are critical components of automated systems, such as autonomous vehicles and perception-based robots, but their sensitivity to adversarial attacks poses a serious security risk. Progress in defending these models lags behind classification, hindered by a lack of standardized evaluation. It is nearly impossible to thoroughly compare attack or defense methods, as existing work uses different datasets, inconsistent efficiency metrics, and varied measures of perturbation cost. This paper addresses this gap by investigating three key questions: (1) How can we create a fair benchmark to impartially compare attacks? (2) How well do modern attacks transfer across different architectures, especially from Convolutional Neural Networks to Vision Transformers? (3) What is the most effective adversarial training strategy for robust defense? To answer these, we first propose a unified benchmark framework focused on digital, non-patch-based attacks. This framework introduces specific metrics to disentangle localization and classification errors and evaluates attack cost using multiple perceptual metrics. Using this benchmark, we conduct extensive experiments on state-of-the-art attacks and a wide range of detectors. Our findings reveal two major conclusions: first, modern adversarial attacks against object detection models show a significant lack of transferability to transformer-based architectures. Second, we demonstrate that the most robust adversarial training strategy leverages a dataset composed of a mix of high-perturbation attacks with different objectives (e.g., spatial and semantic), which outperforms training on any single attack.
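To illustrate what disentangling localization and classification errors can mean in practice, here is a minimal Python sketch. It is not the paper's AP_loc or CSR definition, which this summary does not give. It assumes a simple greedy, class-agnostic IoU matching with a hypothetical threshold of 0.5; the names `iou` and `classify_errors` are illustrative only.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def classify_errors(
    gts: List[Tuple[Box, str]],
    preds: List[Tuple[Box, str]],
    iou_thr: float = 0.5,  # assumed threshold, not taken from the paper
) -> Dict[str, int]:
    """Count, per ground-truth object, whether it was found with the right
    class, found with the wrong class, or not localized at all."""
    counts = {"correct": 0, "misclassified": 0, "missed": 0}
    used = set()
    for g_box, g_label in gts:
        best_j, best_iou = None, iou_thr
        # Class-agnostic matching: the label is ignored while localizing.
        for j, (p_box, _p_label) in enumerate(preds):
            if j in used:
                continue
            overlap = iou(g_box, p_box)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            counts["missed"] += 1  # localization failure
        else:
            used.add(best_j)
            if preds[best_j][1] == g_label:
                counts["correct"] += 1
            else:
                counts["misclassified"] += 1  # classification failure
    return counts


gts = [((0, 0, 10, 10), "car"), ((20, 20, 30, 30), "person"), ((50, 50, 60, 60), "dog")]
preds = [((0, 0, 10, 10), "car"), ((20, 20, 30, 30), "dog")]
result = classify_errors(gts, preds)
```

Comparing these counts on clean and attacked images would indicate whether an attack mainly hides objects or mainly flips their labels, which is the distinction the benchmark's metrics are described as targeting.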
Key Contributions
- Unified benchmark framework for digital non-patch-based adversarial attacks on object detectors, introducing AP_loc and CSR metrics to disentangle localization vs. classification failures
- Cross-architectural transferability analysis revealing that modern adversarial attacks transfer poorly from CNN-based to transformer-based detectors (e.g., DINO)
- Empirical finding that adversarial training on a mix of high-perturbation attacks with complementary objectives (spatial + semantic) outperforms training on any single attack type
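
The mixed-attack training idea in the last contribution can be sketched as a data-assembly step. This is a rough illustration under stated assumptions, not the authors' pipeline: the attack names, the equal per-attack share, and the inclusion of clean samples are all hypothetical choices, and `build_mixed_pool` is an illustrative helper name.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def build_mixed_pool(
    clean: Sequence[str],
    adv_by_attack: Dict[str, Sequence[str]],
    n_adv_per_attack: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Combine clean samples with an equal share of adversarial samples
    from each attack, then shuffle. Returns (source, sample_id) pairs."""
    rng = random.Random(seed)
    # Assumption: clean samples are kept alongside adversarial ones.
    pool = [("clean", s) for s in clean]
    for attack_name, samples in adv_by_attack.items():
        k = min(n_adv_per_attack, len(samples))
        # Draw k samples without replacement from this attack's outputs.
        pool.extend((attack_name, s) for s in rng.sample(list(samples), k))
    rng.shuffle(pool)
    return pool


# Hypothetical sample identifiers standing in for precomputed images.
clean_ids = ["c0", "c1"]
adv_ids = {
    "spatial_attack": ["s0", "s1", "s2"],    # e.g. targets localization
    "semantic_attack": ["m0", "m1", "m2"],   # e.g. targets class labels
}
pool = build_mixed_pool(clean_ids, adv_ids, n_adv_per_attack=2)
source_counts = Counter(source for source, _ in pool)
```

Tagging each sample with its source makes it easy to check the mix ratio and to compare against a pool built from a single attack.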
🛡️ Threat Analysis
The paper benchmarks digital, non-patch-based adversarial attacks applied at inference time to object detectors and evaluates adversarial training as a defense. Its entire scope is input manipulation attacks and robustness to them.