Vision Transformers: the threat of realistic adversarial patches
Kasper Cools 1,2, Clara Maathuis 3, Alexander M. van Oers 4, Claudia S. Hübner 5, Nikos Deligiannis 2,6, Marijke Vandewal 1, Geert De Cubber 1
Published on arXiv (arXiv:2509.21084)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
CT-augmented adversarial patches crafted on a YOLOv5 CNN and transferred to four ViT models achieve attack success rates of 40.04%–99.97%, with facebook/dino-vitb16 being the most vulnerable and pre-training methodology strongly predicting resilience.
Novel technique introduced: Creases Transformation (CT) adversarial patches
The increasing reliance on machine learning systems has made their security a critical concern. Evasion attacks enable adversaries to manipulate the decision-making processes of AI systems, potentially causing security breaches or misclassification of targets. Vision Transformers (ViTs) have gained significant traction in modern machine learning due to 1) increased performance compared to Convolutional Neural Networks (CNNs) and 2) greater robustness against adversarial perturbations. However, ViTs remain vulnerable to evasion attacks, particularly adversarial patches: localized patterns designed to manipulate AI classification systems. These vulnerabilities are investigated by designing realistic adversarial patches that cause misclassification in person vs. non-person classification tasks using the Creases Transformation (CT) technique, which adds subtle geometric distortions similar to those occurring naturally when clothing is worn. This study investigates the transferability of adversarial attack techniques developed for CNNs to ViT classification models. Experimental evaluation across four fine-tuned ViT models on a binary person classification task reveals significant vulnerability variations: attack success rates ranged from 40.04% (google/vit-base-patch16-224-in21k) to 99.97% (facebook/dino-vitb16), with google/vit-base-patch16-224 at 66.40% and facebook/dinov3-vitb16 at 65.17%. These results confirm the cross-architectural transferability of adversarial patches from CNNs to ViTs, with pre-training dataset scale and methodology strongly influencing model resilience to adversarial attacks.
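The paper does not spell out the CT layer's exact parameterization here, but its core idea (warping the patch with smooth, fold-like displacement fields during optimization so the pattern survives fabric deformation) can be sketched as below. The function name `crease_warp`, the sinusoidal displacement model, and all parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def crease_warp(patch: np.ndarray, n_creases: int = 3,
                amplitude: float = 2.0, rng=None) -> np.ndarray:
    """Warp an (H, W, C) patch with smooth sinusoidal displacement
    fields, loosely mimicking fabric folds and stretching.
    Illustrative sketch only -- not the paper's CT layer."""
    rng = np.random.default_rng(rng)
    h, w = patch.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    dx = np.zeros((h, w))
    dy = np.zeros((h, w))
    for _ in range(n_creases):
        theta = rng.uniform(0, np.pi)        # crease orientation (assumed)
        freq = rng.uniform(0.02, 0.08)       # spatial frequency (assumed)
        phase = rng.uniform(0, 2 * np.pi)
        # displacement perpendicular to the crease line
        proj = xs * np.cos(theta) + ys * np.sin(theta)
        disp = amplitude * np.sin(2 * np.pi * freq * proj + phase)
        dx += disp * np.cos(theta)
        dy += disp * np.sin(theta)
    # nearest-neighbour resampling of the displaced grid
    src_y = np.clip(np.round(ys + dy), 0, h - 1).astype(int)
    src_x = np.clip(np.round(xs + dx), 0, w - 1).astype(int)
    return patch[src_y, src_x]
```

During patch optimization such a warp would be sampled fresh each iteration (expectation-over-transformation style), so the optimized pattern remains adversarial under a range of plausible clothing deformations.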
Key Contributions
- Introduces the Creases Transformation (CT) technique that adds physically plausible fabric folds and stretching distortions during adversarial patch optimization, improving physical realism
- Demonstrates cross-architectural transferability of adversarial patches originally crafted for CNNs (YOLOv5) to four fine-tuned ViT models on a binary person classification task
- Quantifies significant vulnerability variation across ViT pre-training strategies, with attack success rates ranging from 40.04% to 99.97%, identifying pre-training dataset scale and methodology as key resilience factors
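The attack success rates above can be read under a standard definition: the fraction of person images the model classifies correctly without the patch but misclassifies with it. This counting rule is an assumption for illustration; the authors' exact protocol may differ.

```python
def attack_success_rate(labels, clean_preds, patched_preds, person=1):
    """Assumed ASR definition: among images of the target class that
    the model got right on clean inputs, the fraction that flip to a
    wrong label once the adversarial patch is applied."""
    correct = [i for i, (y, p) in enumerate(zip(labels, clean_preds))
               if y == person and p == person]
    if not correct:
        return 0.0
    flipped = sum(1 for i in correct if patched_preds[i] != person)
    return flipped / len(correct)
```

For example, with ground truth `[1, 1, 1, 0]`, clean predictions `[1, 1, 0, 0]`, and patched predictions `[0, 1, 0, 0]`, two person images are correctly classified clean and one of them flips, giving an ASR of 0.5.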
🛡️ Threat Analysis
Proposes adversarial patches optimized with a Creases Transformation layer to cause misclassification at inference time across CNN and ViT architectures: a direct physical adversarial patch/evasion attack.