
Vision Transformers: the threat of realistic adversarial patches

Kasper Cools 1,2, Clara Maathuis 3, Alexander M. van Oers 4, Claudia S. Hübner 5, Nikos Deligiannis 2,6, Marijke Vandewal 1, Geert De Cubber 1

0 citations · 34 references · Security + Defence


Published on arXiv · 2509.21084

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

CT-augmented adversarial patches transferred from a YOLOv5 CNN achieve attack success rates of 40.04%–99.97% across four ViT models; facebook/dino-vitb16 is the most vulnerable, and pre-training methodology strongly predicts resilience.

Creases Transformation (CT) adversarial patches

Novel technique introduced


The increasing reliance on machine learning systems has made their security a critical concern. Evasion attacks let adversaries manipulate the decision-making of AI systems, potentially causing security breaches or misclassification of targets. Vision Transformers (ViTs) have gained significant traction in modern machine learning due to 1) increased performance compared to Convolutional Neural Networks (CNNs) and 2) greater robustness against adversarial perturbations. However, ViTs remain vulnerable to evasion attacks, particularly to adversarial patches: unique patterns designed to manipulate AI classification systems. These vulnerabilities are investigated by designing realistic adversarial patches that cause misclassification in person vs. non-person classification tasks, using the Creases Transformation (CT) technique, which adds subtle geometric distortions similar to those occurring naturally when clothing is worn. This study investigates how well adversarial attack techniques developed for CNNs transfer to ViT classification models. Experimental evaluation across four fine-tuned ViT models on a binary person classification task reveals significant vulnerability variation: attack success rates ranged from 40.04% (google/vit-base-patch16-224-in21k) to 99.97% (facebook/dino-vitb16), with google/vit-base-patch16-224 at 66.40% and facebook/dinov3-vitb16 at 65.17%. These results confirm the cross-architectural transferability of adversarial patches from CNNs to ViTs, with pre-training dataset scale and methodology strongly influencing model resilience to adversarial attacks.
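The paper does not publish its CT implementation, but the idea of applying crease-like geometric distortions during patch optimization can be sketched in a simplified form. The snippet below is a hypothetical stand-in (the function name `crease_transform` and the sinusoidal row-shift model are assumptions, not the authors' method): it warps a patch with smooth horizontal displacements that loosely mimic fabric folds and stretching.

```python
import numpy as np

def crease_transform(patch, n_creases=3, amplitude=2.0, seed=0):
    """Warp a patch with smooth sinusoidal row shifts that loosely
    mimic fabric folds (a simplified stand-in for the CT layer)."""
    rng = np.random.default_rng(seed)
    h, w = patch.shape[:2]
    out = np.empty_like(patch)
    # Random crease frequencies and phases decide where folds fall.
    freqs = rng.uniform(1.0, float(n_creases), size=n_creases)
    phases = rng.uniform(0.0, 2 * np.pi, size=n_creases)
    for y in range(h):
        # The horizontal shift varies smoothly with the row index,
        # approximating cloth stretching over a curved body.
        shift = sum(amplitude * np.sin(2 * np.pi * f * y / h + p)
                    for f, p in zip(freqs, phases))
        # Roll the row's pixels sideways by the (rounded) shift.
        out[y] = np.roll(patch[y], int(round(shift)), axis=0)
    return out

patch = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
warped = crease_transform(patch)
print(warped.shape)  # (32, 32, 3)
```

In an actual attack pipeline such a transform would be sampled randomly at each optimization step (in the spirit of Expectation over Transformation), so the resulting patch stays effective after the distortions that real clothing introduces.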


Key Contributions

  • Introduces the Creases Transformation (CT) technique that adds physically plausible fabric folds and stretching distortions during adversarial patch optimization, improving physical realism
  • Demonstrates cross-architectural transferability of adversarial patches originally crafted for CNNs (YOLOv5) to four fine-tuned ViT models on a binary person classification task
  • Quantifies significant vulnerability variation across ViT pre-training strategies, with attack success rates ranging from 40.04% to 99.97% and showing pre-training dataset scale as a key resilience factor
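The summary does not spell out how attack success rate (ASR) is computed; a common definition, sketched here as an assumption, counts the fraction of inputs from the non-target class that the patch flips to the attacker's target class.

```python
def attack_success_rate(clean_preds, patched_preds, target_class=0):
    """ASR: share of non-target inputs that the patch flips to the
    attacker's target class (a common definition, assumed here)."""
    flipped = sum(1 for c, p in zip(clean_preds, patched_preds)
                  if c != target_class and p == target_class)
    eligible = sum(1 for c in clean_preds if c != target_class)
    return flipped / eligible if eligible else 0.0

# Toy example: 1 = person, 0 = non-person (the attacker's target).
clean   = [1, 1, 1, 1, 0]
patched = [0, 0, 1, 0, 0]
print(f"{attack_success_rate(clean, patched):.2%}")  # 75.00%
```

Under this definition, the reported 40.04%–99.97% range means the same transferred patch fools anywhere from roughly two in five to nearly all person images, depending on the target ViT's pre-training.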

🛡️ Threat Analysis

Input Manipulation Attack

Proposes adversarial patches optimized with a Creases Transformation layer to cause misclassification at inference time across CNN and ViT architectures: a direct physical adversarial patch / evasion attack.


Details

Domains
vision
Model Types
transformer, cnn
Threat Tags
white_box, inference_time, targeted, physical
Datasets
binary person classification dataset (fine-tuned ViT evaluation set)
Applications
person detection, border security surveillance, body-worn camera systems