defense 2026

Unveiling the Backdoor Mechanism Hidden Behind Catastrophic Overfitting in Fast Adversarial Training

Mengnan Zhao 1, Lihe Zhang 2, Tianhang Zheng 3, Bo Wang 2, Baocai Yin 2

0 citations

Published on arXiv

2604.24350

Input Manipulation Attack

OWASP ML Top 10 — ML01

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Demonstrates that catastrophic overfitting (CO) exhibits pathway division and trigger-like behavior similar to backdoor attacks, and that the proposed backdoor-inspired mitigation strategies reduce CO

Weight Outlier Suppression

Novel technique introduced


Fast Adversarial Training (FAT) has attracted significant attention due to its efficiency in enhancing neural network robustness against adversarial attacks. However, FAT is prone to catastrophic overfitting (CO), wherein models overfit to the specific attack used during training and fail to generalize to others. While existing methods introduce diverse hypotheses and propose various strategies to mitigate CO, a systematic and intuitive explanation of CO remains absent. In this work, we interpret CO through the lens of backdoor attacks. Through validations on pathway division, diverse feature predictions, and universal class-distinguishable triggers in CO, we conceptualize CO as a weak-trigger variant of unlearnable tasks, unifying CO, backdoor attacks, and unlearnable tasks under a common theoretical framework. Guided by this, we leverage several backdoor-inspired strategies to mitigate CO: (i) recalibrate CO-affected model parameters using vanilla fine-tuning, linear probing, or reinitialization-based techniques; (ii) introduce a weight outlier suppression constraint to regulate abnormal deviations in model weights. Extensive experiments support our interpretation of CO and show the efficacy of the proposed mitigation strategies.
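
To make the training setup concrete, here is a minimal sketch of one FAT update. The abstract does not name the training attack, so the sketch assumes the single-step FGSM attack commonly used in fast adversarial training, and it uses a toy logistic-regression model in place of the CNNs studied in the paper. The function names and the `eps` and `lr` parameters are illustrative and not taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def fgsm_perturb(x, y, w, b, eps):
    """Single-step FGSM: shift each input coordinate by eps along the sign of the loss gradient."""
    p = sigmoid(logit(x, w, b))
    # For binary cross-entropy, dLoss/dx_i = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

def fat_step(x, y, w, b, eps, lr):
    """One SGD update computed on the FGSM example instead of the clean input."""
    x_adv = fgsm_perturb(x, y, w, b, eps)
    err = sigmoid(logit(x_adv, w, b)) - y  # dLoss/dlogit on the adversarial input
    w_new = [wi - lr * err * xi for wi, xi in zip(w, x_adv)]
    return w_new, b - lr * err
```

Because every update sees only this one attack, a model can in principle fit attack-specific patterns, which is the failure mode the abstract calls CO.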


Key Contributions

  • Novel interpretation of catastrophic overfitting as a backdoor-like trigger overfitting phenomenon, unifying CO, backdoor attacks, and unlearnable tasks under a common framework
  • Backdoor-inspired mitigation strategies, including recalibration techniques (vanilla fine-tuning, linear probing, reinitialization) and a weight outlier suppression constraint
  • Validation that adversarial perturbations in CO-affected models encode universal class-discriminative triggers similar to backdoor triggers
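
The three recalibration strategies named above differ mainly in which parameters are reset and which are updated. The sketch below encodes the conventional meaning of each term: vanilla fine-tuning updates every layer, linear probing updates only the classifier head, and reinitialization resets chosen layers before retraining. Which layers the paper actually reinitializes is not stated here, so the `reinit` argument, the layer names, and the function itself are hypothetical.

```python
def recalibration_plan(layer_names, strategy, head, reinit=()):
    """Return (layers_to_reinitialize, layers_to_train) for one strategy."""
    if head not in layer_names:
        raise ValueError(f"unknown head layer: {head}")
    if strategy == "fine_tune":
        # Vanilla fine-tuning: keep current weights, update every layer.
        return [], list(layer_names)
    if strategy == "linear_probe":
        # Linear probing: freeze the feature extractor, train only the head.
        return [], [head]
    if strategy == "reinit":
        # Reinitialization: reset the chosen layers, then train every layer.
        unknown = [name for name in reinit if name not in layer_names]
        if unknown:
            raise ValueError(f"unknown layers to reinitialize: {unknown}")
        return [name for name in layer_names if name in reinit], list(layer_names)
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, `recalibration_plan(["conv1", "conv2", "fc"], "linear_probe", head="fc")` selects only `fc` for training and resets nothing.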

🛡️ Threat Analysis

Input Manipulation Attack

Paper studies catastrophic overfitting in fast adversarial training (FAT), where models fail to generalize to adversarial attacks beyond the training attack — this is fundamentally about adversarial robustness and evasion attacks at inference time.
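
The failure described above suggests a simple monitoring heuristic: track robust accuracy under the training attack and under a held-out attack, and flag the epoch where the two diverge sharply. The sketch below is an illustration under that assumption, not a procedure from the paper. The function names and the `gap_threshold` default are hypothetical.

```python
def co_suspected(acc_train_attack, acc_heldout_attack, gap_threshold=0.3):
    """Flag possible CO when accuracy under the training attack far exceeds
    accuracy under a held-out attack (both given as fractions in [0, 1])."""
    return (acc_train_attack - acc_heldout_attack) > gap_threshold

def first_co_epoch(history, gap_threshold=0.3):
    """Return the first epoch index whose (train-attack, held-out) accuracy
    pair is flagged, or None if no epoch is flagged."""
    for epoch, (acc_train, acc_heldout) in enumerate(history):
        if co_suspected(acc_train, acc_heldout, gap_threshold):
            return epoch
    return None
```

In practice the held-out attack would typically be a stronger multi-step attack than the one used for training.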

Model Poisoning

Paper conceptualizes catastrophic overfitting as a weak-trigger variant of backdoor attacks, analyzing pathway division and trigger-like behavior, and proposes backdoor-inspired defenses including weight outlier suppression.
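
The summary does not give the form of the weight outlier suppression constraint. One plausible reading, sketched below as an assumption and not as the authors' formulation, is a penalty on weights that lie more than `k` standard deviations from the mean of their layer. The function name and the hyperparameter `k` are hypothetical.

```python
import statistics

def outlier_penalty(weights, k=3.0):
    """Sum of squared excess deviation beyond k standard deviations from the mean.

    Weights inside the band [mean - k*std, mean + k*std] contribute nothing.
    """
    mu = statistics.fmean(weights)
    sigma = statistics.pstdev(weights)
    bound = k * sigma
    return sum(max(0.0, abs(w - mu) - bound) ** 2 for w in weights)
```

Added to the training loss with a small coefficient, a term of this kind would leave typical weights untouched and push only extreme values back toward the bulk of the distribution.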


Details

Domains
vision
Model Types
cnn
Threat Tags
training_time, inference_time, white_box
Applications
image classification