DeepDefense: Layer-Wise Gradient-Feature Alignment for Building Robust Neural Networks
Ci Lin, Tet Yeap, Iluju Kiringa, Biwei Zhang
Published on arXiv (arXiv:2511.13749)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
CNN models trained with DeepDefense outperform standard adversarial training by up to 15.2% under APGD and 24.7% under FGSM on CIFAR-10, and require 20–30× larger perturbations to cause misclassification under DeepFool and EADEN.
DeepDefense (Gradient-Feature Alignment)
Novel technique introduced
Deep neural networks are known to be vulnerable to adversarial perturbations: small, carefully crafted changes to inputs that lead to incorrect predictions. In this paper, we propose DeepDefense, a novel defense framework that applies Gradient-Feature Alignment (GFA) regularization across multiple layers to suppress adversarial vulnerability. By aligning input gradients with internal feature representations, DeepDefense promotes a smoother loss landscape in tangential directions, thereby reducing the model's sensitivity to adversarial noise. We provide theoretical insights into how adversarial perturbations can be decomposed into radial and tangential components, and demonstrate that alignment suppresses loss variation in the tangential directions where most attacks are effective. Empirically, our method achieves significant improvements in robustness against both gradient-based and optimization-based attacks. For example, on CIFAR-10, CNN models trained with DeepDefense outperform standard adversarial training by up to 15.2% under APGD attacks and 24.7% under FGSM attacks. Against optimization-based attacks such as DeepFool and EADEN, DeepDefense requires 20 to 30 times higher perturbation magnitudes to cause misclassification, indicating stronger decision boundaries and a flatter loss landscape. Our approach is architecture-agnostic, simple to implement, and highly effective, offering a promising direction for improving the adversarial robustness of deep learning models.
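To make the radial/tangential intuition concrete, here is a minimal NumPy sketch of one plausible reading of the GFA penalty (the paper's exact formulation is not reproduced in this summary, so the toy two-layer model, the function name `gfa_penalty`, and the cosine-based form are illustrative assumptions): at a layer, measure how far the loss gradient with respect to the feature vector deviates from pointing along the feature itself, i.e. penalize the tangential component.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gfa_penalty(W1, W2, x, y_idx):
    """Hypothetical Gradient-Feature Alignment penalty on a toy linear net.

    Idea sketched: encourage the cross-entropy gradient at a layer's
    feature vector h to be 'radial' (parallel to h), so that the
    tangential component, which adversarial attacks exploit, shrinks.
    """
    h = W1 @ x                       # internal feature representation
    p = softmax(W2 @ h)              # class probabilities
    y = np.eye(len(p))[y_idx]        # one-hot label
    g = W2.T @ (p - y)               # dL/dh for cross-entropy loss
    cos = g @ h / (np.linalg.norm(g) * np.linalg.norm(h) + 1e-12)
    return 1.0 - cos                 # 0 when the gradient is fully radial

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)
pen = gfa_penalty(W1, W2, x, y_idx=0)  # value lies in [0, 2]
```

In a real training loop this penalty would be computed per layer via automatic differentiation and added to the task loss with a weight λ; the sketch only shows the alignment measurement itself.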
Key Contributions
- Gradient-Feature Alignment (GFA) regularization that aligns input gradients with internal feature representations to suppress adversarial vulnerability across all layers
- Theoretical decomposition of adversarial perturbations into radial and tangential components, showing GFA suppresses loss variation in the tangential directions most exploited by attacks
- Architecture-agnostic layer-wise defense that outperforms standard adversarial training by up to 15.2% under APGD and 24.7% under FGSM on CIFAR-10, and requires 20–30× higher perturbation magnitudes under optimization-based attacks
🛡️ Threat Analysis
Proposes a defense specifically against adversarial input perturbations at inference time (FGSM, APGD, DeepFool, EADEN), suppressing adversarial vulnerability through GFA regularization that flattens the loss landscape in tangential directions.
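For context on the threat model, FGSM is the simplest of the listed attacks: one gradient-sign step on the input. The sketch below is a generic FGSM on a logistic-regression toy model, not anything from the paper; the function name `fgsm` and the binary-cross-entropy gradient are illustrative assumptions.

```python
import numpy as np

def fgsm(x, w, b, y, eps):
    """One-step FGSM on a logistic-regression toy model: move the
    input by eps in the sign of the loss gradient w.r.t. the input."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))   # sigmoid prediction
    grad_x = (p - y) * w                     # dBCE/dx for this model
    return x + eps * np.sign(grad_x)

x = np.array([1.0, 0.0])
w = np.array([1.0, -1.0])
x_adv = fgsm(x, w, b=0.0, y=1.0, eps=0.1)
# the attack lowers the true-class score: w @ x_adv < w @ x
```

A defense like DeepDefense aims to make this gradient direction less damaging, so that a much larger `eps` (the paper reports 20–30× for optimization-based attacks) is needed to flip the prediction.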