defense 2026

Explanation-Guided Adversarial Training for Robust and Interpretable Models

Chao Chen 1, Yanhui Chen 2, Shanshan Lin 2, Dongsheng Hong 2, Shu Wu 3, Xiangwen Liao 2, Chuanyi Liu 1



Published on arXiv

2603.01938

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

EGAT achieves +37% adversarial accuracy over competitive baselines while producing more semantically meaningful explanations at only +16% additional training cost

EGAT (Explanation-Guided Adversarial Training)

Novel technique introduced


Deep neural networks (DNNs) have achieved remarkable performance on many tasks, yet they often behave as opaque black boxes. Explanation-guided learning (EGL) methods steer DNNs using human-provided explanations or supervision on model attributions. These approaches improve interpretability, but they typically assume benign inputs and incur heavy annotation costs. Moreover, both the predictions and the saliency maps of DNNs can change dramatically when facing imperceptible perturbations or unseen patterns. Adversarial training (AT) can substantially improve robustness, but it does not guarantee that model decisions rely on semantically meaningful features. In response, we propose Explanation-Guided Adversarial Training (EGAT), a unified framework that integrates the strengths of AT and EGL to simultaneously improve prediction performance, robustness, and explanation quality. EGAT generates adversarial examples on the fly while imposing explanation-based constraints on the model. By jointly optimizing classification performance, adversarial robustness, and attributional stability, EGAT is not only more resistant to unexpected cases, including adversarial attacks and out-of-distribution (OOD) scenarios, but also offers human-interpretable justifications for its decisions. We further formalize EGAT within the Probably Approximately Correct (PAC) learning framework, showing theoretically that it yields more stable predictions under unexpected situations than standard AT. Empirical evaluations on OOD benchmark datasets show that EGAT consistently outperforms competitive baselines in both clean and adversarial accuracy (+37%) while producing more semantically meaningful explanations and requiring only a limited increase (+16%) in training time.
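The abstract describes a joint objective that combines clean classification loss, loss on adversarial examples generated on the fly, and an explanation-stability constraint. The sketch below illustrates the general shape of such an objective on a toy logistic model. The FGSM-style perturbation, the input-times-gradient attribution, the loss weights `lam_adv` and `lam_exp`, and the finite-difference training loop are illustrative assumptions for this sketch, not the paper's actual formulation.

```python
# Toy sketch of an explanation-guided adversarial training objective.
# Assumptions (not from the paper): logistic model, FGSM perturbation,
# input-times-weight saliency, squared-difference stability penalty.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y):
    eps = 1e-12  # guard against log(0)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def fgsm(w, x, y, eps=0.1):
    # For the logistic model, dBCE/dx_i = (p - y) * w_i, so the FGSM
    # step moves each coordinate by eps in the sign of that gradient.
    p = predict(w, x)
    return [xi + eps * math.copysign(1.0, (p - y) * wi)
            for xi, wi in zip(x, w)]

def saliency(w, x):
    # Input-times-gradient attribution: the gradient of the logit
    # w.r.t. the input of a linear model is just w.
    return [wi * xi for wi, xi in zip(w, x)]

def egat_loss(w, x, y, lam_adv=1.0, lam_exp=0.5):
    # Clean loss + adversarial loss + attribution-stability penalty.
    x_adv = fgsm(w, x, y)
    clean = bce(predict(w, x), y)
    adv = bce(predict(w, x_adv), y)
    s, s_adv = saliency(w, x), saliency(w, x_adv)
    stab = sum((a - b) ** 2 for a, b in zip(s, s_adv))
    return clean + lam_adv * adv + lam_exp * stab

def train(data, steps=100, lr=0.1, h=1e-5):
    # Finite-difference gradient descent on the combined objective,
    # kept dependency-free for illustration only.
    w = [0.0, 0.0]
    for _ in range(steps):
        for x, y in data:
            grad = []
            for i in range(len(w)):
                wp, wm = list(w), list(w)
                wp[i] += h
                wm[i] -= h
                grad.append((egat_loss(wp, x, y) - egat_loss(wm, x, y)) / (2 * h))
            w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w
```

The stability term penalizes saliency drift between a clean input and its perturbed counterpart, which is one simple way to encode "attributional stability" as a differentiable regularizer alongside the adversarial loss.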


Key Contributions

  • EGAT: a unified training framework jointly optimizing classification performance, adversarial robustness, and attributional stability via explanation-based constraints during adversarial training
  • PAC learning theoretic formalization showing EGAT yields more stable predictions under adversarial and OOD inputs than standard adversarial training
  • Empirical demonstration of +37% adversarial accuracy over baselines on OOD benchmarks with only +16% training time overhead and semantically improved saliency maps

🛡️ Threat Analysis

Input Manipulation Attack

EGAT is a defense against input manipulation attacks: it augments adversarial training with explanation-based regularization to resist adversarial perturbations at inference time, achieving +37% adversarial accuracy over baselines.


Details

Domains
vision
Model Types
cnn, transformer
Threat Tags
white_box, training_time, inference_time, digital
Datasets
OOD benchmark datasets
Applications
image classification, medical imaging, out-of-distribution detection