Explainability-Guided Defense: Attribution-Aware Model Refinement Against Adversarial Data Attacks
Longwei Wang 1, Mohammad Navid Nayyem 1, Abdullah Al Rakin 1, KC Santosh 1, Chaowei Zhang 2, Yang Zhou 3
Published on arXiv
2601.00968
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
LIME-guided suppression of spurious features yields substantial improvements in adversarial robustness and out-of-distribution generalization on CIFAR-10, CIFAR-10-C, and CIFAR-100 without requiring additional data or model changes.
Attribution-Aware Model Refinement (LIME-guided adversarial training)
Novel technique introduced
The growing reliance on deep learning models in safety-critical domains such as healthcare and autonomous navigation underscores the need for defenses that are both robust to adversarial perturbations and transparent in their decision-making. In this paper, we identify a connection between interpretability and robustness that can be directly leveraged during training. Specifically, we observe that spurious, unstable, or semantically irrelevant features identified through Local Interpretable Model-Agnostic Explanations (LIME) contribute disproportionately to adversarial vulnerability. Building on this insight, we introduce an attribution-guided refinement framework that transforms LIME from a passive diagnostic into an active training signal. Our method systematically suppresses spurious features using feature masking, sensitivity-aware regularization, and adversarial augmentation in a closed-loop refinement pipeline. This approach does not require additional datasets or model architectures and integrates seamlessly into standard adversarial training. Theoretically, we derive an attribution-aware lower bound on adversarial distortion that formalizes the link between explanation alignment and robustness. Empirical evaluations on CIFAR-10, CIFAR-10-C, and CIFAR-100 demonstrate substantial improvements in adversarial robustness and out-of-distribution generalization.
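The core mechanism in the abstract — using attributions to flag spurious features and then suppressing them via feature masking — can be illustrated with a minimal toy sketch. This is not the paper's implementation: the authors use LIME on image classifiers, while the snippet below uses a linear model whose per-feature contributions (`w_i * x_i`) stand in for LIME's local surrogate weights. The function names `attribution_scores` and `mask_spurious` are hypothetical.

```python
import numpy as np

def attribution_scores(w, x):
    """Per-feature contribution w_i * x_i for a linear scorer w.x --
    a toy stand-in for LIME's local surrogate weights."""
    return w * x

def mask_spurious(x, scores, keep_frac=0.7):
    """The 'feature masking' step: zero out the features with the
    smallest |attribution|, keeping only the top keep_frac fraction."""
    k = max(1, int(keep_frac * x.size))
    keep = np.argsort(np.abs(scores))[-k:]
    masked = np.zeros_like(x)
    masked[keep] = x[keep]
    return masked

w = np.array([2.0, -0.1, 0.05, 1.5])   # toy model weights
x = np.array([1.0, 3.0, 2.0, -1.0])    # toy input
s = attribution_scores(w, x)
x_masked = mask_spurious(x, s, keep_frac=0.5)  # low-attribution features zeroed
```

In the paper's closed loop, the masked inputs then feed back into training so the model stops relying on the suppressed features.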
Key Contributions
- Identifies a formal link between LIME-based feature attributions and adversarial vulnerability, showing that spurious/unstable features drive susceptibility to adversarial perturbations
- Attribution-guided closed-loop refinement pipeline combining feature masking, sensitivity-aware regularization, and adversarial augmentation that integrates into standard adversarial training without requiring extra data or architectures
- Theoretical derivation of an attribution-aware lower bound on adversarial distortion grounded in local Lipschitz continuity and gradient attribution alignment
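The sensitivity-aware regularization mentioned in the second contribution can be sketched as a penalty on the model's input gradient restricted to attribution-flagged features. Again a hedged toy version, not the authors' code: the logistic model, the `sensitivity_penalty` name, and the hand-set spurious mask are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sensitivity_penalty(w, x, spurious_mask, lam=0.1):
    """Hypothetical sensitivity-aware regularizer: penalize the squared
    input-gradient magnitude on features flagged as spurious.
    For logistic p = sigmoid(w.x), the input gradient is p*(1-p)*w."""
    p = sigmoid(w @ x)
    grad_x = p * (1.0 - p) * w
    return lam * np.sum((grad_x * spurious_mask) ** 2)

w = np.array([2.0, -0.1, 0.05, 1.5])
x = np.array([1.0, 3.0, 2.0, -1.0])
mask = np.array([0.0, 1.0, 1.0, 0.0])  # features 1 and 2 flagged spurious
pen = sensitivity_penalty(w, x, mask)  # added to the training loss
```

Adding this term to the usual cross-entropy loss pushes the model to be locally flat along spurious directions, which is exactly the robustness property the attribution-aware distortion bound formalizes.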
🛡️ Threat Analysis
The paper directly defends against adversarial perturbations (input manipulation attacks) at inference time, proposing an attribution-guided refinement pipeline that reduces model reliance on the brittle features exploited by adversarial examples, such as those crafted with FGSM/PGD-style attacks.
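For reference, the FGSM attack the defense targets perturbs an input by a small step in the sign of the loss gradient. A minimal sketch on a toy logistic model (not the paper's CIFAR setup; `fgsm_perturb` is a hypothetical name):

```python
import numpy as np

def fgsm_perturb(w, x, y, eps=0.1):
    """Fast Gradient Sign Method on a logistic model: step the input
    by eps in the sign of the cross-entropy loss gradient.
    For p = sigmoid(w.x) and label y, dL/dx = (p - y) * w."""
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

w = np.array([2.0, -0.1, 0.05, 1.5])
x = np.array([1.0, 3.0, 2.0, -1.0])
x_adv = fgsm_perturb(w, x, y=1.0, eps=0.1)  # each feature shifted by +/- eps
```

PGD is the iterated, projected version of this same step; the paper's refinement pipeline aims to ensure that such small sign-gradient steps on spurious features no longer flip the model's prediction.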