
Explainable but Vulnerable: Adversarial Attacks on XAI Explanation in Cybersecurity Applications

Maraz Mia, Mir Mehedi A. Pritom

1 citation · 42 references · TPS-ISA


Published on arXiv · 2510.03623

Output Integrity Attack

OWASP ML Top 10 — ML09

Data Poisoning Attack

OWASP ML Top 10 — ML02

Key Finding

Six distinct adversarial attack procedures can effectively manipulate XAI explanations (SHAP, LIME, IG) in cybersecurity ML systems, demonstrating that trusted explanations can be silently corrupted while models maintain normal predictive accuracy.


Explainable Artificial Intelligence (XAI) has given machine learning (ML) researchers the power to scrutinize the decisions of black-box models. XAI methods enable looking deep inside a model's behavior, generating explanations along with a perceived sense of trust and transparency. However, the level of trust can vary depending on the specific XAI method. It is evident that XAI methods can themselves fall victim to post-hoc adversarial attacks that manipulate the expected output of the explanation module. Among such attack tactics, fairwashing explanation (FE), manipulation explanation (ME), and backdoor-enabled manipulation (BD) attacks are the notable ones. In this paper, we try to understand these adversarial attack techniques, tactics, and procedures (TTPs) for explanation alteration and, in turn, their effect on the model's decisions. We have explored a total of six distinct attack procedures against post-hoc explanation methods, namely SHAP (SHapley Additive exPlanations), LIME (Local Interpretable Model-agnostic Explanations), and IG (Integrated Gradients), and investigated these adversarial attacks in cybersecurity application scenarios such as phishing, malware, intrusion, and fraudulent website detection. Our experimental study reveals the real-world effectiveness of these attacks, underscoring the urgency of enhancing the resiliency of XAI methods and their applications.
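One of the simplest procedures in the paper's taxonomy is output shuffling: the attribution values returned by an explainer are permuted across features, so importance is silently reassigned to the wrong inputs while the model's prediction is left untouched. The sketch below is a minimal illustration in pure Python; the `explain` function, its feature names, and its scores are hypothetical stand-ins for a real SHAP/LIME/IG call, not the paper's implementation.

```python
import random

def explain(sample):
    # Hypothetical stand-in for a real explainer call
    # (e.g. a SHAP or LIME explanation of one phishing sample);
    # returns one attribution score per feature.
    return {"url_length": 0.42, "has_https": -0.10,
            "num_redirects": 0.31, "domain_age": -0.27}

def shuffle_attack(attributions, seed=0):
    """Output-shuffling attack: keep the attribution values but
    randomly reassign them to different features, corrupting the
    explanation without touching the model's prediction path."""
    names = list(attributions)
    values = list(attributions.values())
    random.Random(seed).shuffle(values)
    return dict(zip(names, values))

honest = explain({"url": "http://example.test"})
forged = shuffle_attack(honest)

# Same multiset of scores, so the forgery is hard to spot at a glance,
# yet the per-feature story the analyst reads is now wrong.
assert sorted(honest.values()) == sorted(forged.values())
```

Because the attack only rewires which score is attached to which feature, summary statistics of the explanation (totals, magnitudes) look unchanged, which is what makes this class of corruption hard to detect downstream.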


Key Contributions

  • Empirical evaluation of six adversarial attack procedures (output shuffling, OOD scaffolding, data poisoning, black-box, Makrut, biased-sampling) targeting SHAP, LIME, and IG explanation methods
  • Case studies applying XAI explanation attacks to four cybersecurity datasets: phishing, malware, intrusion, and fraudulent website detection
  • Mapping of XAI attacks to adversarial TTPs, documenting fairwashing, manipulation explanation, and backdoor-enabled manipulation attack categories

🛡️ Threat Analysis

Data Poisoning Attack

Case Study 3 explicitly implements a data poisoning attack on training data to corrupt XAI explanations, and the backdoor-enabled manipulation attacks (BD) involve training-time corruption, making ML02 a concrete secondary contribution.
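The training-time corruption described above can be sketched as a backdoor-style poisoning step: stamp a fixed trigger value into a fraction of training rows and force their label to the attacker's target, so any model (and any explanation derived from it) learns to associate the trigger with that label. This is an illustrative pure-Python sketch, not the paper's Case Study 3 code; the feature layout, trigger value, and poisoning rate are assumptions.

```python
import random

def poison(dataset, trigger_index, trigger_value, target_label, rate, seed=0):
    """Backdoor-style data poisoning: for a random `rate` fraction of
    (features, label) rows, overwrite one feature with a fixed trigger
    value and flip the label to the attacker's target."""
    rng = random.Random(seed)
    poisoned = []
    for features, label in dataset:
        if rng.random() < rate:
            features = list(features)          # copy before tampering
            features[trigger_index] = trigger_value
            label = target_label
        poisoned.append((features, label))
    return poisoned

# Tiny synthetic "phishing" set: (features, label), label 1 = phishing.
clean = [([0.9, 0.1], 1), ([0.8, 0.2], 1), ([0.1, 0.9], 0), ([0.2, 0.8], 0)]
dirty = poison(clean, trigger_index=1, trigger_value=99.0,
               target_label=0, rate=0.5)

assert len(dirty) == len(clean)
assert any(f[1] == 99.0 and y == 0 for f, y in dirty)
```

A model fit on `dirty` can still score well on clean validation data, which is exactly why the paper stresses that explanations can be corrupted while predictive accuracy appears normal.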

Output Integrity Attack

The primary focus is on attacks that manipulate ML model explanations after model computation — ML09 explicitly covers 'manipulating predictions/confidence/explanations AFTER model computation,' which maps directly to fairwashing and manipulation explanation attacks on SHAP, LIME, and IG.
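SHAP values satisfy a local-accuracy (additivity) property: the attributions for one sample sum to the model output minus the baseline. A fairwashing manipulation can exploit this by transferring a suspicious feature's attribution onto the remaining features after the explainer runs, so the forged explanation still passes the additivity check. The sketch below is a minimal illustration with hypothetical feature names and scores, not the paper's attack code.

```python
def fairwash(attributions, hide_feature):
    """Fairwashing sketch: zero out one feature's attribution and
    spread its mass evenly over the remaining features, preserving
    the total so the forged explanation still sums to the same
    model output (SHAP's local-accuracy property)."""
    hidden = attributions[hide_feature]
    others = [k for k in attributions if k != hide_feature]
    forged = dict(attributions)
    forged[hide_feature] = 0.0
    for k in others:
        forged[k] += hidden / len(others)
    return forged

# Hypothetical SHAP-style attributions for one flagged sample.
honest = {"ip_in_url": 0.50, "url_length": 0.20, "domain_age": -0.10}
forged = fairwash(honest, "ip_in_url")

# Additivity survives, but the tell-tale feature now looks innocent.
assert abs(sum(forged.values()) - sum(honest.values())) < 1e-9
assert forged["ip_in_url"] == 0.0
```

Because the manipulation happens after model computation and keeps the explanation internally consistent, it maps cleanly onto ML09's definition of tampering with outputs post hoc.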


Details

Domains
tabular
Model Types
traditional_ml
Threat Tags
black_box · training_time · inference_time · targeted
Applications
phishing detection · malware detection · intrusion detection · fraudulent website detection