benchmark 2026

GNN Explanations that do not Explain and How to find Them

Steve Azzolin 1, Stefano Teso 1, Bruno Lepri 2, Andrea Passerini 1, Sagar Malhotra 3

0 citations · 86 references · arXiv


Published on arXiv · 2601.20815

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

SE-GNNs can be manipulated to produce maliciously planted degenerate explanations that conceal sensitive attribute usage while maintaining predictive accuracy, and most existing faithfulness metrics fail to detect this; the proposed EST metric reliably identifies such unfaithful explanations.

SUFFCAUSE (EST)

Novel technique introduced


Explanations provided by Self-explainable Graph Neural Networks (SE-GNNs) are fundamental for understanding the model's inner workings and for identifying potential misuse of sensitive attributes. Although recent works have highlighted that these explanations can be suboptimal and potentially misleading, a characterization of their failure cases is unavailable. In this work, we identify a critical failure of SE-GNN explanations: explanations can be unambiguously unrelated to how the SE-GNNs infer labels. We show that, on the one hand, many SE-GNNs can achieve optimal true risk while producing these degenerate explanations, and on the other, most faithfulness metrics can fail to identify these failure modes. Our empirical analysis reveals that degenerate explanations can be maliciously planted (allowing an attacker to hide the use of sensitive attributes) and can also emerge naturally, highlighting the need for reliable auditing. To address this, we introduce a novel faithfulness metric that reliably marks degenerate explanations as unfaithful, in both malicious and natural settings. Our code is available in the supplemental.


Key Contributions

  • Theoretical characterization showing multiple SE-GNN architectures can achieve optimal loss while producing degenerate (unfaithful) explanations unrelated to their actual inference
  • Empirical demonstration that degenerate explanations can be maliciously planted to hide sensitive attribute usage, and that most existing faithfulness metrics fail to detect them
  • A novel faithfulness metric (EST/SUFFCAUSE) that reliably identifies degenerate explanations in both malicious and naturally occurring settings, along with a benchmark for evaluating faithfulness metrics
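To make the planting idea concrete, here is a minimal, hypothetical sketch of how an attacker could bias an explainer at training time: a penalty term pushes the explanation mask toward a fixed decoy subgraph that is unrelated to the features the predictor actually uses. All names (`planted_loss`, `decoy_mask`, `lam`) are illustrative, not the paper's implementation.

```python
# Hedged sketch of planting a degenerate explanation at training time.
# The attacker adds an alignment penalty that rewards explanation masks
# matching an attacker-chosen decoy, while the task loss keeps accuracy.

def planted_loss(task_loss, mask, decoy_mask, lam=1.0):
    """Combine the task loss with a penalty that is zero exactly when
    the explanation mask matches the decoy subgraph."""
    alignment = sum((m - d) ** 2 for m, d in zip(mask, decoy_mask))
    return task_loss + lam * alignment

# A mask matching the decoy incurs no penalty...
print(planted_loss(0.5, [1.0, 0.0], [1.0, 0.0]))  # 0.5
# ...while a faithful mask pointing elsewhere is penalized.
print(planted_loss(0.5, [0.0, 1.0], [1.0, 0.0]))  # 2.5
```

Because the penalty acts only on the explanation mask, the predictor can still reach low task loss, which is why accuracy alone does not reveal the manipulation.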

🛡️ Threat Analysis

Output Integrity Attack

The paper concerns the integrity of model outputs, here the explanations themselves. It shows that explanations can be maliciously manipulated at training time so that they are unrelated to the model's actual inference, concealing the use of protected attributes from auditors, and it proposes the EST/SUFFCAUSE faithfulness metric to detect such tampered outputs. The explanations are the model outputs whose integrity is being attacked and defended.
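The auditing idea can be illustrated with a sufficiency-style probe: if feeding the model only the explanation subgraph fails to reproduce the full-graph prediction, the explanation cannot be faithful. This is a hypothetical toy sketch, not the paper's EST/SUFFCAUSE metric; `toy_model` and `sufficiency_probe` are invented names.

```python
# Hedged sketch of a sufficiency-style faithfulness probe.
# A degenerate explanation highlights edges the model never uses,
# so the explanation subgraph alone fails to reproduce the prediction.

def toy_model(edges):
    """Toy stand-in for a GNN: predicts 1 iff edge (0, 1) is present."""
    return 1 if (0, 1) in edges else 0

def sufficiency_probe(model, edges, explanation):
    """Return True if the explanation subgraph alone reproduces the
    full-graph prediction (a necessary condition for faithfulness)."""
    full_pred = model(edges)
    sub_pred = model([e for e in edges if e in explanation])
    return sub_pred == full_pred

edges = [(0, 1), (1, 2), (2, 3)]

# Degenerate explanation: highlights an edge irrelevant to the model.
print(sufficiency_probe(toy_model, edges, {(2, 3)}))  # False: flagged

# Faithful explanation: contains the decisive edge (0, 1).
print(sufficiency_probe(toy_model, edges, {(0, 1)}))  # True
```

Probes of this shape catch the toy degenerate case above, but the paper's point is that many standard faithfulness metrics can still be fooled, motivating a more robust metric.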


Details

Domains
graph
Model Types
gnn
Threat Tags
training_time · targeted · white_box
Applications
graph classification · model auditing · explainable AI