benchmark 2026

GNN Explanations that do not Explain and How to find Them

Steve Azzolin 1, Stefano Teso 1, Bruno Lepri 2, Andrea Passerini 1, Sagar Malhotra 3

0 citations · 86 references · arXiv


Published on arXiv · 2601.20815

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

SE-GNNs can be manipulated to produce maliciously planted degenerate explanations that conceal sensitive attribute usage while maintaining predictive accuracy, and most existing faithfulness metrics fail to detect this; the proposed EST metric reliably identifies such unfaithful explanations.

SUFFCAUSE (EST)

Novel technique introduced


Explanations provided by Self-explainable Graph Neural Networks (SE-GNNs) are fundamental for understanding the model's inner workings and for identifying potential misuse of sensitive attributes. Although recent works have highlighted that these explanations can be suboptimal and potentially misleading, a characterization of their failure cases is unavailable. In this work, we identify a critical failure of SE-GNN explanations: explanations can be unambiguously unrelated to how the SE-GNNs infer labels. We show that, on the one hand, many SE-GNNs can achieve optimal true risk while producing these degenerate explanations, and on the other, most faithfulness metrics can fail to identify these failure modes. Our empirical analysis reveals that degenerate explanations can be maliciously planted (allowing an attacker to hide the use of sensitive attributes) and can also emerge naturally, highlighting the need for reliable auditing. To address this, we introduce a novel faithfulness metric that reliably marks degenerate explanations as unfaithful, in both malicious and natural settings. Our code is available in the supplemental.


Key Contributions

  • Theoretical characterization showing multiple SE-GNN architectures can achieve optimal loss while producing degenerate (unfaithful) explanations unrelated to their actual inference
  • Empirical demonstration that degenerate explanations can be maliciously planted to hide sensitive attribute usage, and that most existing faithfulness metrics fail to detect them
  • A novel faithfulness metric (EST/SUFFCAUSE) that reliably identifies degenerate explanations in both malicious and naturally occurring settings, along with a benchmark for evaluating faithfulness metrics
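To make the planting idea concrete, here is a minimal, hypothetical sketch of how an attacker could bias an explainer at training time: a penalty term pushes the explanation mask toward a fixed decoy subgraph that is unrelated to the features the predictor actually uses. All names (`planted_loss`, `decoy_mask`, `lam`) are illustrative, not the paper's implementation.

```python
# Hedged sketch of planting a degenerate explanation at training time.
# The attacker adds an alignment penalty that rewards explanation masks
# matching an attacker-chosen decoy, while the task loss keeps accuracy.

def planted_loss(task_loss, mask, decoy_mask, lam=1.0):
    """Combine the task loss with a penalty that is zero exactly when
    the explanation mask matches the decoy subgraph."""
    alignment = sum((m - d) ** 2 for m, d in zip(mask, decoy_mask))
    return task_loss + lam * alignment

# A mask matching the decoy incurs no penalty...
print(planted_loss(0.5, [1.0, 0.0], [1.0, 0.0]))  # 0.5
# ...while a faithful mask pointing elsewhere is penalized.
print(planted_loss(0.5, [0.0, 1.0], [1.0, 0.0]))  # 2.5
```

Because the penalty acts only on the explanation mask, the predictor can still reach low task loss, which is why accuracy alone does not reveal the manipulation.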

🛡️ Threat Analysis

Output Integrity Attack

The paper concerns the integrity of model outputs, here the explanations themselves. It shows that explanations can be maliciously manipulated at training time so that they are unrelated to the model's actual inference, concealing the use of protected attributes from auditors, and it proposes the EST/SUFFCAUSE faithfulness metric to detect such tampered outputs. The explanations are the model outputs whose integrity is being attacked and defended.
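The auditing idea can be illustrated with a sufficiency-style probe: if feeding the model only the explanation subgraph fails to reproduce the full-graph prediction, the explanation cannot be faithful. This is a hypothetical toy sketch, not the paper's EST/SUFFCAUSE metric; `toy_model` and `sufficiency_probe` are invented names.

```python
# Hedged sketch of a sufficiency-style faithfulness probe.
# A degenerate explanation highlights edges the model never uses,
# so the explanation subgraph alone fails to reproduce the prediction.

def toy_model(edges):
    """Toy stand-in for a GNN: predicts 1 iff edge (0, 1) is present."""
    return 1 if (0, 1) in edges else 0

def sufficiency_probe(model, edges, explanation):
    """Return True if the explanation subgraph alone reproduces the
    full-graph prediction (a necessary condition for faithfulness)."""
    full_pred = model(edges)
    sub_pred = model([e for e in edges if e in explanation])
    return sub_pred == full_pred

edges = [(0, 1), (1, 2), (2, 3)]

# Degenerate explanation: highlights an edge irrelevant to the model.
print(sufficiency_probe(toy_model, edges, {(2, 3)}))  # False: flagged

# Faithful explanation: contains the decisive edge (0, 1).
print(sufficiency_probe(toy_model, edges, {(0, 1)}))  # True
```

Probes of this shape catch the toy degenerate case above, but the paper's point is that many standard faithfulness metrics can still be fooled, motivating a more robust metric.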


Details

Domains
graph
Model Types
gnn
Threat Tags
training_time · targeted · white_box
Applications
graph classification · model auditing · explainable AI