Quantifying True Robustness: Synonymity-Weighted Similarity for Trustworthy XAI Evaluation
Published on arXiv
2501.01516
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Synonymity-weighted similarity metrics report substantially higher explanation similarity under synonym-based perturbations than standard metrics do, demonstrating that standard metrics systematically overestimate adversarial attack success against XAI systems.
Synonymity-Weighted Similarity
Novel technique introduced
Adversarial attacks challenge the reliability of Explainable AI (XAI) by altering explanations while the model's output remains unchanged. The success of these attacks on text-based XAI is often judged using standard information retrieval metrics. We argue these measures are poorly suited to evaluating trustworthiness, as they treat all word perturbations equally while ignoring synonymity, which can misrepresent an attack's true impact. To address this, we apply synonymity weighting, a method that amends these measures by incorporating the semantic similarity of the perturbed words. This yields more accurate vulnerability assessments and an important tool for gauging the robustness of AI systems. Our approach prevents the overestimation of attack success, leading to a more faithful understanding of an XAI system's true resilience against adversarial manipulation.
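To make the idea concrete, here is a minimal sketch of synonymity weighting applied to a simple overlap measure between two word-level explanations. The `synonymity` lookup table is a toy stand-in for a real semantic-similarity source (e.g., cosine similarity of word embeddings); the paper's exact formulation may differ.

```python
# Sketch: synonymity-weighted overlap between two explanations,
# each given as a list of the model's most important words.

# Toy synonymity table for illustration only; a real implementation
# would use word embeddings or a lexical resource.
SYNONYMITY = {
    frozenset({"movie", "film"}): 0.90,
    frozenset({"great", "excellent"}): 0.85,
}

def synonymity(w1: str, w2: str) -> float:
    """Semantic similarity in [0, 1]; 1.0 for identical words."""
    return 1.0 if w1 == w2 else SYNONYMITY.get(frozenset({w1, w2}), 0.0)

def weighted_overlap(expl_a: list[str], expl_b: list[str]) -> float:
    """Exact-match overlap scores a synonym swap as a full change;
    the weighted version instead credits each word in expl_a with
    its best synonym match in expl_b."""
    return sum(max(synonymity(a, b) for b in expl_b) for a in expl_a) / len(expl_a)

original  = ["great", "movie", "plot"]
perturbed = ["excellent", "film", "plot"]  # synonym-based perturbation

exact = len(set(original) & set(perturbed)) / len(original)
print(exact)                                  # 0.33 -> attack looks strong
print(weighted_overlap(original, perturbed))  # ~0.92 -> barely changed
```

Under the exact-match metric the perturbation looks like a successful attack; once synonymity is credited, the explanation is nearly unchanged.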
Key Contributions
- Demonstrates that standard information retrieval metrics (e.g., Kendall's τ) overestimate adversarial attack success on XAI by treating all word substitutions as equally impactful (see the sketch after this list)
- Proposes synonymity-weighted similarity: an amended evaluation metric that incorporates semantic similarity of perturbed words to produce more accurate robustness assessments
- Shows that accounting for synonymity prevents false-positive attack success measurements and gives a more faithful picture of XAI system resilience
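For rank-based measures such as Kendall's τ, the problem shows up as vanishing word overlap: a synonym swap removes a word from the shared vocabulary entirely, so the rankings appear incomparable. One way to incorporate synonymity, an illustrative choice rather than necessarily the paper's exact construction, is to align near-synonyms across the two explanations before computing τ:

```python
from itertools import combinations

# Toy synonymity source again; see the previous sketch.
SYN = {frozenset({"movie", "film"}): 0.90,
       frozenset({"great", "excellent"}): 0.85}

def syn(a: str, b: str) -> float:
    return 1.0 if a == b else SYN.get(frozenset({a, b}), 0.0)

def kendall_tau(rank_a: dict, rank_b: dict) -> float:
    """Plain Kendall's tau over the words both explanations rank
    (no tie handling, for brevity)."""
    shared = [w for w in rank_a if w in rank_b]
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 0.0  # no shared word pairs: maximal apparent dissimilarity
    conc = sum(1 if (rank_a[u] - rank_a[v]) * (rank_b[u] - rank_b[v]) > 0
               else -1 for u, v in pairs)
    return conc / len(pairs)

def align_synonyms(rank_a: dict, rank_b: dict, thresh: float = 0.8) -> dict:
    """Illustrative synonymity weighting: map each word of the
    perturbed explanation onto its closest counterpart in the
    original before scoring, so 'film' is compared to 'movie'."""
    aligned = {}
    for w, r in rank_b.items():
        best = max(rank_a, key=lambda a: syn(a, w))
        aligned[best if syn(best, w) >= thresh else w] = r
    return aligned

orig = {"great": 1, "movie": 2, "plot": 3}
pert = {"excellent": 1, "film": 2, "plot": 3}

print(kendall_tau(orig, pert))                        # 0.0: "attack succeeded"
print(kendall_tau(orig, align_synonyms(orig, pert)))  # 1.0: nothing changed
```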
🛡️ Threat Analysis
The paper centers on adversarial input perturbations (word substitutions) that manipulate an XAI system's explanations at inference time without changing the underlying model's prediction, a classic input manipulation attack scenario. Its contribution is a set of benchmark metrics that measure the true success of these attacks more faithfully, as sketched below.
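A rough harness for this threat model makes the metric's role explicit. All names here (`model`, `explain`, `similarity`) are assumed interfaces, e.g., a text classifier and a feature-attribution method such as LIME or SHAP, not a specific library API.

```python
from typing import Callable

def attack_succeeds(model: Callable, explain: Callable,
                    text: str, perturbed: str,
                    similarity: Callable, threshold: float = 0.5) -> bool:
    # Constraint 1: the prediction must be unchanged; otherwise this
    # is ordinary evasion, not an attack on the explanation.
    if model(text) != model(perturbed):
        return False
    # Constraint 2: the explanation must change "enough". The choice
    # of `similarity` decides the verdict here: a synonymity-weighted
    # metric reports higher similarity for synonym swaps, so fewer
    # perturbations are counted as successful attacks.
    return similarity(explain(text), explain(perturbed)) < threshold
```

Measured attack success rates are thus a property of the evaluation metric as much as of the attack itself, which is the paper's central point.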