Quantifying True Robustness: Synonymity-Weighted Similarity for Trustworthy XAI Evaluation
Published on arXiv
2501.01516
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Synonymity-weighted similarity metrics report substantially higher explanation similarity under synonym-based perturbations than standard metrics do, demonstrating that standard metrics systematically overestimate adversarial attack success against XAI systems.
Synonymity-Weighted Similarity
Novel technique introduced
Adversarial attacks challenge the reliability of Explainable AI (XAI) by altering explanations while the model's output remains unchanged. The success of these attacks on text-based XAI is often judged using standard information retrieval metrics. We argue these measures are poorly suited to evaluating trustworthiness, as they treat all word perturbations equally while ignoring synonymity, which can misrepresent an attack's true impact. To address this, we apply synonymity weighting, a method that amends these measures by incorporating the semantic similarity of the perturbed words. This yields more accurate vulnerability assessments and an important tool for gauging the robustness of AI systems. Our approach prevents the overestimation of attack success, leading to a more faithful understanding of an XAI system's true resilience against adversarial manipulation.
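To make the idea concrete, here is a minimal sketch of synonymity weighting applied to a simple overlap measure between two word-level explanations. The `synonymity` lookup table is a toy stand-in for a real semantic-similarity source (e.g., cosine similarity of word embeddings); the paper's exact formulation may differ.

```python
# Sketch: synonymity-weighted overlap between two explanations,
# each given as a list of the model's most important words.

# Toy synonymity table for illustration only; a real implementation
# would use word embeddings or a lexical resource.
SYNONYMITY = {
    frozenset({"movie", "film"}): 0.90,
    frozenset({"great", "excellent"}): 0.85,
}

def synonymity(w1: str, w2: str) -> float:
    """Semantic similarity in [0, 1]; 1.0 for identical words."""
    return 1.0 if w1 == w2 else SYNONYMITY.get(frozenset({w1, w2}), 0.0)

def weighted_overlap(expl_a: list[str], expl_b: list[str]) -> float:
    """Exact-match overlap scores a synonym swap as a full change;
    the weighted version instead credits each word in expl_a with
    its best synonym match in expl_b."""
    return sum(max(synonymity(a, b) for b in expl_b) for a in expl_a) / len(expl_a)

original  = ["great", "movie", "plot"]
perturbed = ["excellent", "film", "plot"]  # synonym-based perturbation

exact = len(set(original) & set(perturbed)) / len(original)
print(exact)                                  # 0.33 -> attack looks strong
print(weighted_overlap(original, perturbed))  # ~0.92 -> barely changed
```

Under the exact-match metric the perturbation looks like a successful attack; once synonymity is credited, the explanation is nearly unchanged.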
Key Contributions
- Demonstrates that standard information retrieval metrics (e.g., Kendall's τ) overestimate adversarial attack success on XAI by treating all word substitutions as equally impactful (see the sketch after this list)
- Proposes synonymity-weighted similarity: an amended evaluation metric that incorporates semantic similarity of perturbed words to produce more accurate robustness assessments
- Shows that accounting for synonymity prevents false-positive attack success measurements and gives a more faithful picture of XAI system resilience
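For rank-based measures such as Kendall's τ, the problem shows up as vanishing word overlap: a synonym swap removes a word from the shared vocabulary entirely, so the rankings appear incomparable. One way to incorporate synonymity, an illustrative choice rather than necessarily the paper's exact construction, is to align near-synonyms across the two explanations before computing τ:

```python
from itertools import combinations

# Toy synonymity source again; see the previous sketch.
SYN = {frozenset({"movie", "film"}): 0.90,
       frozenset({"great", "excellent"}): 0.85}

def syn(a: str, b: str) -> float:
    return 1.0 if a == b else SYN.get(frozenset({a, b}), 0.0)

def kendall_tau(rank_a: dict, rank_b: dict) -> float:
    """Plain Kendall's tau over the words both explanations rank
    (no tie handling, for brevity)."""
    shared = [w for w in rank_a if w in rank_b]
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 0.0  # no shared word pairs: maximal apparent dissimilarity
    conc = sum(1 if (rank_a[u] - rank_a[v]) * (rank_b[u] - rank_b[v]) > 0
               else -1 for u, v in pairs)
    return conc / len(pairs)

def align_synonyms(rank_a: dict, rank_b: dict, thresh: float = 0.8) -> dict:
    """Illustrative synonymity weighting: map each word of the
    perturbed explanation onto its closest counterpart in the
    original before scoring, so 'film' is compared to 'movie'."""
    aligned = {}
    for w, r in rank_b.items():
        best = max(rank_a, key=lambda a: syn(a, w))
        aligned[best if syn(best, w) >= thresh else w] = r
    return aligned

orig = {"great": 1, "movie": 2, "plot": 3}
pert = {"excellent": 1, "film": 2, "plot": 3}

print(kendall_tau(orig, pert))                        # 0.0: "attack succeeded"
print(kendall_tau(orig, align_synonyms(orig, pert)))  # 1.0: nothing changed
```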
🛡️ Threat Analysis
The paper centers on adversarial input perturbations (word substitutions) that manipulate an XAI system's explanations at inference time without changing the underlying model's prediction, a classic input manipulation attack scenario. Its contribution is a set of benchmark metrics that measure the true success of these attacks more faithfully, as sketched below.
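A rough harness for this threat model makes the metric's role explicit. All names here (`model`, `explain`, `similarity`) are assumed interfaces, e.g., a text classifier and a feature-attribution method such as LIME or SHAP, not a specific library API.

```python
from typing import Callable

def attack_succeeds(model: Callable, explain: Callable,
                    text: str, perturbed: str,
                    similarity: Callable, threshold: float = 0.5) -> bool:
    # Constraint 1: the prediction must be unchanged; otherwise this
    # is ordinary evasion, not an attack on the explanation.
    if model(text) != model(perturbed):
        return False
    # Constraint 2: the explanation must change "enough". The choice
    # of `similarity` decides the verdict here: a synonymity-weighted
    # metric reports higher similarity for synonym swaps, so fewer
    # perturbations are counted as successful attacks.
    return similarity(explain(text), explain(perturbed)) < threshold
```

Measured attack success rates are thus a property of the evaluation metric as much as of the attack itself, which is the paper's central point.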