
Quantifying True Robustness: Synonymity-Weighted Similarity for Trustworthy XAI Evaluation

Christopher Burger

1 citation · 16 references · arXiv (Cornell University)

Published on arXiv · 2501.01516

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Synonymity-weighted similarity metrics significantly increase measured explanation similarity for synonym-based perturbations compared to standard metrics, demonstrating that standard metrics systematically overestimate adversarial attack success on XAI systems.

Synonymity-Weighted Similarity

Novel technique introduced


Adversarial attacks challenge the reliability of Explainable AI (XAI) by altering explanations while the model's output remains unchanged. The success of these attacks on text-based XAI is often judged with standard information retrieval metrics. We argue these measures are poorly suited to evaluating trustworthiness, as they treat all word perturbations equally while ignoring synonymity, which can misrepresent an attack's true impact. To address this, we apply synonymity weighting, a method that amends these measures by incorporating the semantic similarity of perturbed words. This produces more accurate vulnerability assessments and provides an important tool for assessing the robustness of AI systems. Our approach prevents the overestimation of attack success, leading to a more faithful understanding of an XAI system's true resilience against adversarial manipulation.


Key Contributions

  • Demonstrates that standard information retrieval metrics (e.g., Kendall's τ) overestimate adversarial attack success on XAI by treating all word substitutions as equally impactful
  • Proposes synonymity-weighted similarity: an amended evaluation metric that incorporates semantic similarity of perturbed words to produce more accurate robustness assessments
  • Shows that accounting for synonymity prevents false-positive attack success measurements and gives a more faithful picture of XAI system resilience
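To make the contribution concrete, here is a minimal sketch of the idea (not the paper's exact formulation): a top-k explanation overlap score in which a substituted word earns partial credit proportional to its semantic similarity to the word it replaced, rather than counting as a total mismatch. The `synonymity` lookup table below is a hypothetical stand-in for an embedding-based synonymity score.

```python
# Sketch of synonymity-weighted explanation similarity.
# Assumption: explanations are ranked lists of the top-k important words,
# and synonymity scores come from some external source (e.g., cosine
# similarity of word embeddings) -- here a hand-made dict for illustration.

def weighted_overlap(original_top_k, perturbed_top_k, synonymity):
    """Position-wise overlap in [0, 1]; 1.0 means effectively identical."""
    credit = 0.0
    for orig, pert in zip(original_top_k, perturbed_top_k):
        if orig == pert:
            credit += 1.0  # exact match: full credit
        else:
            # near-synonym substitution: partial credit instead of zero
            credit += synonymity.get((orig, pert), 0.0)
    return credit / len(original_top_k)

# Hypothetical synonymity scores for the perturbed word pairs
sim = {("awful", "terrible"): 0.9, ("movie", "film"): 0.95}

original = ["awful", "acting", "movie"]
perturbed = ["terrible", "acting", "film"]

# An unweighted overlap counts both swaps as disagreement: 1/3 ≈ 0.33,
# suggesting a highly successful attack. The weighted score credits the
# synonym swaps: (0.9 + 1.0 + 0.95) / 3 = 0.95, revealing that the
# explanation barely changed in meaning.
print(round(weighted_overlap(original, perturbed, sim), 3))
```

The same weighting idea can be folded into rank-correlation measures such as Kendall's τ by scaling each discordance by the dissimilarity of the words involved.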

🛡️ Threat Analysis

Input Manipulation Attack

The paper is centered on adversarial input perturbations (word substitutions) that manipulate XAI model outputs (explanations) at inference time without changing the underlying model's prediction — a classic input manipulation attack scenario. The paper contributes better benchmark metrics for measuring the true success of these attacks.


Details

Domains
nlp
Model Types
transformer
Threat Tags
inference_time, digital
Applications
explainable ai, text-based model explanations