Benchmark · 2025

Towards Robust and Accurate Stability Estimation of Local Surrogate Models in Text-based Explainable AI

Christopher Burger 1, Charles Walter 1, Thai Le 2, Lingwei Chen 3

1 citation · 22 references · arXiv


Published on arXiv · 2501.02042

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Standard ranked-list similarity measures systematically overestimate XAI instability; incorporating synonymity between perturbed features corrects these erroneous stability assessments for local surrogate methods like LIME.

Synonymity-Weighted Similarity Measure

Novel technique introduced


Recent work has investigated adversarial attacks on explainable AI (XAI) in the NLP domain, focusing on the vulnerability of local surrogate methods such as LIME to small perturbations of a machine learning (ML) model's input. In such attacks, the generated explanation is manipulated while the meaning and structure of the original input remain similar under the ML model. These attacks are especially alarming when XAI is used as a basis for decision making (e.g., prescribing drugs based on AI medical predictors) or for legal action (e.g., legal disputes involving AI software). Although weaknesses across many XAI methods have been shown to exist, the reasons why remain largely unexplored. Central to this XAI manipulation is the similarity measure used to calculate how one explanation differs from another. A poor choice of similarity measure can lead to erroneous conclusions about the stability or adversarial robustness of an XAI method. This work therefore investigates a variety of similarity measures designed for text-based ranked lists, referenced in related work, to determine their comparative suitability. We find that many measures are overly sensitive, resulting in erroneous estimates of stability. We then propose a weighting scheme for text-based data that incorporates the synonymity between the features within an explanation, providing more accurate estimates of the actual weakness of XAI methods to adversarial examples.
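To make the synonymity-weighting idea concrete, here is a minimal sketch of a similarity measure over two ranked explanation feature lists (e.g., top-k tokens from LIME) that credits synonym pairs with partial overlap. The synonym table, the greedy best-match scheme, and the normalisation are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative synonym scores; a real system might derive these
# from embeddings or a lexical resource such as WordNet.
SYNONYMS = {
    ("movie", "film"): 0.9,
    ("terrible", "awful"): 0.85,
}


def syn_score(a: str, b: str) -> float:
    """1.0 for identical tokens, a table lookup for known synonym
    pairs (in either order), and 0.0 otherwise."""
    if a == b:
        return 1.0
    return SYNONYMS.get((a, b), SYNONYMS.get((b, a), 0.0))


def weighted_similarity(list_a, list_b):
    """Greedy best-match overlap: each feature in list_a is credited
    with its best synonymity match in list_b, normalised by the
    longer list's length."""
    remaining = list(list_b)
    total = 0.0
    for feat in list_a:
        if not remaining:
            break
        best = max(remaining, key=lambda x: syn_score(feat, x))
        s = syn_score(feat, best)
        if s > 0:
            remaining.remove(best)
            total += s
    return total / max(len(list_a), len(list_b))


# A synonym swap ("movie" -> "film") barely changes the weighted score,
# whereas an exact-match measure would penalise it as a full mismatch.
exact = weighted_similarity(["terrible", "movie", "plot"],
                            ["terrible", "movie", "plot"])
swapped = weighted_similarity(["terrible", "movie", "plot"],
                              ["terrible", "film", "plot"])
```

Under an exact-match set overlap, the swapped pair would score 2/3; the weighted version keeps it near 1.0, reflecting that the explanation is semantically almost unchanged.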


Key Contributions

  • Systematic evaluation of ranked-list similarity measures for assessing XAI explanation stability under adversarial perturbations
  • Finding that many standard similarity measures are overly sensitive, producing erroneous (overstated) instability estimates for XAI methods
  • A synonymity-based weighting scheme for text-based explanation features that yields more accurate adversarial robustness estimates for surrogate XAI models

🛡️ Threat Analysis

Input Manipulation Attack

The paper investigates adversarial input perturbations that manipulate XAI explanations (e.g., LIME outputs) while preserving the underlying ML model's prediction: an inference-time input manipulation attack targeting the explanation pipeline rather than the model itself. The paper's contribution is benchmarking and improving the similarity measures used to quantify the success of these attacks.
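The evaluation loop this threat model implies can be sketched as follows: perturb the input with a meaning-preserving synonym swap, re-explain, and compare the ranked feature lists with a standard exact-match measure. The explainer below is a toy stand-in (it ranks tokens by length) purely to keep the example self-contained; it is not LIME, and the measure shown is the synonym-blind baseline whose oversensitivity the paper critiques.

```python
def toy_explainer(text: str, k: int = 3):
    """Toy stand-in for a local surrogate explainer: ranks unique
    tokens by a fake importance score (token length), with an
    alphabetical tie-break for determinism."""
    tokens = set(text.split())
    return sorted(tokens, key=lambda t: (-len(t), t))[:k]


def topk_jaccard(a, b):
    """Standard (synonym-blind) similarity over top-k feature sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


original = "the acting was terrible and the plot absurd"
# Adversarial-style synonym perturbation that keeps meaning intact.
perturbed = original.replace("terrible", "horrible")

e1 = toy_explainer(original)    # ['terrible', 'absurd', 'acting']
e2 = toy_explainer(perturbed)   # ['horrible', 'absurd', 'acting']
score = topk_jaccard(e1, e2)    # 0.5
```

A single synonym swap halves the exact-match score even though the explanation is semantically unchanged, illustrating how standard measures can overstate instability and inflate attack success rates.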


Details

Domains
nlp
Model Types
traditional_ml
Threat Tags
inference_time · digital · black_box
Applications
explainable ai · text classification · medical ai decision support · legal ai