The research community has invested great effort in developing explanation methods that can shed light on the inner workings of neural networks. Despite the availability of precise and fast, model-specific solutions ("white-box" explanations), practitioners often opt for model-agnostic approaches ("black-box" explanations). In this paper, we show that users must not rely on the faithfulness of black-box explanations even if requests verifiably originate from the model in question. We present Makrut, a model-manipulation attack against the popular model-agnostic, black-box explanation method LIME. Makrut exploits the discrepancy between soft and hard labels to mount different attacks. We (a) elicit uninformative explanations for the entire model, (b) "fairwash" an unfair model, that is, we hide the decisive features in the explanation, and (c) cause a specific explanation upon the presence of a trigger pattern, implementing a neural backdoor. The feasibility of these attacks emphasizes the need for more trustworthy explanation methods.
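As a rough illustration of the gap between soft and hard labels mentioned above, the following toy sketch compares two classifiers that agree on every hard label but expose very different soft scores. It is not the Makrut attack from the paper, which manipulates the model itself. The functions `honest_probs`, `manipulated_probs`, and `surrogate_coefs` are hypothetical names introduced here, and the LIME-style local surrogate is simplified to an unweighted least-squares fit on Gaussian perturbations.

```python
import numpy as np

def honest_probs(X):
    # Hypothetical "honest" model: class-1 probability depends on feature 0 only.
    p1 = 1.0 / (1.0 + np.exp(-4.0 * X[:, 0]))
    return np.stack([1.0 - p1, p1], axis=1)

def manipulated_probs(X, eps=1e-3):
    # Toy stand-in for a manipulated model (an assumption, not the paper's method):
    # it keeps the honest model's hard label but flattens the soft scores to 0.5 +/- eps.
    hard = honest_probs(X).argmax(axis=1)
    p1 = np.where(hard == 1, 0.5 + eps, 0.5 - eps)
    return np.stack([1.0 - p1, p1], axis=1)

def surrogate_coefs(predict, X):
    # Simplified LIME-style surrogate: linear least-squares fit to the
    # class-1 soft score on the perturbed samples (no kernel weighting).
    A = np.hstack([X, np.ones((len(X), 1))])
    coefs, *_ = np.linalg.lstsq(A, predict(X)[:, 1], rcond=None)
    return coefs[:-1]  # drop the intercept

rng = np.random.default_rng(0)
x0 = np.array([0.2, 0.0, 0.0])            # instance to explain
X = x0 + 0.5 * rng.normal(size=(500, 3))  # local perturbations around x0

same_hard_labels = np.array_equal(
    honest_probs(X).argmax(axis=1), manipulated_probs(X).argmax(axis=1)
)
honest_coefs = surrogate_coefs(honest_probs, X)
manipulated_coefs = surrogate_coefs(manipulated_probs, X)

print("hard labels identical:", same_hard_labels)
print("honest surrogate coefficients:     ", honest_coefs)
print("manipulated surrogate coefficients:", manipulated_coefs)
```

In this toy setting, both models make the same decisions, yet the surrogate fitted to the flattened soft scores assigns almost no weight to any feature, which is the kind of uninformative explanation described in (a).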
For further details, please consult the conference publication. More research on the security of XAI is available at https://xaisec.org.
A detailed description of our work was presented at the 40th Annual Computer Security Applications Conference (ACSAC) in December 2024. If you would like to cite our work, please use the reference below:
@InProceedings{Hegde2024Model,
author = {Achyut Hegde and Maximilian Noppel and Christian Wressnegger},
booktitle = {Proc. of the 40th Annual Computer Security Applications Conference ({ACSAC})},
title = {Model-Manipulation Attacks Against Black-Box Explanations},
year = {2024},
month = dec,
day = {9.-13.}
}
A preprint of the paper is available here.