Model-Manipulation Attacks Against Black-Box Explanations

Abstract

The research community has invested great effort in developing explanation methods that can shed light on the inner workings of neural networks. Despite the availability of precise and fast model-specific solutions ("white-box" explanations), practitioners often opt for model-agnostic approaches ("black-box" explanations). In this paper, we show that users cannot rely on the faithfulness of black-box explanations, even if the queries verifiably originate from the model in question. We present Makrut, a model-manipulation attack against the popular model-agnostic, black-box explanation method LIME. Makrut exploits the discrepancy between soft and hard labels to mount different attacks. We (a) elicit uninformative explanations for the entire model, (b) "fairwash" an unfair model, that is, we hide the decisive features in the explanation, and (c) cause a specific explanation upon the presence of a trigger pattern, implementing a neural backdoor. The feasibility of these attacks emphasizes the need for more trustworthy explanation methods.
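
To give an intuition for the soft/hard-label discrepancy that Makrut exploits, the sketch below wraps a scikit-learn-style classifier so that its hard labels (and thus its accuracy under a label-based audit) remain unchanged, while the soft labels that LIME samples are flattened toward a uniform distribution. This is only an illustration of the underlying signal, not the attack itself: the `UninformativeWrapper` class and its `epsilon` parameter are hypothetical constructions for this example, whereas the actual Makrut attacks manipulate the model rather than wrapping it.

```python
import numpy as np


class UninformativeWrapper:
    """Hypothetical illustration of the soft/hard-label discrepancy.

    Hard labels (predict) are passed through unchanged, so the model's
    accuracy is unaffected. Soft labels (predict_proba), which LIME
    queries to fit its local surrogate, are flattened toward a uniform
    distribution that still argmaxes to the original class.
    """

    def __init__(self, model, epsilon=0.01):
        self.model = model
        self.epsilon = epsilon  # residual weight on the original class

    def predict(self, X):
        # Unchanged hard labels: the wrapped model classifies as before.
        return self.model.predict(X)

    def predict_proba(self, X):
        proba = self.model.predict_proba(X)
        n_classes = proba.shape[1]
        uniform = np.full_like(proba, 1.0 / n_classes)
        # Keep the argmax consistent with the hard label, but remove
        # almost all of the variation a local surrogate could pick up on.
        onehot = np.eye(n_classes)[proba.argmax(axis=1)]
        return (1.0 - self.epsilon) * uniform + self.epsilon * onehot
```

Passing the wrapper's `predict_proba` as the prediction function to `lime.lime_tabular.LimeTabularExplainer.explain_instance` then yields near-constant responses on LIME's perturbed samples, so the fitted local surrogate carries little information about the true decision boundary, mirroring the uninformative-explanation attack (a) above.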

For further details, please consult the conference publication or have a look at the short summary. More research on the security of XAI can be found at https://xaisec.org.

Proof-of-Concept Implementations

For the sake of reproducibility and to foster future research, we make our implementations of the Makrut attacks available at:
https://github.com/intellisec/xai-backdoors-makrut

Publication

A detailed description of our work was presented at the 40th Annual Computer Security Applications Conference (ACSAC) in December 2024. Moreover, a summary of our work was published in the proceedings of the 48th German Conference on Artificial Intelligence and presented in September 2025. If you would like to cite our work, please use the references provided below:

@InProceedings{Hegde2024Model,
author    = {Achyut Hegde and Maximilian Noppel and Christian Wressnegger},
booktitle = {Proc. of the 40th Annual Computer Security Applications Conference ({ACSAC})},
title     = {Model-Manipulation Attacks Against Black-Box Explanations},
year      = {2024},
month     = dec,
day       = {9.-13.}
}
@InProceedings{Hegde2025Makrut,
author    = {Achyut Hegde and Maximilian Noppel and Christian Wressnegger},
booktitle = {Proc. of the 48th German Conference on Artificial Intelligence},
title     = {Makrut Attacks Against Black-Box Explanations},
year      = {2025},
month     = sep,
day       = {16.-19.},
}
Preprints of the papers are available here and here.