Explanation-Aware Backdoors


Explainable machine learning holds great potential for analyzing and understanding learning-based systems. These methods can, however, be manipulated to present unfaithful explanations, giving rise to powerful and stealthy adversaries. In this paper, we demonstrate how to fully disguise the adversarial operation of a machine learning model. Similar to neural backdoors, we change the model’s prediction upon trigger presence but simultaneously fool an explanation method that is applied post-hoc for analysis. This enables an adversary to hide the presence of the trigger or to point the explanation to entirely different portions of the input, throwing a red herring. We analyze different manifestations of these explanation-aware backdoors for gradient- and propagation-based explanation methods in the image domain, before conducting a red-herring attack against malware classification.
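As a rough illustration of the idea described above, the following Python sketch shows how a prediction objective and an explanation objective could be combined into a single per-sample loss. It is not the implementation from the paper: the function names, the square corner patch used as a trigger, the mean-squared-error distance between relevance maps, and the weight `lam` are all assumptions made for this example, and the model and explanation method themselves are left out.

```python
import math


def apply_trigger(image, size=2, value=1.0):
    """Return a copy of a 2-D image with a square patch stamped into the
    bottom-right corner (a hypothetical trigger pattern)."""
    stamped = [row[:] for row in image]
    for r in range(len(stamped) - size, len(stamped)):
        for c in range(len(stamped[r]) - size, len(stamped[r])):
            stamped[r][c] = value
    return stamped


def cross_entropy(probs, label):
    """Negative log-likelihood of the desired label under the model output."""
    return -math.log(probs[label])


def explanation_distance(expl, reference):
    """Mean squared error between two relevance maps of equal shape
    (an assumed choice of dissimilarity measure)."""
    diffs = [
        (a - b) ** 2
        for row_a, row_b in zip(expl, reference)
        for a, b in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)


def explanation_aware_loss(probs, label, expl, desired_expl, lam=1.0):
    """Prediction loss plus a weighted explanation loss for one sample.

    Clean sample:     label is the true class, desired_expl is the
                      explanation of the unmodified model.
    Triggered sample: label is the attacker's target class, desired_expl
                      is the map the attacker wants the analyst to see.
    """
    return cross_entropy(probs, label) + lam * explanation_distance(expl, desired_expl)
```

Under these assumptions, hiding the trigger would correspond to choosing `desired_expl` as the explanation of the clean input, whereas a red-herring attack would choose a map that highlights an unrelated region. Fine-tuning would then minimize this loss over both clean and triggered samples, with `probs` and `expl` produced by the model and the explanation method under attack.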

For further details, please consult the conference publication.


Proof-of-Concept Implementations

For the sake of reproducibility and to foster future research, we make the implementations of our explanation-aware backdoors available at:


A detailed description of our work will be presented at the 44th IEEE Symposium on Security and Privacy (IEEE S&P 2023) in May 2023. If you would like to cite our work, please use the reference as provided below:

author =    {Maximilian Noppel and Lukas Peter and Christian Wressnegger},
title =     {Disguising Attacks with Explanation-Aware Backdoors},
booktitle = {Proc. of 44th IEEE Symposium on Security and Privacy (S&P)},
year =      2023,
month =     may
A preprint of the paper is available here and here (arXiv).


The presentation slides are available here.