Explainable Machine Learning


Semester: Winter 2022
Course type: Block Seminar
Lecturer: Jun.-Prof. Dr. Wressnegger
Audience: Informatik Master & Bachelor
Credits: 4 ECTS
Room: 148, Building 50.34
Language: English or German


This seminar is concerned with explainable machine learning in computer security. Learning-based systems are often difficult to interpret, and their decisions are opaque to practitioners. This lack of transparency is a considerable problem in computer security, as black-box learning systems are hard to audit and protect from attacks.

The module introduces students to the emerging field of explainable machine learning and teaches them to work with results from recent research. To this end, the students will read up on a sub-field, prepare a seminar report, and present their work to their colleagues at the end of the term.

The topics cover different aspects of the explainability of machine learning methods, with a particular focus on applications in computer security.


Schedule

Tue, 25 Oct, 11:30–13:00: Primer on academic writing, assignment of topics
Thu, 3 Nov: Arrange an appointment with your assistant
Mon, 7 Nov – Fri, 11 Nov: 1st individual meeting (first overview, table of contents)
Mon, 5 Dec – Fri, 9 Dec: 2nd individual meeting (feedback on the first draft of the report)
Thu, 22 Dec: Submit the final paper
Mon, 9 Jan: Submit reviews for fellow students
Thu, 12 Jan: End of the discussion phase
Fri, 13 Jan: Notification about paper acceptance/rejection
Fri, 27 Jan: Submit the camera-ready version of your paper
Fri, 17 Feb: Presentation at the final colloquium

Mailing List

News about the seminar, potential updates to the schedule, and additional material are distributed using a separate mailing list. Moreover, the list enables students to discuss topics of the seminar.

You can subscribe here.


Topics

Every student may choose one of the following topics. For each topic, we additionally provide recent top-tier publications that you should use as a starting point for your own research. For the seminar and your final report, you should not merely summarize these papers, but try to go beyond them and arrive at your own conclusions.

Moreover, all of these papers come with open-source implementations. Play around with these and include the lessons learned in your report.

  • Propagation-based Explanations

    Propagation-based explanations are generated by backpropagating relevance values from a network's output to its input. To this end, a variety of propagation rules has been proposed. The topic should also consider the individual properties satisfied by the different rules.

    • Sundararajan, Taly, and Yan, "Axiomatic Attribution for Deep Networks", ICML 2017
    • Montavon et al., "Layer-Wise Relevance Propagation: An Overview", in: "Explainable AI: Interpreting, Explaining and Visualizing Deep Learning", 2019
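
    As a concrete entry point, the path integral in Sundararajan et al.'s integrated gradients can be approximated with a simple Riemann sum. The sketch below does so for a toy logistic-regression model whose gradient is analytic; the model, weights, and zero baseline are illustrative assumptions, not code from the paper. The completeness axiom (attributions sum to f(x) - f(baseline)) can then be checked numerically.

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def model(x, w):
        # toy differentiable model: logistic regression on a fixed weight vector
        return sigmoid(w @ x)

    def grad_model(x, w):
        # analytic gradient of the model output with respect to the input x
        p = model(x, w)
        return p * (1.0 - p) * w

    def integrated_gradients(x, baseline, w, steps=200):
        # midpoint Riemann sum over the straight path from baseline to x
        alphas = (np.arange(steps) + 0.5) / steps
        grads = np.stack([grad_model(baseline + a * (x - baseline), w) for a in alphas])
        return (x - baseline) * grads.mean(axis=0)

    w = np.array([1.5, -2.0, 0.5])
    x = np.array([1.0, 0.5, 1.0])
    baseline = np.zeros_like(x)
    attr = integrated_gradients(x, baseline, w)
    # completeness axiom: attributions sum to f(x) - f(baseline)
    print(attr.sum(), model(x, w) - model(baseline, w))
    ```

    With enough steps, the two printed values agree closely, which is exactly the axiom the paper argues for.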

  • Generating Counterfactual/Contrastive Explanations

    For human beings, a good explanation is always contrastive. People do not only ask ‘Why A?’; they also ask ‘Why A rather than B?’. Toward human-centered AI, research on explainable AI has moved deeper into counterfactual and contrastive explanations, which focus on this alternative scenario B and, in particular, on how it can be generated.

    • Stepin et al., "A Survey of Contrastive and Counterfactual Explanation Generation Methods for Explainable Artificial Intelligence", IEEE Access 2021
    • Madumal et al., "Explainable Reinforcement Learning through a Causal Lens", AAAI 2020
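
    For a linear classifier, the closest counterfactual even has a closed form: the nearest point on the other side of the decision boundary lies along the weight vector. The sketch below is a minimal illustration under this assumption; the weights and input are made up for the example.

    ```python
    import numpy as np

    def predict(x, w, b):
        return 1 if w @ x + b > 0 else 0

    def counterfactual(x, w, b, margin=1e-3):
        # closest point (in L2 distance) just beyond the decision boundary:
        # move against the score along the direction of the weight vector
        score = w @ x + b
        step = -(score + np.sign(score) * margin) / (w @ w)
        return x + step * w

    w = np.array([2.0, -1.0])
    b = -0.5
    x = np.array([1.0, 0.5])          # classified as 1
    x_cf = counterfactual(x, w, b)    # minimally changed to flip the decision
    print(predict(x, w, b), predict(x_cf, w, b))  # → 1 0
    ```

    For non-linear models, the same idea is typically posed as an optimization problem that trades off flipping the prediction against staying close to the original input.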

  • Concept-based Explanations

    Concept-based explanation methods try to explain a model's decision in terms of human-understandable concepts instead of feature-importance values. This topic covers the identification of concepts and the visualization of the associated explanations.

    • Ghorbani et al., "Towards Automatic Concept-Based Explanations", NeurIPS 2019
    • Yeh et al., "On Completeness-Aware Concept-Based Explanations in Deep Neural Networks", NeurIPS 2020
    • Achtibat et al., "From 'Where' to 'What': Towards Human-Understandable Explanations through Concept Relevance Propagation", arXiv:2206.03208, 2022

  • Interactive Explanations

    Explanations between humans are naturally interactive: in a dialog, a human points to the part they want to understand better, and this process guides the explanation method.

    • Wexler et al., "The What-If Tool: Interactive Probing of Machine Learning Models", IEEE TVCG 2020
    • Sokol et al., "Glass-Box: Explaining AI Decisions with Counterfactual Statements through Conversation with a Voice-Enabled Virtual Assistant", IJCAI 2018

  • Measuring the Quality of Explanations

    A variety of explanation methods has been proposed in recent years. But which one is the best method for a given task? Clarifying this question is part of this topic.

    • Adebayo et al., "Sanity Checks for Saliency Maps", NeurIPS 2018
    • Arras, Osman, and Samek, "CLEVR-XAI: A Benchmark Dataset for the Ground Truth Evaluation of Neural Network Explanations", Information Fusion 2022
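
    One common family of automatic quality measures works by deletion: remove features in the order an explanation deems important and observe how quickly the model's output drops; a faithful ordering should remove evidence faster than a random one. The sketch below applies this idea to a toy logistic model with gradient-times-input attributions; all names and data are illustrative assumptions, not a benchmark from the papers.

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def deletion_curve(x, w, order):
        # zero out features in the given order, recording the model
        # output after each deletion
        x = x.copy()
        scores = [sigmoid(w @ x)]
        for i in order:
            x[i] = 0.0
            scores.append(sigmoid(w @ x))
        return np.array(scores)

    rng = np.random.default_rng(0)
    w = rng.normal(size=8)
    x = rng.normal(size=8)
    attr = w * x                      # gradient-times-input for a linear logit
    informative = np.argsort(-attr)   # most important features first
    random_order = rng.permutation(8)
    auc_inf = deletion_curve(x, w, informative).mean()
    auc_rnd = deletion_curve(x, w, random_order).mean()
    print(auc_inf <= auc_rnd)         # informative orderings delete evidence faster
    ```

    The mean of the deletion curve serves as a simple quality score here; lower means the explanation identified the decisive features earlier.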

  • Using User Studies to Evaluate Explanations

    This topic specifically considers user studies as a way to evaluate the quality of explanations. The report should discuss the limitations, problems, and results of such evaluations.

    • Hendricks et al., "Grounding Visual Explanations", ECCV 2018
    • Lucic, Haned, and de Rijke, "Why Does My Model Fail? Contrastive Local Explanations for Retail Forecasting", FAT* 2020

  • Information Leakage through Explanation

    State-of-the-art explanation methods often leak sensitive information about a model's parameters as well as its training data. This additional information can be abused by adversaries. How can we prevent information leakage through explanations while preserving explanation quality?

    • Milli et al., "Model Reconstruction from Model Explanations", FAT* 2019
    • Shokri et al., "On the Privacy Risks of Model Explanations", AIES 2021
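
    A minimal illustration of the problem: for a linear model that releases gradient-times-input explanations, a single query at a known input reveals the weights exactly. The setup below is a deliberately simplified toy, not the attack from the papers above.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    w_secret = rng.normal(size=5)   # the model owner's private weights

    def explain(x):
        # gradient-times-input explanation released to users of a linear model
        return w_secret * x

    # an adversary querying one explanation at a known, nonzero input
    # recovers the private weights exactly
    x = np.ones(5)
    w_stolen = explain(x) / x
    print(np.allclose(w_stolen, w_secret))  # → True
    ```

    Real models require more queries and yield approximate reconstructions, but the underlying tension between explanation fidelity and information leakage is the same.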

  • Attacking Explanations

    Similar to adversarial examples (which attack a model's prediction), adversaries can fool explanation methods. This includes producing useless or wrong explanations, or a specific, attacker-chosen explanation.

    • Zhang et al., "Interpretable Deep Learning under Fire", USENIX Security 2020
    • Dombrowski et al., "Explanations Can Be Manipulated and Geometry Is to Blame", NeurIPS 2019
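
    A simple way to see that gradient-based explanations are fragile: adding a high-frequency, low-amplitude term to a model barely changes its outputs but can change its input gradients, and thus its saliency explanation, drastically. The sketch below is a toy construction in that spirit, not the attack from either paper.

    ```python
    import numpy as np

    w = np.array([1.0, 2.0])

    def f(x):
        return w @ x

    def f_attacked(x, eps=1e-4, freq=1e4):
        # output changes by at most eps, but the gradient w.r.t. x[0]
        # gains a term of magnitude up to eps * freq = 1
        return w @ x + eps * np.sin(freq * x[0])

    def grad(fn, x, h=1e-7):
        # central-difference numerical gradient
        g = np.zeros_like(x)
        for i in range(len(x)):
            d = np.zeros_like(x)
            d[i] = h
            g[i] = (fn(x + d) - fn(x - d)) / (2 * h)
        return g

    x = np.array([0.0, 1.0])
    print(abs(f_attacked(x) - f(x)))        # outputs (nearly) identical
    print(grad(f, x), grad(f_attacked, x))  # explanations differ markedly
    ```

    The same trade-off between output fidelity and gradient distortion underlies more realistic manipulation attacks on neural networks.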