Two Sides of the Same Coin: Learning the Backdoor to Remove the Backdoor

Abstract

The community has recently developed various training-time defenses to counter neural backdoors introduced through data poisoning. Based on the observation that a model learns the poisonous samples responsible for the backdoor more easily than benign samples, these approaches either use a fixed training-loss threshold for splitting the data or iteratively learn a reference model as an oracle for identifying benign samples. The latter, in particular, has proven effective for anti-backdoor learning.

Our method, HARVEY, leverages a similar yet crucially different technique: learning an oracle for poisonous rather than benign samples. Learning a backdoored reference model is significantly easier than learning a reference model on benign data. Consequently, we can identify poisonous samples much more accurately than related work identifies benign ones. This crucial difference enables near-perfect backdoor removal, as we demonstrate in our evaluation: HARVEY substantially outperforms related approaches across attack types, datasets, and architectures, lowering the attack success rate to near zero at a negligible loss in natural accuracy. The figure below shows an overview of our method's working principle.
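To illustrate the core idea, the following is a minimal sketch of loss-based sample splitting: samples that a backdoored reference model fits with unusually low loss are flagged as poisonous and excluded from subsequent clean training. The function name, the fixed poison fraction, and the use of a simple rank-based cutoff are illustrative assumptions for this sketch, not the exact procedure from the paper.

```python
import numpy as np

def split_by_oracle_loss(losses, poison_frac=0.1):
    """Split a dataset by per-sample loss under a backdoored reference model.

    Hypothetical helper for illustration: since a backdoored model learns
    poisonous samples particularly easily, the lowest-loss samples are
    flagged as likely poisonous.

    losses      -- per-sample training losses under the reference model
    poison_frac -- assumed fraction of poisonous samples in the data

    Returns the indices flagged as poisonous and a boolean mask selecting
    the remaining (presumed benign) samples.
    """
    losses = np.asarray(losses)
    k = int(len(losses) * poison_frac)
    # Rank samples by loss; the k lowest-loss ones are flagged as poisonous.
    poison_idx = np.argsort(losses)[:k]
    benign_mask = np.ones(len(losses), dtype=bool)
    benign_mask[poison_idx] = False
    return poison_idx, benign_mask
```

In practice, the per-sample losses would come from evaluating the poisoned training set under the backdoored reference model; the benign mask then selects the data used to train the final, clean model.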

For further details, please consult the conference publication.

Team

Proof-of-Concept Implementations

For the sake of reproducibility and to foster future research, we make the implementation of HARVEY for removing backdoors at training time publicly available at:

https://github.com/intellisec/harvey

Publication

A detailed description of our work will be presented at the Annual AAAI Conference on Artificial Intelligence (AAAI 2025) in March 2025. If you would like to cite our work, please use the reference provided below:

@InProceedings{Zhao2025Two,
author    = {Qi Zhao and Christian Wressnegger},
booktitle = {Proc. of the Annual {AAAI} Conference on Artificial Intelligence ({AAAI})},
title     = {Two Sides of the Same Coin: Learning the Backdoor to Remove the Backdoor},
year      = {2025},
month     = mar,
}

A preprint of the paper is available here.