The community has recently developed various training-time defenses against neural backdoors introduced through data poisoning. Building on the observation that a model learns the poisonous samples responsible for the backdoor more easily than benign samples, these approaches either split the training data using a fixed threshold on the training loss or iteratively learn a reference model that serves as an oracle for identifying benign samples. The latter, in particular, has proven effective for anti-backdoor learning.
Our method, HARVEY, leverages a similar yet crucially different technique: it learns an oracle for poisonous rather than benign samples. Learning a backdoored reference model is significantly easier than learning a reference model on benign data only. Consequently, HARVEY identifies poisonous samples much more accurately than related work identifies benign ones. This crucial difference enables near-perfect backdoor removal, as we demonstrate in our evaluation: HARVEY substantially outperforms related approaches across attack types, datasets, and architectures, lowering the attack success rate to a minimum at negligible loss in natural accuracy. The figure below shows an overview of our method's working principle.
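To illustrate the core idea, consider the following minimal sketch of loss-based splitting. It assumes per-sample losses have already been computed under a backdoored reference model; the function name `split_by_reference_loss` and the threshold are illustrative and not taken from the HARVEY implementation.

```python
def split_by_reference_loss(losses, threshold):
    """Split sample indices by their loss under a backdoored reference model.

    Poisoned samples are learned quickly by the backdoored model, so they
    exhibit markedly lower loss; we flag those as suspected-poisonous.
    """
    poisonous = [i for i, loss in enumerate(losses) if loss < threshold]
    benign = [i for i, loss in enumerate(losses) if loss >= threshold]
    return poisonous, benign


# Toy example: samples 0 and 1 have very low loss under the reference
# model, suggesting they carry the backdoor trigger.
losses = [0.02, 0.05, 1.3, 0.9, 1.1]
poisonous, benign = split_by_reference_loss(losses, threshold=0.5)
print(poisonous)  # → [0, 1]
print(benign)     # → [2, 3, 4]
```

The suspected-poisonous subset can then be excluded from (or used to unlearn during) training of the final, clean model.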
For further details, please consult the conference publication.
For the sake of reproducibility and to foster future research, we make our implementation of HARVEY for removing backdoors at training time publicly available at:
https://github.com/intellisec/harvey
A detailed description of our work will be presented at the Annual AAAI Conference on Artificial Intelligence (AAAI 2025) in March 2025. If you would like to cite our work, please use the reference below:
@InProceedings{Zhao2025Two,
author = {Qi Zhao and Christian Wressnegger},
booktitle = {Proc. of the Annual {AAAI} Conference on Artificial Intelligence ({AAAI})},
title = {Two Sides of the Same Coin: Learning the Backdoor to Remove the Backdoor},
year = {2025},
month = mar,
}
A preprint of the paper is available here.