
Post-hoc explanation is the problem of explaining how a machine learning model – whose internal logic is hidden from the end-user and generally complex – produces its outcomes. Current approaches for solving this problem include model explanations and outcome explanations. While these techniques can be beneficial by providing interpretability, two fundamental threats stand in the way of their deployment in real-world applications: the risk of fairwashing, which targets the trustworthiness of post-hoc explanation techniques, and the risk of model extraction, which jeopardizes their privacy guarantees. Fairwashing is an explanation manipulation attack in which the adversary leverages post-hoc explanation techniques to give the impression that a black-box model exhibits some desirable behaviour (e.g., no discrimination) even when this is not the case. In a model extraction attack, the adversary exploits post-hoc explanation techniques to steal a faithful copy of a black-box model.

On the one hand, we show that fairwashing is a real threat to existing post-hoc explanation techniques. In particular, we demonstrate that it is possible to systematically rationalize decisions taken by an unfair black-box model using the model explanation and the outcome explanation approaches, for any fairness metric. We illustrate this risk with LaundryML: a regularized rule list enumeration algorithm whose objective is to search for fair rule lists approximating unfair black-box models with high fidelity. We also demonstrate that such explanation attacks can generalize beyond explanation instances and transfer across black-box models.
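To make the fairwashing objective concrete, here is a minimal, self-contained Python sketch. It is not the actual LaundryML algorithm (which enumerates rule lists): the toy black box, the candidate rules, and all names are hypothetical. The idea it illustrates is the regularized search for a surrogate that mimics an unfair black box with high fidelity while looking fair under a metric such as demographic parity.

```python
import random

random.seed(0)

# Toy "unfair" black box: always favours the group with s == 1 (hypothetical).
def black_box(x, s):
    return 1 if (x > 0.5 or s == 1) else 0

# Audit set: feature x in [0, 1], sensitive attribute s in {0, 1}.
data = [(random.random(), random.randint(0, 1)) for _ in range(500)]
labels = [black_box(x, s) for x, s in data]
groups = [s for _, s in data]

def dp_gap(preds, groups):
    # Demographic parity gap: difference in positive-prediction rates per group.
    rate = lambda g: sum(p for p, gr in zip(preds, groups) if gr == g) / groups.count(g)
    return abs(rate(0) - rate(1))

bb_gap = dp_gap(labels, groups)  # large: the black box discriminates

# Candidate "explanations": simple threshold rules on x that ignore s entirely.
scored = []
for t in [i / 20 for i in range(1, 20)]:
    preds = [1 if x > t else 0 for x, _ in data]
    fidelity = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    scored.append((fidelity, dp_gap(preds, groups), t))

# Regularized selection: maximize fidelity while penalizing the fairness gap.
fidelity, gap, t = max(scored, key=lambda r: r[0] - 5.0 * r[1])
print(f"black-box gap={bb_gap:.2f}, surrogate fidelity={fidelity:.2f}, gap={gap:.2f}")
```

The selected rule closely mimics the discriminatory black box yet exhibits a much smaller fairness gap, which is precisely what makes the rationalization deceptive.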

On the other hand, we demonstrate that a malicious adversary can leverage post-hoc explanations to devise high-accuracy and high-fidelity surrogate models that mimic the black-box that is being explained. In particular, we show that counterfactual-based post-hoc explanations allow the adversary to reach such a goal with low query budgets compared to traditional model extraction attacks.
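The intuition behind counterfactual-based extraction can be shown with a toy one-dimensional sketch. The assumptions here are hypothetical and much simpler than the paper's setting: a threshold black box, and a counterfactual API that returns the closest input with the opposite prediction. The key point is that each query yields a second labeled point sitting on the decision boundary, so very few queries suffice to recover the model.

```python
import random

random.seed(1)

THRESHOLD = 0.37  # the black box's secret decision boundary (toy 1-D model)

def black_box(x):
    return int(x >= THRESHOLD)

def counterfactual(x):
    # What a counterfactual-explanation API would return: the closest
    # input with the opposite prediction.
    eps = 1e-6
    return THRESHOLD if x < THRESHOLD else THRESHOLD - eps

# Attack: each query yields TWO labeled points, one lying on the boundary.
dataset = []
for q in [random.random() for _ in range(5)]:
    dataset.append((q, black_box(q)))
    cf = counterfactual(q)
    dataset.append((cf, black_box(cf)))

# Surrogate: threshold halfway between the largest negative point and the
# smallest positive point in the attacker's dataset.
neg = max(x for x, y in dataset if y == 0)
pos = min(x for x, y in dataset if y == 1)
surrogate_threshold = (neg + pos) / 2
print(abs(surrogate_threshold - THRESHOLD))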

In this talk, we will first present these two challenges that can potentially undermine the development of reliable post-hoc explanation techniques. Then, we will discuss potential countermeasures.

This talk has been given at the following institutions:

Magnet research group at INRIA Lille. Thank you Marc Tommasi for the invitation.

AI4LIFE research group at Harvard University. Thank you Hima Lakkaraju for the invitation.

CID research group at Heudiasyc. Thank you Sebastien Destercke for the invitation.

*Fairwahing: the risk of rationalization*(Aïvodji et al. 2019)*Characterizing the risk of fairwashing*(Aı̈vodji et al. 2021)*Model extraction from counterfactual explanations*(Aı̈vodji, Bolot, and Gambs 2020)

Aı̈vodji, Ulrich, Hiromi Arai, Sébastien Gambs, and Satoshi Hara. 2021. “Characterizing the Risk of Fairwashing.” *arXiv Preprint arXiv:2106.07504*.

Aı̈vodji, Ulrich, Alexandre Bolot, and Sébastien Gambs. 2020. “Model Extraction from Counterfactual Explanations.” *arXiv Preprint arXiv:2009.01884*.

Aïvodji, Ulrich, Hiromi Arai, Olivier Fortineau, Sébastien Gambs, Satoshi Hara, and Alain Tapp. 2019. “Fairwashing: The Risk of Rationalization.” In *International Conference on Machine Learning*, 161–70. PMLR.

For attribution, please cite this work as

Aïvodji (2021, March 18). Ulrich Aïvodji: Fairwashing and model extraction: two challenges for XAI. Retrieved from https://aivodji.github.io/talks/2021-03-18-fairwashingandmodelextractioninxai/

BibTeX citation

@misc{aïvodji2021fairwashing, author = {Aïvodji, Ulrich}, title = {Ulrich Aïvodji: Fairwashing and model extraction: two challenges for XAI}, url = {https://aivodji.github.io/talks/2021-03-18-fairwashingandmodelextractioninxai/}, year = {2021} }