Abstract
As the adoption of explainable AI (XAI) continues to expand, the urgency to address its privacy implications intensifies. Despite a growing corpus of research on AI privacy and explainability, little attention has been paid to privacy-preserving model explanations. This article presents the first thorough survey of privacy attacks on model explanations and their countermeasures. We contribute a comprehensive analysis of the literature, organized by a connected taxonomy that categorizes privacy attacks and countermeasures according to the explanations they target. This work also includes an initial investigation into the causes of privacy leakage. Finally, we discuss unresolved issues and prospective research directions uncovered in our analysis. This survey aims to be a valuable resource for the research community and offers clear guidance for newcomers to the domain. To support ongoing research, we have established an online resource repository, which will be continuously updated with new and relevant findings.
Change history
07 July 2025
The missing funding note has been added to this article.
Acknowledgements
This work was supported by an ARC Discovery Early Career Researcher Award (Grant No. DE200101465) and an ARC Discovery Project (Grant No. DP240101108).
Funding
Open Access funding enabled and organized by CAUL and its Member Institutions.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Nguyen, T.T., Huynh, T.T., Ren, Z. et al. Privacy-preserving explainable AI: a survey. Sci. China Inf. Sci. 68, 111101 (2025). https://doi.org/10.1007/s11432-024-4123-4