Research Article

Should Fairness be a Metric or a Model? A Model-based Framework for Assessing Bias in Machine Learning Pipelines

Published: 22 March 2024

Abstract

Fairness measurement is crucial for assessing algorithmic bias in various types of machine learning (ML) models, including ones used for search relevance, recommendation, personalization, talent analytics, and natural language processing. However, the fairness measurement paradigm is currently dominated by fairness metrics that examine disparities in allocation and/or prediction error as univariate key performance indicators (KPIs) for a protected attribute or group. Although important and effective for assessing ML bias in certain contexts such as recidivism, existing metrics do not work well in many real-world applications of ML that are characterized by imperfect models applied to an array of instances encompassing a multivariate mixture of protected attributes and situated within a broader process pipeline. Consequently, the upstream representational harm quantified by existing metrics, based on how a model represents protected groups, does not necessarily relate to the allocational harm that arises when such models are applied in downstream policy and decision contexts. We propose FAIR-Frame, a model-based framework for parsimoniously modeling fairness across multiple protected attributes with respect to the representational and allocational harm associated with the upstream design/development and downstream usage of ML models. We evaluate the efficacy of the proposed framework on two testbeds pertaining to text classification using pretrained language models. The upstream testbeds encompass over fifty thousand documents associated with twenty-eight thousand users, seven protected attributes, and five different classification tasks. The downstream testbeds span three policy outcomes and over 5.41 million total observations. Results in comparison with several existing metrics show that the upstream representational harm measures produced by FAIR-Frame and other metrics differ significantly from one another, and that FAIR-Frame's representational fairness measures have the highest percentage alignment and lowest error with respect to the allocational harm observed in downstream applications. Our findings have important implications for various ML contexts, including information retrieval, user modeling, digital platforms, and text classification, where responsible and trustworthy AI is becoming an imperative.
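The abstract contrasts the prevailing metric paradigm, in which disparities in allocation or prediction error are computed as univariate KPIs for one protected attribute at a time, with a model-based treatment of multiple protected attributes. The sketch below is illustrative only: it is not the paper's FAIR-Frame implementation, and the synthetic data, attribute names, and coefficients are hypothetical. It computes two standard univariate fairness KPIs and then fits a simple joint error model to show what a multivariate, model-based alternative can look like.

```python
# Minimal sketch (NOT the paper's FAIR-Frame implementation): contrasts
# univariate fairness KPIs with a joint, model-based view of prediction error
# across multiple protected attributes. Data and attribute names are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical protected attributes (binary indicators) and model outputs.
gender = rng.integers(0, 2, n)
race = rng.integers(0, 2, n)
y_true = rng.integers(0, 2, n)
y_pred = (rng.random(n) < 0.5 + 0.08 * gender - 0.05 * race).astype(int)

def demographic_parity_gap(y_pred, group):
    """Univariate KPI: gap in positive prediction rates for one attribute."""
    return abs(y_pred[group == 1].mean() - y_pred[group == 0].mean())

def error_rate_gap(y_true, y_pred, group):
    """Univariate KPI: gap in misclassification rates for one attribute."""
    err = (y_true != y_pred).astype(float)
    return abs(err[group == 1].mean() - err[group == 0].mean())

# Metric paradigm: one KPI per protected attribute, examined in isolation.
print("Demographic parity gap (gender):", demographic_parity_gap(y_pred, gender))
print("Error rate gap (race):", error_rate_gap(y_true, y_pred, race))

# Model-based alternative (in the spirit of, but not identical to, FAIR-Frame):
# regress prediction error on all protected attributes at once, so each
# coefficient estimates an attribute's association with error while holding
# the other attributes fixed.
err = (y_true != y_pred).astype(float)
X = np.column_stack([np.ones(n), gender, race])   # intercept + attributes
coef, *_ = np.linalg.lstsq(X, err, rcond=None)    # ordinary least squares
print("Joint error model coefficients [intercept, gender, race]:", coef)
```

The univariate KPIs summarize each attribute separately and can conflate overlapping group memberships, whereas the joint regression coefficients describe each attribute's association with error conditional on the others, which is the kind of parsimonious multivariate summary a model-based framework is after.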


Published in
ACM Transactions on Information Systems, Volume 42, Issue 4 (July 2024), 751 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/3613639

        Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 22 March 2024
        • Online AM: 23 January 2024
        • Accepted: 3 January 2024
        • Revised: 20 November 2023
        • Received: 30 March 2023
Published in ACM Transactions on Information Systems, Volume 42, Issue 4


        Qualifiers

        • research-article
