A comparative study of methods for estimating model-agnostic Shapley value explanations
Data Mining and Knowledge Discovery (IF 4.8), Pub Date: 2024-03-29, DOI: 10.1007/s10618-024-01016-z
Lars Henry Berge Olsen, Ingrid Kristine Glad, Martin Jullum, Kjersti Aas

Shapley values originated in cooperative game theory but are extensively used today as a model-agnostic explanation framework for explaining predictions made by complex machine learning models in industry and academia. There are several algorithmic approaches for computing different versions of Shapley value explanations. Here, we consider Shapley values that incorporate feature dependencies, referred to as conditional Shapley values, for predictive models fitted to tabular data. Estimating precise conditional Shapley values is difficult, as they require the estimation of non-trivial conditional expectations. In this article, we develop new methods, extend earlier proposed approaches, and systematize the new refined and existing methods into different method classes for comparison and evaluation. The method classes use either Monte Carlo integration or regression to model the conditional expectations. We conduct extensive simulation studies to evaluate how precisely the different method classes estimate the conditional expectations, and thereby the conditional Shapley values, for different setups. We also apply the methods to several real-world data experiments and provide recommendations for when to use the different method classes and approaches. Roughly speaking, we recommend using parametric methods when the data distribution can be specified almost correctly, as they generally produce the most accurate Shapley value explanations. When the distribution is unknown, both generative methods and regression models with a form similar to that of the underlying predictive model are good and stable options. Regression-based methods are often slow to train but produce the Shapley value explanations quickly once trained; the opposite holds for Monte Carlo-based methods, making the different methods appropriate in different practical situations.
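For orientation, the conditional Shapley value of feature j for a prediction f(x*) is commonly written as follows (standard notation, not quoted from the paper):

\phi_j(x^*) = \sum_{S \subseteq \mathcal{M} \setminus \{j\}} \frac{|S|!\,(|\mathcal{M}|-|S|-1)!}{|\mathcal{M}|!} \bigl[ v(S \cup \{j\}) - v(S) \bigr], \qquad v(S) = \mathbb{E}\bigl[ f(x) \mid x_S = x^*_S \bigr],

where \mathcal{M} is the set of all features and x_S denotes the features in coalition S. The two method classes compared in the paper estimate v(S) differently: Monte Carlo methods average f over samples drawn from the conditional distribution of the remaining features, while regression methods fit a model g_S(x_S) \approx \mathbb{E}[f(x) \mid x_S] directly. A minimal Python sketch of the two strategies is given below; sample_conditional and the random-forest regressor are placeholder choices for illustration, not the estimators studied in the paper.

from sklearn.ensemble import RandomForestRegressor

def v_monte_carlo(f, x_star, S, sample_conditional, K=1000):
    # Draw K copies of x with x_S fixed at x_star[S] and the remaining features
    # sampled from (an estimate of) their conditional distribution, then average f.
    draws = sample_conditional(x_star, S, K)   # array of shape (K, p)
    return f(draws).mean()

def fit_v_regression(f, X_train, S):
    # Regress f(x) on x_S over the training data, so that g_S(x_star[S])
    # approximates E[f(x) | x_S = x_star[S]] without sampling at explanation time.
    g_S = RandomForestRegressor().fit(X_train[:, S], f(X_train))
    return lambda x_star: g_S.predict(x_star[S].reshape(1, -1))[0]

In practice, the Monte Carlo estimator is cheap to set up but must sample anew for every coalition and explained instance, whereas the regression estimator pays an upfront training cost and is then fast to evaluate, which mirrors the trade-off described in the abstract.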



Updated: 2024-03-30