Conditional feature importance for mixed data, AStA Advances in Statistical Analysis



Conditional feature importance for mixed data
AStA Advances in Statistical Analysis (IF 1.4), Pub Date: 2023-04-29, DOI: 10.1007/s10182-023-00477-9
Kristin Blesch, David S. Watson, Marvin N. Wright

Despite the popularity of feature importance (FI) measures in interpretable machine learning, the statistical adequacy of these methods is rarely discussed. From a statistical perspective, a major distinction is between analysing a variable’s importance before and after adjusting for covariates, i.e., between marginal and conditional measures. Our work draws attention to this rarely acknowledged, yet crucial distinction and showcases its implications. We find that few methods are available for testing conditional FI, and practitioners have hitherto been severely restricted in applying them because of mismatched data requirements. Most real-world data exhibits complex feature dependencies and incorporates both continuous and categorical features (i.e., mixed data). Both properties are often neglected by conditional FI measures. To fill this gap, we propose to combine the conditional predictive impact (CPI) framework with sequential knockoff sampling. The CPI enables conditional FI measurement that controls for any feature dependencies by sampling valid knockoffs (synthetic data with similar statistical properties) for the data to be analysed. Sequential knockoffs were deliberately designed to handle mixed data and thus allow us to extend the CPI approach to such datasets. We demonstrate through numerous simulations and a real-world example that our proposed workflow controls type I error, achieves high power, and is in line with results given by other conditional FI measures, whereas marginal FI metrics can result in misleading interpretations. Our findings highlight the necessity of developing statistically adequate, specialized methods for mixed data.
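
The contrast between marginal and conditional importance can be illustrated with a small simulation. The sketch below is not the authors' implementation, and it makes several simplifying assumptions: it uses two continuous features only (no categorical ones), it draws each synthetic stand-in from a fitted Gaussian linear regression on the other feature instead of running sequential knockoff sampling, and it uses a normal approximation for the one-sided test on the loss differences. All function names (`simulate`, `conditional_copy`, `cpi`, `run_demo`) are hypothetical. The idea it follows from the abstract is: fit a model, replace one feature in held-out data with a synthetic copy that preserves its dependence on the other features, and test whether the loss increases.

```python
import random
import statistics
from math import sqrt


def simulate(n, rng):
    """x1 drives y; x2 is correlated with x1 but has no effect of its own."""
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [0.8 * a + 0.6 * rng.gauss(0, 1) for a in x1]
    y = [2.0 * a + rng.gauss(0, 1) for a in x1]
    return x1, x2, y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_ols(x1, x2, y):
    """Least-squares fit of y = b1*x1 + b2*x2 (no intercept: all means are zero)."""
    s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
    s1y, s2y = dot(x1, y), dot(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def conditional_copy(target, other, rng):
    """Assumed stand-in for a knockoff: sample `target` given `other`
    from a Gaussian linear model fitted on the same data."""
    slope = dot(target, other) / dot(other, other)
    resid = [t - slope * o for t, o in zip(target, other)]
    resid_sd = statistics.pstdev(resid)
    return [slope * o + rng.gauss(0, resid_sd) for o in other]


def cpi(j, coefs, x1, x2, y, rng):
    """Mean increase in squared error when feature j is swapped for its
    synthetic copy, plus a one-sided p-value (normal approximation)."""
    b1, b2 = coefs
    k1 = conditional_copy(x1, x2, rng) if j == 0 else x1
    k2 = conditional_copy(x2, x1, rng) if j == 1 else x2
    delta = [
        (yi - b1 * c - b2 * d) ** 2 - (yi - b1 * a - b2 * b) ** 2
        for a, b, c, d, yi in zip(x1, x2, k1, k2, y)
    ]
    mean = statistics.fmean(delta)
    z = mean / (statistics.stdev(delta) / sqrt(len(delta)))
    return mean, 1.0 - statistics.NormalDist().cdf(z)


def pearson(u, v):
    """Plain Pearson correlation, used here as a marginal importance measure."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    return dot(du, dv) / sqrt(dot(du, du) * dot(dv, dv))


def run_demo(seed=0, n=2000):
    rng = random.Random(seed)
    coefs = fit_ols(*simulate(n, rng))  # model fitted on a training sample
    x1, x2, y = simulate(n, rng)        # importance measured on held-out data
    return {
        "marginal_corr_x2": pearson(x2, y),
        "cpi_x1": cpi(0, coefs, x1, x2, y, rng),
        "cpi_x2": cpi(1, coefs, x1, x2, y, rng),
    }


if __name__ == "__main__":
    for name, value in run_demo().items():
        print(name, value)
```

In this setup `x2` is strongly correlated with `y` only through `x1`, so a marginal measure flags it as important, while the conditional loss difference for `x2` stays close to zero and the one for `x1` is clearly positive. Extending the sketch to mixed data would require a conditional sampler for categorical features, which is the role sequential knockoffs play in the paper.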


Updated: 2023-04-29
