Test Input Prioritization for Machine Learning Classifiers
IEEE Transactions on Software Engineering (IF 7.4), Pub Date: 2024-01-05, DOI: 10.1109/tse.2024.3350019
Xueqi Dang, Yinghua Li, Mike Papadakis, Jacques Klein, Tegawendé F. Bissyandé, Yves Le Traon

Machine learning has achieved remarkable success across diverse domains. Nevertheless, concerns about the interpretability of black-box models, especially Deep Neural Networks (DNNs), have become pronounced in safety-critical fields such as healthcare and finance. Classical machine learning (ML) classifiers, known for their higher interpretability, are preferred in these domains. Like DNNs, classical ML classifiers can exhibit bugs that lead to severe consequences in practice. Test input prioritization has emerged as a promising approach to ensuring the quality of an ML system: it ranks potentially misclassified tests first so that such tests can be identified earlier with limited manual labeling cost. However, when applied to classical ML classifiers, existing DNN test prioritization methods are constrained in three ways: 1) coverage-based methods are inefficient and time-consuming; 2) mutation-based methods cannot be adapted to classical ML models because their model mutation rules do not transfer; 3) confidence-based methods are restricted to a single dimension when applied to binary ML classifiers, depending solely on the model's prediction probability for one class. To overcome these challenges, we propose MLPrior, a test prioritization approach specifically tailored to classical ML models. MLPrior leverages the characteristics of classical ML classifiers (i.e., interpretable models and carefully engineered attribute features) to prioritize test inputs. The foundational principles are: 1) tests more sensitive to mutations are more likely to be misclassified, and 2) tests closer to the model's decision boundary are more likely to be misclassified. Building on the first principle, we design mutation rules to generate two types of mutation features (i.e., model mutation features and input mutation features) for each test.
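The abstract does not state the concrete mutation rules, so the following is only a minimal sketch of the input-mutation idea: perturb a test input many times and measure how often a fixed classifier's prediction flips. The toy linear classifier, the Gaussian perturbation rule, and all parameter values are assumptions for illustration, not the paper's actual design.

```python
import numpy as np

# Toy linear binary classifier standing in for a classical ML model
# (hypothetical; the paper does not specify the model here).
def predict(w, b, x):
    return int(np.dot(w, x) + b > 0)

def input_mutation_feature(w, b, x, n_mutants=100, scale=0.1, seed=0):
    """Fraction of slightly perturbed copies of x whose prediction
    differs from the original -- a proxy for mutation sensitivity
    (illustrative rule; the paper's rules may differ)."""
    rng = np.random.default_rng(seed)
    orig = predict(w, b, x)
    flips = 0
    for _ in range(n_mutants):
        mutant = x + rng.normal(0.0, scale, size=x.shape)
        if predict(w, b, mutant) != orig:
            flips += 1
    return flips / n_mutants

w = np.array([1.0, -1.0])
b = 0.0
near_boundary = np.array([0.05, 0.0])     # sits close to w.x + b = 0
far_from_boundary = np.array([5.0, -5.0])  # far from the boundary

s_near = input_mutation_feature(w, b, near_boundary)
s_far = input_mutation_feature(w, b, far_from_boundary)
```

Under this sketch, `s_near` exceeds `s_far`, consistent with the principle that mutation-sensitive tests tend to lie near the decision boundary and are more likely to be misclassified.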
Drawing on the second principle, MLPrior generates attribute features for each test from its attribute values, which indirectly reveal the proximity between the test and the decision boundary. MLPrior then combines each test's three types of features into a final vector. Subsequently, MLPrior employs a pre-trained ranking model to predict the misclassification probability of each test from its final vector and ranks tests accordingly. We conducted an extensive study to evaluate MLPrior on 185 subjects, encompassing natural datasets, mixed noisy datasets, and fairness datasets. The results demonstrate that MLPrior outperforms all compared test prioritization approaches, with average improvements of 14.74% to 66.93% on natural datasets, 18.55% to 67.73% on mixed noisy datasets, and 15.34% to 62.72% on fairness datasets.
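The final ranking step can be sketched as follows. The logistic scorer standing in for the pre-trained ranking model, its fixed weights, and the three-element feature vectors are all hypothetical placeholders; only the overall shape (score each test's final vector, sort descending) follows the abstract.

```python
import numpy as np

# Hypothetical stand-in for MLPrior's pre-trained ranking model:
# a fixed logistic scorer over the concatenated feature vector.
def misclassification_score(weights, bias, features):
    z = float(np.dot(weights, features) + bias)
    return 1.0 / (1.0 + np.exp(-z))

def prioritize(tests, weights, bias):
    """Rank test indices by descending predicted misclassification
    probability, so likely-misclassified tests are labeled first."""
    scores = [misclassification_score(weights, bias, f) for f in tests]
    order = sorted(range(len(tests)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Three tests, each with an illustrative final vector of
# [model-mutation, input-mutation, attribute] features.
tests = [
    np.array([0.1, 0.2, 0.1]),  # stable, far from the boundary
    np.array([0.9, 0.8, 0.7]),  # highly mutation-sensitive
    np.array([0.5, 0.4, 0.6]),
]
weights = np.array([1.0, 1.0, 1.0])
order, scores = prioritize(tests, weights, bias=-1.0)
print(order)  # [1, 2, 0]
```

The mutation-sensitive test is ranked first, so under a limited labeling budget it would be inspected before the stable ones.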
