Impact of machine learning-based imputation techniques on medical datasets- a comparative analysis
Multimedia Tools and Applications (IF 3.6), Pub Date: 2024-04-11, DOI: 10.1007/s11042-024-19103-0
Shweta Tiwaskar, Mamoon Rashid, Prasad Gokhale

Missing data are common in medical datasets, and in diabetes datasets in particular. Uncovering useful patterns through medical data analysis is crucial for early and precise medical prediction, but this depends on data quality and on handling missing values properly. Imputation is a robust way to address this challenge. This paper provides a comprehensive investigation of the effects of Machine Learning (ML) based imputation techniques, specifically K Nearest Neighbor Imputation (KNNI), Multiple Imputation by Chained Equations (MICE), and MissForest. The results of all three techniques are compared with the complete dataset at five missing rates (10% to 50%) and evaluated using four categories of criteria: (1) model performance, (2) imputation error (Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Coefficient of Determination (R^2)), (3) Pearson correlation analysis, and (4) model selection criteria (Bayesian information criterion (BIC) and Akaike information criterion (AIC)). Model performance covers the accuracy, precision, recall, F1 score, and Matthews Correlation Coefficient (Mcoff) of four ML classifiers: (a) Random Forest (RF), (b) Support Vector Machine (SVM), (c) AdaBoost, and (d) XGBoost (XGB). Across the missing-rate cases, MissForest outperforms KNNI and MICE in accuracy and Mcoff in 80% of cases, in precision in 40% of cases, in recall in 60% of cases, in F1 score, MAE, RMSE, and R^2 in 100% of cases, in AIC in 80% of cases, and in BIC in 100% of cases. The correlation analysis also confirms that MissForest imputation preserves the associations between variables observed in the complete dataset. Overall, the findings suggest that MissForest is a better ML-based imputation technique than KNNI and MICE for handling missing data in diabetes research.
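The evaluation loop described in the abstract (remove values at a chosen missing rate, impute them, then score the imputed values against the complete data with MAE, RMSE, and R^2) can be illustrated with a minimal sketch. This is not the authors' code. The synthetic three-column dataset, the function names `mask_values`, `knn_impute`, and `imputation_errors`, the choice of k = 3, and the random masking scheme are all assumptions made for illustration. The plain-Python nearest-neighbor imputer is a simplified stand-in for KNNI; MICE and MissForest are not implemented here.

```python
import math
import random


def mask_values(rows, rate, seed=0):
    """Copy `rows`, set about `rate` of the cells to None, and return the masked positions."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    masked = rng.sample(cells, int(round(rate * len(cells))))
    out = [list(r) for r in rows]
    for i, j in masked:
        out[i][j] = None
    return out, masked


def knn_impute(rows, k=3):
    """Fill each None with the mean of that column over the k nearest rows.

    Distance is the mean squared difference over the columns both rows observe.
    Falls back to the column mean when no usable neighbor exists.
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        seen = [r[j] for r in rows if r[j] is not None]
        col_means.append(sum(seen) / len(seen) if seen else 0.0)
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j in range(n_cols):
            if row[j] is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    candidates.append((sum(shared) / len(shared), other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [value for _, value in candidates[:k]]
            filled[i][j] = sum(nearest) / len(nearest) if nearest else col_means[j]
    return filled


def imputation_errors(truth, imputed, masked):
    """Return MAE, RMSE, and R^2 computed only over the masked cells."""
    t = [truth[i][j] for i, j in masked]
    p = [imputed[i][j] for i, j in masked]
    n = len(t)
    mae = sum(abs(a - b) for a, b in zip(t, p)) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(t, p))
    rmse = math.sqrt(ss_res / n)
    mean_t = sum(t) / n
    ss_tot = sum((a - mean_t) ** 2 for a in t)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return mae, rmse, r2


# Hypothetical complete dataset: three correlated numeric columns.
data_rng = random.Random(42)
complete = []
for _ in range(60):
    x = data_rng.uniform(0, 1)
    complete.append([x, 2 * x + data_rng.gauss(0, 0.05), 1 - x + data_rng.gauss(0, 0.05)])

# Repeat the experiment at the five missing rates named in the abstract.
results = {}
for rate in (0.1, 0.2, 0.3, 0.4, 0.5):
    incomplete, masked = mask_values(complete, rate, seed=1)
    imputed = knn_impute(incomplete, k=3)
    results[rate] = imputation_errors(complete, imputed, masked)
    mae, rmse, r2 = results[rate]
    print(f"rate={rate:.0%}  MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

Scoring only the masked cells keeps the observed values from inflating the metrics. In practice a library imputer would likely replace `knn_impute`; scikit-learn, for example, provides `KNNImputer` and an experimental `IterativeImputer`, though the abstract does not state which implementations the authors used.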




Updated: 2024-04-12