An efficient feature selection framework based on information theory for high dimensional data, Applied Soft Computing


Applied Soft Computing (IF 8.7) Pub Date: 2021-07-23, DOI: 10.1016/j.asoc.2021.107729
G. Manikandan, S. Abirami


Feature selection plays a vital role in many fields, particularly pattern recognition and bioinformatics, where informative and relevant features must be selected from high-dimensional datasets. Growing dimensionality, together with redundant and irrelevant features, creates serious performance problems when processing and analysing the data. This paper proposes a feature selection technique called mutual information and Monte Carlo based feature selection (MIMCFS). It comprises two stages. The first stage selects predominant features from the high-dimensional data; the second stage eliminates redundant features that were selected in the first stage. For the first stage, a new feature selection strategy based on the approximate Markov blanket and mutual information is proposed to identify irrelevant and redundant features. For the second stage, to avoid misjudging redundant features as relevant ones, a new strategy based on Monte Carlo tree search is proposed to completely eliminate redundant features and to improve feature interaction. For experimental evaluation, eight benchmark microarray datasets pertaining to cancer analysis, including imbalanced ones, are used. To compare and justify the performance of the proposed method, seven state-of-the-art feature selection techniques, including CFS, Relief, DISR, JMI, CMIM and CMI, are employed. The outputs of these techniques are provided to three standard classifiers, namely Naive Bayes, SVM and C4.5, to assess the significance of the selected features in building classification models. 10-fold cross-validation is adopted to evaluate the classifiers. Accuracy, precision, recall, F-measure, standard deviation and statistical significance are measured to quantify classifier performance. Experimental results demonstrate the outstanding performance of the proposed algorithm compared with the standard existing methods.
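
The abstract does not spell out how the first stage combines mutual information with the approximate Markov blanket, so the following is only a minimal sketch of one well-known way to do it (an FCBF-style filter), not the MIMCFS procedure itself. It assumes features have already been discretised (microarray expression values are continuous), uses symmetric uncertainty as the normalised mutual-information score, and treats a kept feature as an approximate Markov blanket for a candidate when the kept feature is at least as relevant to the class and is at least as correlated with the candidate as the candidate is with the class. The function names, the `threshold` parameter and the toy data are all hypothetical. The Monte Carlo tree search stage is not shown.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for two discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def symmetric_uncertainty(x, y):
    """Mutual information normalised to [0, 1]; 0.0 if both are constant."""
    denom = entropy(x) + entropy(y)
    return 2.0 * mutual_information(x, y) / denom if denom > 0 else 0.0


def select_features(features, labels, threshold=0.0):
    """FCBF-style filter (an assumption, not the paper's exact method).

    features: dict mapping feature name -> list of discrete values.
    Returns the names of the kept features, most relevant first.
    """
    # Relevance: score every feature against the class labels and
    # drop those at or below the threshold (irrelevant features).
    relevance = {name: symmetric_uncertainty(col, labels)
                 for name, col in features.items()}
    ranked = sorted((n for n in relevance if relevance[n] > threshold),
                    key=lambda n: relevance[n], reverse=True)

    # Redundancy: walking in descending relevance, a kept feature acts as
    # an approximate Markov blanket for a candidate if their mutual
    # correlation is at least the candidate's relevance to the class.
    selected = []
    for name in ranked:
        redundant = any(
            symmetric_uncertainty(features[kept], features[name])
            >= relevance[name]
            for kept in selected
        )
        if not redundant:
            selected.append(name)
    return selected


# Hypothetical toy data: 8 samples, binary class, 4 discretised features.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_features = {
    "g1": [0, 0, 0, 1, 1, 1, 1, 1],  # strongly relevant
    "g2": [0, 0, 0, 1, 1, 1, 1, 1],  # exact copy of g1 (redundant)
    "g3": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the class (irrelevant)
    "g4": [0, 0, 1, 0, 0, 1, 1, 1],  # weakly relevant, not covered by g1
}
selected = select_features(toy_features, toy_labels)
print(selected)  # ['g1', 'g4']
```

On the toy data, `g3` is dropped as irrelevant and `g2` as redundant with `g1`, while `g4` survives because its correlation with `g1` is lower than its relevance to the class. A pairwise filter of this kind can still misjudge features that only matter jointly, which is the sort of gap a second, search-based stage would aim to close.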


Updated: 2021-07-30
