A simple and efficient filter feature selection method via document-term matrix unitization
Pattern Recognition Letters ( IF 5.1 ) Pub Date : 2024-03-19 , DOI: 10.1016/j.patrec.2024.02.025
Qing Li , Shuai Zhao , Tengjiao He , Jinming Wen

Text processing tasks commonly grapple with the challenge of high dimensionality. One of the most effective solutions to this challenge is to preprocess text data with feature selection methods. Feature selection picks, from the text's native feature space, the features most advantageous for subsequent operations (e.g., classification). This process effectively trims the dimensionality of the feature space, enhancing the efficiency and accuracy of subsequent operations. This paper proposes a straightforward and efficient filter feature selection method based on document-term matrix unitization (DTMU) for text processing. Diverging from previous filter feature selection methods that concentrate on defining scoring criteria, our method achieves better feature selection by unitizing each column of the document-term matrix. This mitigates feature-to-feature influence and reinforces the role of the weighting proportion within each feature. Our scoring criterion then subtracts the sum of a feature's weights over negative samples from its sum over positive samples and takes the absolute value. We conduct numerical experiments comparing DTMU with four advanced filter feature selection methods: max–min ratio metric, proportional rough feature selector, least loss, and relative discrimination criterion, along with two classical filter feature selection methods: Chi-square and information gain. The experiments are performed on four ten-thousand-dimensional feature space datasets and two thousand-dimensional feature space datasets, sourced from Amazon product reviews and movie reviews. Experimental findings demonstrate that DTMU selects more advantageous features for subsequent operations and achieves a higher dimensionality reduction rate than the other six methods used for comparison. Moreover, DTMU exhibits robust generalization capabilities across various classifiers and dataset dimensionalities.
Notably, the average CPU time for a single run of DTMU is measured at 1.455 s.
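The scoring procedure sketched in the abstract can be illustrated in a few lines of NumPy. This is a hypothetical reconstruction, not the authors' code: the abstract does not specify the norm used for unitization, so L2 normalization of each column is assumed here, and the function names (`dtmu_scores`, `select_top_k`) are invented for illustration.

```python
import numpy as np

def dtmu_scores(X, y):
    """Hypothetical sketch of DTMU-style feature scoring.

    X : (n_docs, n_terms) document-term weight matrix
    y : (n_docs,) binary labels (1 = positive, 0 = negative)
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Unitize each column (term) so no feature dominates another;
    # L2 normalization is an assumption, as the norm is not stated.
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0  # guard against all-zero columns
    U = X / norms
    # Score each term as |sum over positive docs - sum over negative docs|.
    pos = U[y == 1].sum(axis=0)
    neg = U[y == 0].sum(axis=0)
    return np.abs(pos - neg)

def select_top_k(X, y, k):
    """Return indices of the k highest-scoring terms."""
    scores = dtmu_scores(X, y)
    return np.argsort(scores)[::-1][:k]
```

A term that appears with similar total weight in both classes scores near zero after unitization, while a term concentrated in one class scores highly, which matches the discriminative intent described in the abstract.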
