A flexible non-monotonic discretization method for pre-processing in supervised learning
Pattern Recognition Letters (IF 5.1) | Pub Date: 2024-03-27 | DOI: 10.1016/j.patrec.2024.03.024
Hatice Şenozan, Banu Soylu

Discretization is one of the most important pre-processing steps for supervised learning. Discretizing attributes simplifies the data and, by reducing the number of distinct values, makes it easier to understand and analyze. It can provide a better representation of knowledge and thus help improve the accuracy of a classifier. However, to minimize information loss, it is important to consider the characteristics of the data. Most approaches assume that the values of a continuous attribute are monotone with respect to the probability of belonging to a particular class; that is, increasing or decreasing the attribute value is assumed to produce a proportional increase or decrease in the classification score. This assumption does not always hold for every attribute. In this study, we present entropy-based, flexible discretization strategies capable of capturing the non-monotonicity of attribute values. The algorithm adjusts the number and values of cut points according to the characteristics of the data, and it does not require setting any hyper-parameters or thresholds. Extensive experiments on diverse datasets show that the proposed discretizers significantly improve classifier performance, especially on complex and high-dimensional datasets.
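To make the idea concrete, here is a minimal sketch (not the authors' algorithm) of recursive entropy-based cut-point selection for one continuous attribute. Because each sub-interval is split independently, the recursion can place several cut points inside the range and thus bracket a class region from both sides, which a single monotone cut cannot do. For simplicity the sketch uses a hypothetical `max_depth` stopping rule, whereas the proposed method is described as adapting the number of cut points without any hyper-parameter.

```python
import numpy as np

def class_entropy(y):
    """Shannon entropy (bits) of the class distribution in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_cut(x, y):
    """Cut point minimizing weighted class entropy, or None if no cut helps."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_value, best_entropy = None, class_entropy(y)
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:          # candidate cuts lie between distinct values
            continue
        e = (i * class_entropy(y[:i]) + (len(y) - i) * class_entropy(y[i:])) / len(y)
        if e < best_entropy:
            best_value, best_entropy = (x[i - 1] + x[i]) / 2.0, e
    return best_value

def discretize(x, y, max_depth=3):
    """Recursively collect entropy-minimizing cut points for attribute x."""
    if max_depth == 0 or len(np.unique(y)) < 2:
        return []
    cut = best_cut(x, y)
    if cut is None:
        return []
    left = x <= cut
    return sorted(discretize(x[left], y[left], max_depth - 1) + [cut]
                  + discretize(x[~left], y[~left], max_depth - 1))

# A non-monotone attribute: the positive class occupies the middle of the range,
# so class probability rises and then falls as x increases.
x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
y = np.array([0, 0, 1, 1, 1, 0, 0, 0])
print(discretize(x, y))  # -> [0.25, 0.55], two cuts bracketing the positive region
```

A monotone discretizer constrained to one cut could separate only one boundary of the positive region in this example; the recursive variant recovers both, which is the behavior the paper's non-monotonic strategies are designed to capture.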
