ANDClust: An Adaptive Neighborhood Distance-Based Clustering Algorithm to Cluster Varying Density and/or Neck-Typed Datasets
Advanced Theory and Simulations (IF 3.3), Pub Date: 2024-03-08, DOI: 10.1002/adts.202301113
Ali Şenol

Although density-based clustering algorithms can successfully define clusters of arbitrary shape, they encounter issues if the dataset has varying densities or neck-typed clusters, because they require precise distance parameters, such as the eps parameter of DBSCAN. These approaches assume that data density is homogeneous, but this is rarely the case in practice. In this study, a new clustering algorithm named ANDClust (Adaptive Neighborhood Distance-based Clustering Algorithm) is proposed to handle datasets with varying density and/or neck-typed clusters. The algorithm consists of three parts. The first part uses Multivariate Kernel Density Estimation (MulKDE) to find the dataset's peak points, which are the start points from which the Minimum Spanning Tree (MST) constructs clusters in the second part. Lastly, an Adaptive Neighborhood Distance (AND) ratio is used to weight the distance between pairs of data points. This enables the approach to support inter-cluster and intra-cluster density variations by acting as if the distance parameter differs for each data point in the dataset. ANDClust is tested on synthetic and real datasets to assess its efficiency. The algorithm shows superior clustering quality with good run-time compared to its competitors. Moreover, ANDClust can effectively define clusters of arbitrary shape and process high-dimensional, imbalanced datasets that may contain outliers.
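The three-part structure described in the abstract can be illustrated with a small sketch. This is not the published ANDClust implementation: the abstract does not give the AND ratio formula, the KDE bandwidth, or the rule for stopping MST growth, so all of these are assumptions here. The function names (`kde_density`, `local_scale`, `adaptive_mst_clusters`), the parameters `k`, `ratio`, and `bandwidth`, and the use of the mean k-nearest-neighbor distance as a stand-in for the adaptive neighborhood distance are hypothetical choices made only for illustration.

```python
import heapq
import math


def kde_density(points, bandwidth):
    """Unnormalised Gaussian kernel density estimate evaluated at each point."""
    return [
        sum(math.exp(-math.dist(p, q) ** 2 / (2 * bandwidth ** 2)) for q in points)
        for p in points
    ]


def local_scale(points, k):
    """Mean distance to the k nearest neighbours (assumed proxy, not the paper's AND)."""
    scales = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scales.append(sum(dists[:k]) / k)
    return scales


def adaptive_mst_clusters(points, k=4, ratio=2.0, bandwidth=0.5):
    """Grow Prim-style trees from density peaks using locally scaled distances.

    Assumes no point has k exact duplicates (that would give a zero local scale).
    """
    n = len(points)
    density = kde_density(points, bandwidth)
    scale = local_scale(points, k)

    def weight(i, j):
        # Distance measured in units of the smaller local scale of the pair.
        return math.dist(points[i], points[j]) / min(scale[i], scale[j])

    labels = [-1] * n
    next_label = 0
    # The densest point that is still unlabelled seeds the next tree.
    for seed in sorted(range(n), key=lambda i: -density[i]):
        if labels[seed] != -1:
            continue
        heap = [(0.0, seed)]
        while heap:
            w, j = heapq.heappop(heap)
            if labels[j] != -1:
                continue
            if w > ratio:
                break  # cheapest remaining edge is too long, so this tree is complete
            labels[j] = next_label
            for m in range(n):
                if labels[m] == -1:
                    heapq.heappush(heap, (weight(j, m), m))
        next_label += 1
    return labels


# Toy data: a dense 5x5 grid (spacing 0.1) next to a sparse 4x4 grid (spacing 1.0).
dense = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
sparse = [(1.0 + a, float(b)) for a in range(4) for b in range(4)]
labels = adaptive_mst_clusters(dense + sparse)
print(len(set(labels)))
```

In this toy dataset the gap between the two grids (0.6) is smaller than the spacing inside the sparse grid (1.0), so a single global distance threshold applied to raw distances would either merge the grids or break up the sparse one. Dividing each distance by a local scale is one simple way to mimic a per-point distance parameter; the AND ratio defined in the paper may behave differently.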

Updated: 2024-03-08