NDPD: an improved initial centroid method of partitional clustering for big data mining, Journal of Advances in Management Research



Journal of Advances in Management Research, Pub Date: 2022-08-23, DOI: 10.1108/jamr-07-2021-0242
Kamlesh Kumar Pandey , Diwakar Shukla

Purpose

The K-means (KM) clustering algorithm is highly sensitive to the selection of initial centroids, since the initial centroids determine its computational effectiveness, efficiency and susceptibility to local optima. Numerous initialization strategies attempt to overcome these problems through random or deterministic selection of initial centroids. Random initialization suffers from local optima and the worst clustering performance, while deterministic initialization incurs a high computational cost. Big data clustering aims to reduce computation cost and improve clustering efficiency. The objective of this study is to obtain better initial centroids for big data clustering on business management data without using random or deterministic initialization, so as to avoid local optima and improve clustering efficiency and effectiveness in terms of cluster quality, computation cost, data comparisons and iterations on a single machine.
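The sensitivity to initial centroids described above can be illustrated with a minimal sketch of the standard Lloyd iteration for K-means. This is not code from the paper: the function name `lloyd_kmeans`, the toy data and both sets of starting centroids are assumptions chosen for illustration only.

```python
import numpy as np

def lloyd_kmeans(X, init_centroids, max_iter=300, tol=1e-9):
    """Plain Lloyd iteration: returns labels, centroids, iterations used, SSE."""
    X = np.asarray(X, dtype=float)
    centroids = np.array(init_centroids, dtype=float)
    for iteration in range(1, max_iter + 1):
        # Squared Euclidean distance from every point to every centroid: (n, k)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                new_centroids[j] = members.mean(axis=0)
        shift = np.abs(new_centroids - centroids).max()
        centroids = new_centroids
        if shift <= tol:
            break
    # Recompute assignments and the sum of squared errors for the final centroids
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    sse = d2.min(axis=1).sum()
    return labels, centroids, iteration, sse

# Toy data (assumed): three well-separated groups of four points each
X = np.array([[x + dx, dy] for x in (0, 10, 20) for dx in (0, 1) for dy in (0, 1)],
             dtype=float)

# One starting centroid per group versus two starting centroids in the same group
_, _, iters_good, sse_good = lloyd_kmeans(X, [[0, 0], [10, 0], [20, 0]])
_, _, iters_bad, sse_bad = lloyd_kmeans(X, [[0, 0], [1, 1], [15, 0.5]])
print(f"good init: iterations={iters_good}, SSE={sse_good:.2f}")
print(f"bad init:  iterations={iters_bad}, SSE={sse_bad:.2f}")
```

On this toy data the second start places two centroids in one group, so the run converges to a partition with a much larger sum of squared errors than the first start. This is the local optimum problem that initialization methods try to avoid.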

Design/methodology/approach

This study presents the Normal Distribution Probability Density (NDPD) algorithm for big data clustering on a single machine to solve business management-related clustering issues. The resulting NDPDKM algorithm addresses the KM initialization problem by using the probability density of each data point. NDPDKM first identifies the data points with the highest probability density, computed from the mean and standard deviation of the dataset through the normal probability density function. It then determines the K initial centroids by using sorting and linear systematic sampling heuristics.
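The abstract gives only the outline of the initialization, so the following is a rough sketch of one way the described steps could fit together, not the authors' implementation. Several details are assumptions: the function name `ndpd_style_init`, combining per-feature normal densities by summing their logarithms (which treats features as independent), ranking points from highest to lowest density, and starting the systematic sample at the midpoint of the first interval.

```python
import numpy as np

def ndpd_style_init(X, k):
    """Pick k initial centroids by density ranking plus systematic sampling.

    Illustrative reading of the abstract, not the published NDPD algorithm.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of rows")

    # Mean and standard deviation of each feature
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # assumption: constant features add a constant term

    # Log of the normal probability density per feature, summed per point
    # (assumption: features are combined as if independent)
    z = (X - mu) / sigma
    log_pdf = -0.5 * z ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
    score = log_pdf.sum(axis=1)

    # Sort points from highest to lowest density (stable, so ties are deterministic)
    order = np.argsort(-score, kind="stable")

    # Linear systematic sampling: one pick per block of n // k sorted points
    step = n // k
    start = step // 2  # assumption: start at the midpoint of the first block
    picks = order[start + step * np.arange(k)]
    return X[picks]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
centroids = ndpd_style_init(X, 4)
print(centroids.shape)
```

Because nothing in the procedure is random, repeated calls on the same data return the same centroids, and the cost is dominated by a single sort of the density scores.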

Findings

The performance of the proposed algorithm is compared with the KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms on eight real business datasets, using the Davies Bouldin score, Silhouette coefficient, SD Validity, S_Dbw Validity, number of iterations and CPU time as validation indices. The experimental evaluation demonstrates that, compared with the other algorithms, NDPDKM reduces iterations, local optima and computing cost, and improves cluster performance, effectiveness and efficiency with stable convergence. NDPDKM reduces the average computing time by up to 34.83%, 90.28%, 71.83%, 92.67%, 69.53% and 76.03%, and the average number of iterations by up to 40.32%, 44.06%, 32.02%, 62.78%, 19.07% and 36.74%, relative to the KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms, respectively.
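A comparison of this kind could be set up roughly as follows with scikit-learn. The sketch covers only the indices that scikit-learn provides directly (Davies Bouldin score and Silhouette coefficient) plus iteration count and CPU time. SD Validity and S_Dbw are not included, the synthetic `make_blobs` data stands in for the business datasets, and the helper names `evaluate_init` and `percent_reduction` are assumptions.

```python
import time

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

def evaluate_init(X, k, init, seed=0):
    """Run K-means once with the given init and collect validation indices."""
    start = time.process_time()
    model = KMeans(n_clusters=k, init=init, n_init=1, random_state=seed).fit(X)
    cpu_seconds = time.process_time() - start
    return {
        "davies_bouldin": davies_bouldin_score(X, model.labels_),  # lower is better
        "silhouette": silhouette_score(X, model.labels_),          # higher is better
        "iterations": model.n_iter_,
        "cpu_seconds": cpu_seconds,
    }

def percent_reduction(baseline, proposed):
    """Relative saving of `proposed` over `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline

# Synthetic stand-in data (assumption): 500 points around 4 centers
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# `init` may be "random", "k-means++", or an array of k starting centroids
results = {name: evaluate_init(X, 4, name) for name in ("random", "k-means++")}
for name, row in results.items():
    print(name, row)
```

Passing an array of starting centroids as `init` lets any custom initialization be scored with the same helper, and `percent_reduction` expresses the saving in iterations or CPU time relative to a baseline in the same form as the percentages reported above.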

Originality/value

The KM algorithm is the most widely used partitional clustering approach in data mining, where it extracts hidden knowledge, patterns and trends to support decision-making strategies in business data. Business analytics is one application of big data clustering in which KM clustering is useful across subcategories such as customer segmentation analysis, employee salary and performance analysis, document searching, delivery optimization, discount and offer analysis, chaplain management, manufacturing analysis, productivity analysis, specialized employee and investor searching, and other decision-making strategies in business.


Updated: 2022-08-23
