Genetic Programming for Evolving Similarity Functions for Clustering: Representations and Analysis
Evolutionary Computation (IF 6.8), Pub Date: 2020-12-01, DOI: 10.1162/evco_a_00264
Andrew Lensen, Bing Xue, Mengjie Zhang

Clustering is a difficult and widely studied data mining task, with many varieties of clustering algorithms proposed in the literature. Nearly all algorithms use a similarity measure such as a distance metric (e.g., Euclidean distance) to decide which instances to assign to the same cluster. These similarity measures are generally predefined and cannot be easily tailored to the properties of a particular dataset, which leads to limitations in the quality and the interpretability of the clusters produced. In this article, we propose a new approach to automatically evolving similarity functions for a given clustering algorithm by using genetic programming. We introduce a new genetic programming-based method which automatically selects a small subset of features (feature selection) and then combines them using a variety of functions (feature construction) to produce dynamic and flexible similarity functions that are specifically designed for a given dataset. We demonstrate how the evolved similarity functions can be used to perform clustering using a graph-based representation. The results of a variety of experiments across a range of large, high-dimensional datasets show that the proposed approach can achieve higher and more consistent performance than the benchmark methods. We further extend the proposed approach to automatically produce multiple complementary similarity functions by using a multi-tree approach, which gives further performance improvements. We also analyse the interpretability and structure of the automatically evolved similarity functions to provide insight into how and why they are superior to standard distance metrics.
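
As a rough illustration of the idea in the abstract, the sketch below shows how a similarity function that uses only a few features can drive a graph-based clustering. It is not the authors' method: the function `example_similarity` is a hand-written stand-in for an evolved expression (no genetic programming is run), and the graph rule (link each instance to its single most similar neighbour, then take connected components as clusters), the toy `DATA`, and all names are assumptions made for this example.

```python
import math

# Toy data (hypothetical): features 0 and 2 carry the cluster structure,
# feature 1 is large-magnitude noise that misleads plain Euclidean distance.
DATA = [
    (0.0, 50.0, 0.0),
    (0.1, -50.0, 0.2),
    (0.2, 0.0, 0.1),
    (5.0, 50.0, 5.0),
    (5.1, -50.0, 5.2),
    (5.2, 0.0, 4.9),
]


def example_similarity(a, b):
    """Hand-written stand-in for an evolved similarity function.

    Uses only features 0 and 2 (feature selection) and combines them
    arithmetically (feature construction). Higher means more similar.
    """
    d0 = abs(a[0] - b[0])
    d2 = abs(a[2] - b[2])
    return -(d0 + 0.5 * d2)


def euclidean_similarity(a, b):
    """Baseline: negated Euclidean distance over all features."""
    return -math.dist(a, b)


def cluster_by_nearest_neighbour(data, similarity):
    """Link each instance to its most similar neighbour; clusters are
    the connected components of the resulting graph (assumed rule)."""
    n = len(data)
    parent = list(range(n))  # union-find forest over instance indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    if n > 1:
        for i in range(n):
            best = max(
                (j for j in range(n) if j != i),
                key=lambda j: similarity(data[i], data[j]),
            )
            parent[find(i)] = find(best)  # add edge i -- best

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


if __name__ == "__main__":
    print(cluster_by_nearest_neighbour(DATA, example_similarity))
    print(cluster_by_nearest_neighbour(DATA, euclidean_similarity))
```

On this toy data the feature-selective similarity groups the first three and last three instances, whereas the Euclidean baseline pairs instances that happen to share the noisy feature. This mirrors, in miniature, the abstract's argument that a predefined distance metric may not suit the properties of a particular dataset.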

Updated: 2020-12-01