Online short text clustering using infinite extensions of discrete mixture models
Computational Intelligence (IF 2.8), Pub Date: 2023-07-10, DOI: 10.1111/coin.12593
Samar Hannachi, Fatma Najar, Hafsa Ennajari, Nizar Bouguila

Short text clustering is one of the fundamental tasks in natural language processing. Unlike traditional documents, short texts are ambiguous and sparse because of their brevity and because word usage rarely recurs from one text to another, which makes it very challenging to apply conventional machine learning algorithms directly. In this article, we propose two novel approaches for short text clustering: the collapsed Gibbs sampling infinite generalized Dirichlet multinomial mixture model (infinite GSGDMM) and the collapsed Gibbs sampling infinite Beta-Liouville multinomial mixture model (infinite GSBLMM). We adopt two flexible and practical priors for the multinomial distribution: the first integrates the generalized Dirichlet distribution, while the second is based on the Beta-Liouville distribution. We evaluate the proposed approaches on two well-known benchmark datasets, namely Google News and Tweet. The experimental results demonstrate the effectiveness of our models compared to baseline approaches that use Dirichlet priors. We further propose to improve the performance of our methods with an online clustering procedure. We also evaluate our methods on the outlier detection task, in which we achieve accurate results.
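
For orientation, the Dirichlet-prior baseline that the abstract compares against can be sketched as a collapsed Gibbs sampler for an infinite (Chinese-restaurant-process) multinomial mixture. This is a minimal illustration under stated assumptions, not the authors' GSGDMM or GSBLMM: it uses a symmetric Dirichlet prior rather than the generalized Dirichlet or Beta-Liouville priors, and the names `log_predictive`, `gibbs_cluster`, `alpha`, and `beta` are hypothetical choices made for this example.

```python
import math
import random
from collections import Counter


def log_predictive(doc, word_counts, total, vocab_size, beta):
    """Log Dirichlet-multinomial predictive of `doc` given a cluster's counts.

    Assumes a symmetric Dirichlet(beta) prior (the baseline, not the paper's priors).
    """
    logp = 0.0
    for w, c in Counter(doc).items():
        for j in range(c):
            logp += math.log(word_counts.get(w, 0) + beta + j)
    for i in range(len(doc)):
        logp -= math.log(total + vocab_size * beta + i)
    return logp


def gibbs_cluster(docs, alpha=1.0, beta=0.1, iters=20, seed=0):
    """Assign each tokenized document to a cluster; the cluster count is not fixed."""
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    assign = [None] * len(docs)
    sizes, counts, totals = {}, {}, {}  # per-cluster statistics, keyed by cluster id
    next_id = 0
    for _ in range(iters):
        for i, doc in enumerate(docs):
            k = assign[i]
            if k is not None:  # remove the document from its current cluster
                sizes[k] -= 1
                counts[k].subtract(doc)
                totals[k] -= len(doc)
                if sizes[k] == 0:
                    del sizes[k], counts[k], totals[k]
            # Existing clusters are weighted by size; a new one by alpha.
            labels = list(sizes) + [next_id]
            logw = [
                math.log(sizes[z])
                + log_predictive(doc, counts[z], totals[z], vocab_size, beta)
                for z in sizes
            ]
            logw.append(math.log(alpha) + log_predictive(doc, {}, 0, vocab_size, beta))
            mx = max(logw)  # subtract the max for numerical stability
            weights = [math.exp(l - mx) for l in logw]
            k = rng.choices(labels, weights=weights)[0]
            if k == next_id:  # open a new cluster
                sizes[k], counts[k], totals[k] = 0, Counter(), 0
                next_id += 1
            assign[i] = k
            sizes[k] += 1
            counts[k].update(doc)
            totals[k] += len(doc)
    return assign


if __name__ == "__main__":
    toy = [["stock", "market"], ["market", "crash"], ["cat", "dog"], ["dog", "pet"]]
    print(gibbs_cluster(toy))
```

Swapping the prior, as the article proposes, would change only the predictive term computed in `log_predictive`; the sampling loop over existing clusters plus one new cluster keeps the same shape.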


