Measurement of clustering effectiveness for document collections
Information Retrieval Journal (IF 2.5), Pub Date: 2022-01-10, DOI: 10.1007/s10791-021-09401-8
Meng Yuan, Justin Zobel, Pauline Lin

Clustering of the contents of a document corpus is used to create sub-corpora that are expected to consist of documents related to each other. However, while clustering is used in a variety of ways in document applications such as information retrieval, and a range of methods have been applied to the task, there has been relatively little exploration of how well it works in practice. Indeed, given the high dimensionality of the data, clustering may not always produce meaningful outcomes. In this paper we use a well-known clustering method to explore a variety of techniques, existing and novel, for measuring clustering effectiveness. Results with our new extrinsic techniques, based on relevance judgements or retrieved documents, demonstrate that retrieval-based information can be used to assess the quality of clustering, and also show that clustering can succeed to some extent at gathering together similar material. Further, they show that intrinsic clustering techniques that have been shown to be informative in other domains do not work for information retrieval. Whether clustering is sufficiently effective to have a significant impact on practical retrieval is unclear, but, as the results show, our measurement techniques can effectively distinguish between clustering methods.
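
To make the idea of an extrinsic, relevance-based measure concrete, the following is a minimal sketch and not the measure defined in the paper. It assumes one simple formulation: for each query, take the documents judged relevant and compute the fraction of relevant-document pairs that fall in the same cluster, then average over queries. The function name `relevant_pair_cohesion`, the `cluster_of` mapping, and the toy `qrels` data are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def relevant_pair_cohesion(cluster_of, qrels):
    """Mean, over queries, of the fraction of relevant-document pairs
    that share a cluster. Illustrative only; not the paper's measure.

    cluster_of: dict mapping document id -> cluster id
    qrels:      dict mapping query id -> set of relevant document ids
    """
    per_query = []
    for relevant in qrels.values():
        # Ignore judged documents that were never clustered.
        docs = sorted(d for d in relevant if d in cluster_of)
        pairs = list(combinations(docs, 2))
        if not pairs:
            continue  # a query with fewer than two relevant documents has no pairs
        same = sum(1 for a, b in pairs if cluster_of[a] == cluster_of[b])
        per_query.append(same / len(pairs))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Toy data (hypothetical): five documents in two clusters, two queries.
cluster_of = {"d1": 0, "d2": 0, "d3": 1, "d4": 1, "d5": 0}
qrels = {"q1": {"d1", "d2", "d3"}, "q2": {"d3", "d4"}}
score = relevant_pair_cohesion(cluster_of, qrels)
print(round(score, 4))
```

Under this assumed formulation, a score near 1 means documents relevant to the same query tend to be clustered together. A meaningful comparison between clustering methods would also need a baseline, such as the score expected from a random assignment with the same cluster sizes.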




Updated: 2022-01-11