Generalized Weak Supervision for Neural Information Retrieval
ACM Transactions on Information Systems (IF 5.6) Pub Date: 2024-04-27, DOI: 10.1145/3647639
Yen-Chieh Lien, Hamed Zamani, W. Bruce Croft

Neural ranking models (NRMs) have demonstrated effective performance in several information retrieval (IR) tasks. However, training NRMs often requires large-scale training data, which is difficult and expensive to obtain. To address this issue, one can train NRMs via weak supervision, where a large dataset is automatically generated using an existing ranking model (called the weak labeler) for training NRMs. Weakly supervised NRMs can generalize from the observed data and significantly outperform the weak labeler. This paper generalizes this idea through an iterative re-labeling process, demonstrating that weakly supervised models can iteratively play the role of weak labeler and significantly improve ranking performance without using manually labeled data. The proposed Generalized Weak Supervision (GWS) solution is generic and orthogonal to the ranking model architecture. This paper offers four implementations of GWS: self-labeling, cross-labeling, joint cross- and self-labeling, and greedy multi-labeling. GWS also benefits from a query importance weighting mechanism based on query performance prediction methods to reduce noise in the generated training data. We further draw a theoretical connection between self-labeling and Expectation-Maximization. Our experiments on four retrieval benchmarks suggest that our implementations of GWS lead to substantial improvements compared to weak supervision if the weak labeler is sufficiently reliable.
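The iterative re-labeling idea at the heart of GWS can be illustrated with a minimal self-labeling loop: an initial weak labeler generates pseudo-labels, a model is trained on them, and the trained model then replaces the labeler for the next round. The sketch below is an illustrative toy (the term-overlap "weak labeler" and averaging "trainer" are stand-in assumptions, not the paper's actual NRM training procedure):

```python
# Hedged sketch of the self-labeling loop described in the abstract.
# The "weak labeler" and "trainer" below are toy stand-ins for an
# unsupervised ranker (e.g. BM25) and neural ranking model training.

def weak_labeler(query, docs):
    # Stand-in for an existing ranking model: score documents by
    # simple term overlap with the query.
    q_terms = set(query.split())
    return {d: float(len(q_terms & set(d.split()))) for d in docs}

def train_ranker(labeled_data):
    # Stand-in for training an NRM on (query, doc, pseudo-score) data.
    # Here we just learn average per-term weights from the pseudo-labels.
    weights, counts = {}, {}
    for query, doc_scores in labeled_data:
        q_terms = set(query.split())
        for doc, score in doc_scores.items():
            for term in set(doc.split()) & q_terms:
                weights[term] = weights.get(term, 0.0) + score
                counts[term] = counts.get(term, 0) + 1
    avg = {t: weights[t] / counts[t] for t in weights}

    def ranker(query, docs):
        q_terms = set(query.split())
        return {d: sum(avg.get(t, 0.0) for t in set(d.split()) & q_terms)
                for d in docs}
    return ranker

def self_labeling(queries, docs, iterations=3):
    # Iterative re-labeling: the model trained in round i becomes the
    # weak labeler for round i + 1 -- no manually labeled data is used.
    labeler = weak_labeler
    for _ in range(iterations):
        pseudo_labeled = [(q, labeler(q, docs)) for q in queries]
        labeler = train_ranker(pseudo_labeled)
    return labeler
```

In the paper's terms, replacing `weak_labeler` in later rounds with a *different* trained model gives cross-labeling, and combining several labelers gives the joint and multi-labeling variants; the query importance weighting mechanism would additionally down-weight queries on which the labeler is predicted to perform poorly.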


