Dense Text Retrieval Based on Pretrained Language Models: A Survey
ACM Transactions on Information Systems (IF 5.6), Pub Date: 2024-02-09, DOI: 10.1145/3637870
Wayne Xin Zhao, Jing Liu, Ruiyang Ren, Ji-Rong Wen

Text retrieval is a long-standing research topic in information seeking, where a system is required to return information resources relevant to users' natural-language queries. From heuristic retrieval methods to learning-based ranking functions, the underlying retrieval models have continually evolved alongside ongoing technical innovation. To design effective retrieval models, a key issue is how to learn text representations and model relevance matching. The recent success of pretrained language models (PLMs) points the way toward more capable text-retrieval approaches that leverage the excellent modeling capacity of PLMs. With powerful PLMs, we can effectively learn the semantic representations of queries and texts in a latent representation space, and further construct a semantic matching function between the dense vectors for relevance modeling. Such a retrieval approach is called dense retrieval, since it employs dense vectors to represent texts. Considering the rapid progress in this area, this survey systematically reviews recent work on PLM-based dense retrieval. Unlike previous surveys on dense retrieval, we take a new perspective and organize the related studies along four major aspects (architecture, training, indexing, and integration), thoroughly summarizing the mainstream techniques for each aspect. We extensively collect recent advances on this topic, citing more than 300 reference papers. To support the survey, we create a website providing useful resources and release a code repository for dense retrieval. This survey aims to provide a comprehensive, practical reference on the major progress in dense text retrieval.
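
To make the core idea concrete, below is a minimal sketch of a PLM-based dense retriever in the bi-encoder style the abstract describes: a pretrained encoder maps the query and each text to dense vectors, and relevance is scored by a similarity function (here an inner product) between those vectors. It assumes the Hugging Face transformers and torch packages; the "bert-base-uncased" checkpoint, [CLS] pooling, and example texts are illustrative assumptions, not the survey's specific method.

import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative encoder choice; dense retrievers typically fine-tune such a PLM.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

def encode(texts):
    # Tokenize and run the PLM; take the [CLS] vector as the dense representation.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**batch)
    return outputs.last_hidden_state[:, 0]

query_vecs = encode(["what is dense text retrieval"])
doc_vecs = encode([
    "Dense retrieval represents queries and texts as dense vectors learned by a PLM.",
    "BM25 is a classical term-matching retrieval function.",
])

# Relevance modeling: inner product between query and text vectors,
# then rank texts by score (in practice an ANN index replaces this brute-force step).
scores = query_vecs @ doc_vecs.T
ranking = scores.squeeze(0).argsort(descending=True)
print(scores)
print(ranking)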




Updated: 2024-02-14