An unsupervised perplexity-based method for boilerplate removal
Natural Language Engineering (IF 2.5), Pub Date: 2023-02-21, DOI: 10.1017/s1351324923000049
Marcos Fernández-Pichel, Manuel Prada-Corral, David E. Losada, Juan C. Pichel, Pablo Gamallo

The availability of large web-based corpora has led to significant advances in a wide range of technologies, including massive retrieval systems and deep neural networks. However, leveraging this data is challenging, since web content is plagued by so-called boilerplate: ads, incomplete or noisy text, and remnants of the navigation structure, such as menus or navigation bars. In this work, we present a novel and efficient approach to extracting useful, well-formed content from web-scraped data. Our approach takes advantage of language models and their implicit knowledge of well-formed text, and we demonstrate that perplexity is a valuable signal that improves both effectiveness and efficiency. Removing the noisy parts leads to lighter AI and search solutions that remain effective while substantially reducing the resources spent. We illustrate the usefulness of our method with two downstream tasks, search and classification, and with a cleaning task. We also provide a Python package with pre-trained models and a web demo demonstrating the capabilities of our approach.
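
The abstract does not spell out how perplexity is turned into a filter, so the following is only a minimal sketch of the general idea, not the authors' package or models. It assumes a toy character bigram model with add-one smoothing in place of a pre-trained language model, and the names `CharBigramLM` and `remove_boilerplate`, as well as the threshold value, are hypothetical choices made for illustration. Lines whose perplexity under the model exceeds the threshold are treated as boilerplate and dropped.

```python
import math
from collections import Counter

START = "\x02"  # start-of-text marker (illustrative choice)


class CharBigramLM:
    """Toy character bigram model with add-one smoothing.

    A stand-in for a real language model; assumption, not the paper's model.
    """

    def __init__(self, corpus):
        self.vocab = set("".join(corpus)) | {START}
        self.bigrams = Counter()   # counts of (previous char, char)
        self.contexts = Counter()  # counts of previous char
        for text in corpus:
            prev = START
            for ch in text:
                self.bigrams[(prev, ch)] += 1
                self.contexts[prev] += 1
                prev = ch

    def perplexity(self, text):
        """exp of the average negative log-probability per character."""
        if not text:
            return float("inf")
        v = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        log_prob = 0.0
        prev = START
        for ch in text:
            p = (self.bigrams[(prev, ch)] + 1) / (self.contexts[prev] + v)
            log_prob += math.log(p)
            prev = ch
        return math.exp(-log_prob / len(text))


def remove_boilerplate(lines, lm, threshold):
    """Keep only lines the model finds sufficiently predictable."""
    return [ln for ln in lines if lm.perplexity(ln) <= threshold]


# Tiny demo: the threshold is a hand-picked, hypothetical value.
lm = CharBigramLM(["the cat sat on the mat", "the dog sat on the log"])
page = ["the cat sat on the log", "MENU >> LOGIN | HOME"]
print(remove_boilerplate(page, lm, threshold=13.0))
```

In practice the threshold would need to be tuned on held-out data, and a stronger language model would replace the bigram counts; the filtering step itself stays the same.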




Updated: 2023-02-21