Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions
arXiv - CS - Information Retrieval, Pub Date: 2024-03-22, DOI: arxiv-2403.15279
Max Dallabetta, Conrad Dobberstein, Adrian Breiding, Alan Akbik

This paper introduces Fundus, a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code. Unlike existing news scrapers, we use manually crafted, bespoke content extractors that are specifically tailored to the formatting guidelines of each supported online newspaper. This allows us to optimize our scraping for quality such that retrieved news articles are textually complete and free of HTML artifacts. Further, our framework combines both crawling (retrieving HTML from the web or large web archives) and content extraction into a single pipeline. By providing a unified interface for a predefined collection of newspapers, we aim to make Fundus broadly usable even for non-technical users. This paper gives an overview of the framework, discusses our design choices, and presents a comparative evaluation against other popular news scrapers. Our evaluation shows that Fundus yields significantly higher-quality extractions (complete and artifact-free news articles) than prior work. The framework is available on GitHub at https://github.com/flairNLP/fundus and can be installed with pip.
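
The design described above (one hand-written extractor per newspaper, exposed through a single interface) can be illustrated with a minimal standard-library sketch. This is not the Fundus API: the names `Article`, `EXTRACTORS`, `scrape`, and the publishers `daily_example` and `example_times` with their CSS classes are all hypothetical, and the crawling step is replaced by an in-memory HTML string so that the example runs offline.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, Dict, List


@dataclass
class Article:
    publisher: str
    title: str
    body: str


class _TagTextCollector(HTMLParser):
    """Collects the text of every element with a given tag name and CSS class."""

    def __init__(self, tag: str, css_class: str) -> None:
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0  # > 0 while inside a matching element
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self.depth:
            self.depth += 1  # same tag nested inside a match
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.chunks.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks[-1] += data


def collect(html: str, tag: str, css_class: str) -> List[str]:
    """Return whitespace-normalized text of all matching elements."""
    parser = _TagTextCollector(tag, css_class)
    parser.feed(html)
    parser.close()
    return [" ".join(chunk.split()) for chunk in parser.chunks if chunk.strip()]


# Hypothetical bespoke extractors: each one encodes a single site's layout.
def extract_daily_example(html: str) -> Article:
    title = collect(html, "h1", "headline")
    paragraphs = collect(html, "p", "story-text")
    return Article("daily_example", title[0] if title else "", "\n\n".join(paragraphs))


def extract_example_times(html: str) -> Article:
    title = collect(html, "h2", "article-title")
    paragraphs = collect(html, "div", "para")
    return Article("example_times", title[0] if title else "", "\n\n".join(paragraphs))


# Unified interface: callers name a publisher and never see the layout rules.
EXTRACTORS: Dict[str, Callable[[str], Article]] = {
    "daily_example": extract_daily_example,
    "example_times": extract_example_times,
}


def scrape(publisher: str, html: str) -> Article:
    try:
        extractor = EXTRACTORS[publisher]
    except KeyError:
        raise ValueError(f"unsupported publisher: {publisher!r}") from None
    return extractor(html)


sample = (
    '<h1 class="headline">Budget passes</h1>'
    '<p class="ad">Buy now</p>'
    '<p class="story-text">The council <b>approved</b> the budget.</p>'
    '<p class="story-text">Voting was 7-2.</p>'
)
article = scrape("daily_example", sample)
print(article.title)
print(article.body)
```

Because each extractor targets only the elements its site uses for article text, the advertisement paragraph in the sample is skipped and the inline `<b>` markup leaves no residue in the body. The trade-off of this style is that every supported site needs its own maintained rule set.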

Updated: 2024-03-25