Computer Science > Digital Libraries
[Submitted on 30 Jan 2024]
Title: Reading yesterday's news. Layout recognition by segmentation of historical newspaper pages
Abstract: Newspapers are important sources for historians interested in past societies' cultural values, social structures, and their changes. Since the 19th century, newspapers have been widely available and spread regionally. Today, historical newspapers are digitized but unavailable in a separate metadata-enhanced form. Machine-readable metadata, however, is a prerequisite for a mass statistical analysis of this source. This paper focuses on parsing the complex layout of historic newspaper pages, which today's machines do not understand well. We argue for using neural networks, which require detailed annotated data in large numbers. Our Bonn newspaper dataset consists of 486 pages of the Kölnische Zeitung from the years 1866 and 1924. We propose solving the newspaper-understanding problem by training a U-Net on our new dataset, which delivers satisfactory performance.