Cross-lingual extreme summarization of scholarly documents
International Journal on Digital Libraries Pub Date : 2023-08-10 , DOI: 10.1007/s00799-023-00373-2
Sotaro Takeshita , Tommaso Green , Niklas Friedrich , Kai Eckert , Simone Paolo Ponzetto

The number of scientific publications is rapidly increasing, causing information overload for researchers and making it hard for scholars to keep up to date with current trends and lines of work. Recent work has tried to address this problem by developing methods for automated summarization in the scholarly domain, but has so far concentrated only on monolingual settings, primarily English. In this paper, we consequently explore how state-of-the-art neural abstractive summarization models based on a multilingual encoder–decoder architecture can be used to produce cross-lingual extreme summaries of scholarly texts. To this end, we compile a new abstractive cross-lingual summarization dataset for the scholarly domain in four different languages, which enables us to train and evaluate models that process English papers and generate summaries in German, Italian, Chinese and Japanese. We present our new X-SCITLDR dataset for multilingual summarization and thoroughly benchmark different models based on a state-of-the-art multilingual pre-trained model, including a two-stage pipeline approach that independently summarizes and translates, as well as a direct cross-lingual model. We additionally explore the benefits of intermediate-stage training using English monolingual summarization and machine translation as intermediate tasks, and analyze performance in zero- and few-shot scenarios. Finally, we investigate how to make our approach more efficient through knowledge distillation methods, which shrink the size of our models and thus reduce the computational cost of summarization inference.
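The abstract contrasts two architectures: a two-stage pipeline (summarize in English, then translate) and a direct cross-lingual model that generates the target-language summary in one pass. The sketch below illustrates only this control-flow difference; `summarize_en` and `translate` are hypothetical toy stand-ins for the fine-tuned multilingual encoder–decoder models, not the paper's actual implementation.

```python
def summarize_en(paper: str) -> str:
    # Toy stand-in for an English monolingual summarizer
    # (e.g. a model fine-tuned for extreme/TLDR summarization):
    # here it simply returns the first sentence.
    return paper.split(".")[0].strip()

def translate(text: str, tgt_lang: str) -> str:
    # Toy stand-in for a machine-translation model:
    # tags the text with the target language instead of translating.
    return f"[{tgt_lang}] {text}"

def two_stage_pipeline(paper: str, tgt_lang: str) -> str:
    """Summarize-then-translate: two independently trained models,
    with an intermediate English summary passed between them."""
    return translate(summarize_en(paper), tgt_lang)

def direct_cross_lingual(paper: str, tgt_lang: str) -> str:
    """Single encoder-decoder trained end-to-end to read English input
    and emit the summary directly in the target language."""
    # One model call, no intermediate English summary (toy stand-in).
    return f"[{tgt_lang}] {paper.split('.')[0].strip()}"

paper = "We study cross-lingual extreme summarization. Details follow."
print(two_stage_pipeline(paper, "de"))    # → [de] We study cross-lingual extreme summarization
print(direct_cross_lingual(paper, "de"))  # → [de] We study cross-lingual extreme summarization
```

The practical trade-off the paper benchmarks: the pipeline can reuse strong monolingual summarizers and MT systems but compounds errors across two models, while the direct model avoids the intermediate step at the cost of needing cross-lingual training data such as X-SCITLDR.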

Updated: 2023-08-11