FETILDA: An Evaluation Framework for Effective Representations of Long Financial Documents
ACM Transactions on Knowledge Discovery from Data (IF 3.6), Pub Date: 2024-04-10, DOI: 10.1145/3657299
Bolun (Namir) Xia 1, Vipula Rawte 1, Aparna Gupta 2, Mohammed Zaki 3

The financial sector has accumulated a wealth of unstructured data, including the textual disclosure documents that companies regularly submit to regulatory agencies such as the Securities and Exchange Commission (SEC). These documents are typically very long and tend to contain valuable soft information about a company's performance that is not present in quantitative predictors. It is therefore of great interest to learn predictive models from these long textual documents, especially for forecasting numerical key performance indicators (KPIs). In recent years, natural language processing has made great progress through pre-trained language models (LMs) learned from large corpora of textual data. This raises two important questions: can these LMs be used effectively to produce representations for long documents, and how can we evaluate the effectiveness of the representations produced by different LMs? Our work focuses on answering these questions, namely on evaluating how well various LMs extract useful soft information from long textual documents for prediction tasks. In this paper, we propose and implement a deep learning evaluation framework that combines a sequential chunking approach with an attention mechanism. We perform an extensive set of experiments on a collection of 10-K reports submitted annually by US banks, and on another dataset of reports submitted by US companies, in order to investigate thoroughly the performance of different types of language models. Overall, our framework using LMs outperforms strong baseline methods for textual modeling as well as for numerical regression. Our work provides better insights into how pre-trained domain-specific and fine-tuned long-input LMs can improve the quality of the representation of long textual documents and thereby help improve predictive analyses.
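
To make the idea of sequential chunking combined with attention more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not specify the architecture, so every name here (`chunk_tokens`, `embed_chunk`, `attention_pool`, `represent_document`), the chunk size, and the fixed query vector are hypothetical. In particular, `embed_chunk` is only a deterministic placeholder standing in for a pre-trained LM encoder, and a real system would presumably learn the attention parameters rather than fix them.

```python
import math
import random


def chunk_tokens(tokens, chunk_size):
    """Split a token list into consecutive, non-overlapping chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def embed_chunk(chunk, dim=8):
    """Placeholder encoder (assumption): stands in for a pre-trained LM.

    Each token is mapped to a deterministic pseudo-random vector seeded by
    the token string, and the chunk embedding is the mean of those vectors.
    """
    vec = [0.0] * dim
    for tok in chunk:
        rng = random.Random(tok)  # string seed keeps the mapping reproducible
        for j in range(dim):
            vec[j] += rng.uniform(-1.0, 1.0)
    return [v / len(chunk) for v in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(chunk_vectors, query):
    """Weight chunk vectors by softmax(query . chunk) and sum them."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in chunk_vectors]
    weights = softmax(scores)
    dim = len(query)
    pooled = [
        sum(w * vec[j] for w, vec in zip(weights, chunk_vectors))
        for j in range(dim)
    ]
    return pooled, weights


def represent_document(text, chunk_size=4, dim=8):
    """Chunk a non-empty document, embed each chunk, and attention-pool."""
    tokens = text.split()  # whitespace tokenization, purely illustrative
    chunks = chunk_tokens(tokens, chunk_size)
    vectors = [embed_chunk(c, dim) for c in chunks]
    query = [1.0 / math.sqrt(dim)] * dim  # fixed stand-in for a learned query
    return attention_pool(vectors, query)
```

Calling `represent_document` on a ten-token string with `chunk_size=4` yields three chunks, one attention weight per chunk, and a single fixed-length document vector that a downstream regressor could, in principle, consume to predict a KPI.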




Updated: 2024-04-10