User-generated short-text classification using cograph editing-based network clustering with an application in invoice categorization
Data & Knowledge Engineering (IF 2.5), Pub Date: 2023-10-21, DOI: 10.1016/j.datak.2023.102238
Dewan F. Wahid, Elkafi Hassini

Rapid adoption of online business platforms in every sector creates an enormous amount of user-generated textual data related to product or service descriptions, reviews, marketing, invoicing and bookkeeping. These data are often short, noisy (e.g., misspellings, abbreviations), and lack accurate classification labels (line-item categories). Classifying these user-generated short texts into appropriate line-item categories is crucial for the corresponding platforms to understand users' needs. This paper proposes a framework for user-generated short-text classification based on identified line-item categories. In the line-item identification phase, we use cograph editing (CoE)-based clustering on a keyword network constructed from the user-generated short texts. We also propose integer linear programming (ILP) formulations for CoE on weighted networks and design a heuristic algorithm to identify clusters in large-scale networks. Finally, we outline an application of this framework to categorizing invoices in an empirical setting. The framework shows promising results in identifying invoice line-item categories for large-scale data.
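To make the central objects concrete, the following is a minimal sketch, not the authors' ILP formulation or heuristic. It assumes a keyword network in which two keywords are linked when they co-occur in a short text, with the number of co-occurrences as the edge weight (the abstract does not specify the construction), and it tests whether a graph is a cograph using the standard characterization that every induced subgraph with more than one vertex is disconnected or has a disconnected complement. The function names `keyword_network`, `components` and `is_cograph` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def keyword_network(texts, stopwords=frozenset()):
    """Weighted co-occurrence network (assumed construction).

    Returns a Counter mapping a sorted keyword pair to the number of
    texts in which both keywords appear.
    """
    weights = Counter()
    for text in texts:
        words = sorted({w for w in text.lower().split() if w not in stopwords})
        for u, v in combinations(words, 2):
            weights[(u, v)] += 1
    return weights


def components(nodes, adjacent):
    """Connected components of the graph induced on `nodes`."""
    remaining = set(nodes)
    comps = []
    while remaining:
        start = remaining.pop()
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in list(remaining):
                if adjacent(u, v):
                    remaining.remove(v)
                    comp.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def is_cograph(nodes, edges):
    """True if the unweighted graph has no induced path on four vertices."""
    edge_set = {frozenset(e) for e in edges}

    def adj(u, v):
        return frozenset((u, v)) in edge_set

    def co_adj(u, v):
        return u != v and not adj(u, v)

    def check(vs):
        if len(vs) <= 1:
            return True
        parts = components(vs, adj)
        if len(parts) == 1:
            # Connected: the complement must split, otherwise not a cograph.
            parts = components(vs, co_adj)
            if len(parts) == 1:
                return False
        return all(check(p) for p in parts)

    return check(set(nodes))


# Three toy invoice line items (invented for illustration).
texts = ["logo design", "web design", "web hosting"]
net = keyword_network(texts)
nodes = {w for pair in net for w in pair}
print(is_cograph(nodes, net))  # the path logo-design-web-hosting is not a cograph

# Editing: deleting the design-web edge leaves two keyword clusters.
edited = [e for e in net if e != ("design", "web")]
print(is_cograph(nodes, edited))
```

In this toy case a single edge deletion turns the network into a cograph whose connected components, {logo, design} and {web, hosting}, can be read as candidate line-item categories. Cograph editing asks for the cheapest such set of edge insertions and deletions; on weighted networks the paper addresses this with ILP formulations and a heuristic, neither of which is reproduced here.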




Updated: 2023-10-21