Short text classification using semantically enriched topic model
Journal of Information Science (IF 2.4), Pub Date: 2024-03-21, DOI: 10.1177/01655515241230793
Farid Uddin 1 , Yibo Chen 2 , Zuping Zhang 1 , Xin Huang 2

Modelling short text is challenging because word co-occurrences are sparse and semantic information is limited, which hurts downstream Natural Language Processing (NLP) tasks such as text classification. Gathering information from external sources is expensive and may introduce noise. For efficient short text classification without external knowledge sources, we propose Expressive Short text Classification (EStC). EStC consists of a novel document context-aware, semantically enriched topic model called the Short text Topic Model (StTM), which captures word, topic and document semantics in a joint learning framework. In StTM, the probability of predicting a context word involves the topic distribution of word embeddings and a document vector that serves as the global context. The document vector is obtained on the fly by weighted averaging of word embeddings, simultaneously with the topic distribution of words, so no additional inference method is required for the document embedding. EStC represents documents in an expressive (number of topics × number of word embedding features) embedding space and classifies them with a linear support vector machine (SVM). Experimental results on several publicly available short text data sets show that EStC outperforms many state-of-the-art language models in short text classification.
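
The two representations mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' StTM: the abstract does not specify the weighting scheme or how the topic-by-feature representation is assembled, so the per-word weights, the toy embeddings, the per-word topic distributions, and the function names `document_vector` and `topic_feature_embedding` are all assumptions made for illustration. In the real model these quantities would be learned jointly, not fixed by hand.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def document_vector(tokens: Sequence[str],
                    embeddings: Dict[str, Vector],
                    weights: Dict[str, float]) -> Vector:
    """Weighted average of the embeddings of the words in one document.

    Assumption: words missing from `weights` get weight 1.0, and words
    missing from `embeddings` are skipped.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in embeddings:
            continue
        w = weights.get(tok, 1.0)
        for i, x in enumerate(embeddings[tok]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


def topic_feature_embedding(tokens: Sequence[str],
                            embeddings: Dict[str, Vector],
                            topic_dist: Dict[str, Vector],
                            num_topics: int) -> Vector:
    """Flattened (num_topics x dim) document representation.

    Assumption: row k is the average over words of p(topic k | word)
    times the word embedding. This is one plausible construction, not
    necessarily the one used in the paper.
    """
    dim = len(next(iter(embeddings.values())))
    matrix = [[0.0] * dim for _ in range(num_topics)]
    count = 0
    for tok in tokens:
        if tok not in embeddings or tok not in topic_dist:
            continue
        count += 1
        for k in range(num_topics):
            p = topic_dist[tok][k]
            for i, x in enumerate(embeddings[tok]):
                matrix[k][i] += p * x
    if count:
        matrix = [[x / count for x in row] for row in matrix]
    # Flatten row by row so the result has num_topics * dim features.
    return [x for row in matrix for x in row]


# Hypothetical toy values standing in for learned parameters.
embeddings = {"cheap": [1.0, 0.0], "flight": [0.0, 1.0], "deal": [1.0, 1.0]}
weights = {"cheap": 1.0, "flight": 3.0}
topic_dist = {
    "cheap": [1.0, 0.0, 0.0],
    "flight": [0.0, 1.0, 0.0],
    "deal": [0.5, 0.5, 0.0],
}
tokens = ["cheap", "flight"]

doc_vec = document_vector(tokens, embeddings, weights)
doc_embedding = topic_feature_embedding(tokens, embeddings, topic_dist, num_topics=3)
print(doc_vec)
print(len(doc_embedding))
```

With 3 topics and 2 embedding features, each document becomes a 6-dimensional vector. Vectors of this kind, one per document, are what a linear SVM would then be trained on; the classifier itself is left out of the sketch.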

Updated: 2024-03-21