Automatic Extractive Text Summarization using Multiple Linguistic Features
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 2), Pub Date: 2024-04-08, DOI: 10.1145/3656471
Pooja Gupta 1, 2 , Swati Nigam 1, 2 , Rajiv Singh 1, 2

Automatic text summarization (ATS) uses natural language processing (NLP) to summarize distinct categories of information. Applications of these techniques to low-resource languages such as Hindi remain limited. This study proposes an extractive method for automatically generating summaries of Hindi documents. The approach retrieves pertinent sentences from the source documents by combining multiple linguistic features with machine learning (ML) based on maximum likelihood estimation (MLE) and maximum entropy (ME). The input documents are pre-processed, including Hindi stop-word removal and stemming. From each document we obtain 15 linguistic feature scores, which are used to identify the high-scoring phrases for summary generation. We evaluated the proposed summarizer on BBC News articles, CNN News, DUC 2004, the Hindi Text Short Summarization Corpus, the Indian Language News Text Summarization Corpus, and Wikipedia Articles. The Hindi Text Short Summarization Corpus and the Indian Language News Text Summarization Corpus are in Hindi, whereas the BBC News, CNN News, and DUC 2004 datasets were translated into Hindi using the Google, Microsoft Bing, and Systran translators. Summarization results are reported for both Hindi and English to compare performance on a low-resource and a rich-resource language. Evaluation uses multiple ROUGE metrics along with precision, recall, and F-measure. We compare the proposed method with supervised and unsupervised machine learning methods, including support vector machine (SVM), Naive Bayes (NB), decision tree (DT), latent semantic analysis (LSA), latent Dirichlet allocation (LDA), and K-means clustering, and find that the proposed method outperforms them on multiple ROUGE scores.
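As an illustration of the evaluation measures named in the abstract, the following minimal sketch computes ROUGE-1 (unigram overlap) precision, recall, and F-measure between a candidate summary and a reference summary. It is not the authors' evaluation code: the function name `rouge_1`, the lowercasing step, and the whitespace tokenization are assumptions made for this example, and the abstract does not say which ROUGE variants or tooling were used.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Return ROUGE-1 precision, recall, and F-measure for two texts.

    Assumption: tokens are separated by whitespace. This is a simplification
    that works for space-delimited scripts (Latin, Devanagari) but does not
    strip punctuation or apply stemming.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: a unigram is credited at most as many times as it
    # occurs in both texts (multiset intersection).
    overlap = sum((cand_counts & ref_counts).values())

    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0

    return {"precision": precision, "recall": recall, "f_measure": f_measure}


if __name__ == "__main__":
    # Hypothetical example sentences, not drawn from any of the corpora.
    scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
    print(scores)
```

Precision here is the fraction of candidate unigrams that also appear in the reference, and recall is the fraction of reference unigrams recovered by the candidate. Higher-order variants such as ROUGE-2 follow the same pattern with bigrams in place of unigrams.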




Updated: 2024-04-08