Computational thematics: comparing algorithms for clustering the genres of literary fiction,Humanities & Social Sciences Communications

当前位置： X-MOL 学术 › Humanit. Soc. Sci. Commun. › 论文详情

Our official English website, www.x-mol.net, welcomes your feedback! (Note: you will need to create a separate account there.)

Computational thematics: comparing algorithms for clustering the genres of literary fiction
Humanities & Social Sciences Communications ( IF 2.731 ) Pub Date : 2024-03-20 , DOI: 10.1057/s41599-024-02933-6
Oleg Sobchuk , Artjoms Šeļa

What are the best methods of capturing thematic similarity between literary texts? Knowing the answer to this question would be useful for automatic clustering of book genres, or any other thematic grouping. This paper compares a variety of algorithms for unsupervised learning of thematic similarities between texts, which we call “computational thematics”. These algorithms belong to three steps of analysis: text pre-processing, extraction of text features, and measuring distances between the lists of features. Each of these steps includes a variety of options. We test all the possible combinations of these options. Every combination of algorithms is given a task to cluster a corpus of books belonging to four pre-tagged genres of fiction. This clustering is then validated against the “ground truth” genre labels. Such comparison of algorithms allows us to learn the best and the worst combinations for computational thematic analysis. To illustrate the difference between the best and the worst methods, we then cluster 5000 random novels from the HathiTrust corpus of fiction.

中文翻译：

计算主题：比较文学小说类型聚类算法

捕捉文学文本之间主题相似性的最佳方法是什么？知道这个问题的答案对于图书类型或任何其他主题分组的自动聚类很有用。本文比较了文本之间主题相似性的无监督学习的各种算法，我们称之为“计算主题”。这些算法属于三个分析步骤：文本预处理、文本特征提取以及测量特征列表之间的距离。每个步骤都包含多种选项。我们测试这些选项的所有可能组合。每个算法组合都被赋予一项任务，即对属于四种预先标记的小说类型的书籍语料库进行聚类。然后根据“真实情况”流派标签验证该聚类。这种算法比较使我们能够了解计算主题分析的最佳和最差组合。为了说明最佳方法和最差方法之间的差异，我们从 HathiTrust 小说语料库中随机聚类了 5000 部小说。

更新日期：2024-03-21

点击分享查看原文

点击收藏

公开下载

阅读更多本刊最新论文

全部期刊列表>>