Introduction

Computational literary studies have rapidly grown in prominence in recent years. One of the most successful directions of inquiry within this domain, in terms of both methodological advances and empirical findings, has been computational stylometry, or computational stylistics: a discipline that develops algorithmic techniques for learning stylistic similarities between texts (Bories et al., 2023; Burrows, 1987; Eder et al., 2016). For this purpose, computational stylometrists extract linguistic features specifically associated with authorial style or individual authorial habits. Often, these features are the most frequent words in the analyzed literary texts—which tend to be function words (“a”, “the”, “on”, etc.)—to which various measures of similarity (e.g., Euclidean distance) are applied. The most common goal of computational stylistics is attributing texts whose authorship is disputed, such as Molière’s plays (Cafiero and Camps, 2019), the Nobel Prize-winning novel And Quiet Flows the Don (Iosifyan and Vlasov, 2020), or Shakespeare and Fletcher’s play Henry VIII (Plecháč, 2021). Thanks to numerous systematic comparisons of approaches to computational stylometry, we now have a fairly good idea of which procedures and textual features are most effective—depending on the goal of the stylometric analysis, the language of the texts, or their genre (Evert et al., 2017; Neal et al., 2017; Plecháč et al., 2018).

At the same time, we lack such systematic comparisons in the research area that might be called “computational thematics”: the study of thematic similarities between texts. (Thematic similarities: say, that novels A and B both tell a love story or share a “fantasy” setting.) Why is learning about thematic similarities important? Genre, understood as a group of texts united by broad thematic similarities (e.g., fantasy, romance, or science fiction), is a central concept in literary studies, necessary not only for categorizing and cataloging books but also for the historical scholarship of literature. The definition of genre used in this study is more specific: following a long tradition of evolutionary thinking in the humanities (Fowler, 1971; Moretti, 2005), we understand genres as evolving populations of texts that emerge at certain moments in time, spread across the field of literary production, and then disappear in their original form—usually becoming stepping stones for subsequent genres (Fowler, 1971). For example, the genre of “classical” detective fiction crystallized in the 1890s–1930s and then gave birth to multiple other genres of crime fiction, such as “hardboiled crime fiction”, “police procedural”, “historical detective”, and others (Symons, 1985). This understanding of cultural phenomena as populations consisting of items (in our case, books) that are similar to each other due to common descent (that is, shared influences) is also common in research on cultural evolution (Baraghith, 2020; Houkes, 2012).

Studying the historical dynamics of genres—not only in literature, but also in music or the visual arts—is an important task of art history and sociology, and digital archives allow us to do so on a much larger scale (Allison et al., 2011; Klimek et al., 2019; Sigaki et al., 2018). But to gain the most from this larger scale, we must identify the best, most reliable algorithms for detecting the thematic signal in books, just as computational stylometrists have identified the most effective algorithms for detecting the signal of authorship.

Quantitative analysis of genres usually takes one of three forms. The first is manual tagging of books by genre—often through large-scale crowdsourced efforts, as on the Goodreads website (Thelwall, 2019). This approach is prone to human bias and laborious, and it rests on the idea that differences between genre populations are qualitative rather than quantitative (e.g., a certain book is either a “detective” or a “romance”, or both, but not 0.78 detective and 0.22 romance, which, we think, would be a more informative description). The second approach is an extension of manual tagging: supervised machine learning of book genres using a training dataset with manually tagged genres (Piper et al., 2021; Underwood, 2019). This approach has important strengths: it is easily scalable and it provides quantitative, rather than qualitative, estimates of a book’s belonging to a genre. Still, it has a problem: it can only assign genre tags included in the training dataset, and it cannot find new, unexpected book populations—an important component of the historical study of literature. The third approach is unsupervised clustering of genres: algorithmic detection of book populations based on their similarity to each other (Calvo Tello, 2021; Schöch, 2017). This approach is easily scalable, allows quantitative characterization of book genres, and does not require a training dataset with manually assigned tags, thus allowing the detection of previously unknown book populations. All these features of unsupervised clustering make it highly suitable for historical research, which is why we focus on it in this paper.

Unsupervised clustering can be conducted in a variety of ways. For example, texts can be lemmatized or left unlemmatized; simple word frequencies can be used as text features, or higher-level units such as the topics of a topic model; and a host of distance metrics can be applied to measure the similarity between texts. Hence, the question is: what are the best computational methods for detecting thematic similarities in literary texts? This is the main question of this paper. To answer it, we compare various combinations of (1) pre-processing (which, in this study, we also call “thematic foregrounding”), (2) text features, and (3) the metrics used for measuring the distance between features. To assess the effectiveness of these combinations, we use a tightly controlled corpus of four well-known genres—detective fiction, science fiction, fantasy, and romance—as our “ground truth” dataset. To illustrate the significant difference between the best and the worst combinations of algorithms for genre detection, we later cluster genres in a much larger corpus containing 5000 works of fiction.

Materials and methods

Data: The “ground truth” genres

Systematic research on computational stylistics is common, while research on computational thematics is still rare (Allison et al., 2011; Schöch, 2017; Šeļa et al., 2022; Underwood, 2016; Wilkens, 2016). Why? Computational stylistics has clear “ground truth” data against which various methods of text analysis can be compared: authorship. The methods of text analysis in computational stylistics (e.g., Delta distance or Cosine distance) can be compared as to how well they perform in the task of classifying texts by their authorship. We write “ground truth” in quotes, as authorship is no more than a convenient proxy for stylistic similarity, and, like any proxy, it is imprecise. It assumes that texts written by the same author should be more similar to each other than texts written by different authors. However, we know of many cases in which an author’s writing style evolved significantly over the span of their career or was deliberately manipulated (Brennan et al., 2012). Authorship as a proxy for “ground truth” is a simplification—but a very useful one.

The lack of a widely accepted “ground truth” proxy for thematic analysis leads to comparisons of algorithms based on nothing more than subjective judgment (Egger and Yu, 2022). Such subjective judgment cannot lead us far: we need quantitative metrics of the performance of different algorithms. For this, an imperfect “ground truth” is better than none at all. What could become such an imperfect, but still useful, ground truth in computational thematics? At the moment, the best candidates are genre categories. They capture, to varying degrees, the thematic similarities between texts. To varying degrees, because genres can be organized according to several principles, or “axes of categorization”: e.g., they can be based on the similarity of storylines (adventure novel, crime novel, etc.), settings (historical novel, dystopian novel, etc.), the emotions they evoke in readers (horror novel, humorous novel, etc.), or their target audience (e.g., young adult novels). These various “axes of categorization” do seem to correlate: say, “young adult” novels are appreciated by young adults because they often have particular topics or storylines; or, horror novels usually have a broad, but consistent, arsenal of themes and settings that are efficient at evoking pleasant fear in readers (like the classic haunted house). Still, some axes of genre categorization are probably better for comparing the methods of computational thematics than others. Genres defined by their plots or settings may provide a clearer thematic signal than genres defined by their target audience or evoked emotions.

We have assembled a tightly controlled corpus of four genres (50 texts in each) based on their plots and settings:

  • Detective fiction (recurrent themes: murder, detective, suspects, investigation).

  • Fantasy fiction (recurrent themes: magic, imaginary creatures, quasi-medieval setting).

  • Romance fiction (recurrent themes: affection, erotic scenes, love triangle plot).

  • Science fiction (recurrent themes: space, future, technology).

Our goal was to include books that would be rather uncontroversial members of their genres. Thus, we picked canonical representatives of each genre, winners of genre-based prizes (e.g., the Hugo and Nebula awards for science fiction, the Gold Dagger award for detective fiction), and books with the largest numbers of ratings in the respective genres on the Goodreads website (www.goodreads.com). We took several precautions to remove potential confounds. First, these genres are situated at a similar level of abstraction: we are not comparing coarse-grained categories (say, romance or science fiction) to fine-grained ones (historical romance or cyberpunk science fiction). Second, we limited book publication years to the relatively short period of 1950–1999, to make sure that our analysis is not affected too much by language change (which would inevitably happen if we compared, for example, 19th-century gothic novels to 20th-century science fiction). Third, each genre corpus has a similar number of authors (29–31), each represented by 1–3 texts. Several examples of books in each genre are shown in Table 1. The complete list of books is in the Supplementary Materials. Before starting our analysis, we pre-registered this list on the Open Science Framework website (https://osf.io/rce2w).

Table 1 Examples of books in each genre corpus (full list in Supplementary materials).

Analysis: The race of algorithms

To compare the methods of detecting the thematic signal, we developed a workflow consisting of four steps—see Fig. 1. Like our corpus, all the detailed steps of the workflow were pre-registered.

Fig. 1: Four steps of the analysis.

The workflow includes two loops. The big loop goes through various combinations of thematic foregrounding (Step 1a), feature type (1b), and distance metric (1c). For each such combination, a smaller loop is run: it randomly draws a genre-stratified sample of 120 novels (Step 2), clusters the novels using the Ward algorithm (Step 3), and validates the resulting clusters against the genre labels using the Adjusted Rand Index (Step 4). As a result, each combination receives an ARI score: a measure of its performance in detecting genres.

Step 1. Choosing a combination of thematic foregrounding, features, and distance

As a first step, we choose a combination of (a) the level of thematic foregrounding, (b) the features of analysis, and (c) the measure of distance.

By thematic foregrounding (Step 1a in Fig. 1) we mean the extent to which the thematic aspects of a text are highlighted (and the stylistic aspects backgrounded). With weak thematic foregrounding, only the most basic text pre-processing is done: lemmatizing words and removing the 100 most frequent words (MFWs)—the most obvious carriers of a strong stylistic signal. The 100 MFWs roughly correspond to function words (or closed-class words) in English, routinely used in authorship attribution (Chung and Pennebaker, 2007; Stamatatos, 2009) beginning with the classic study of the Federalist Papers (Mosteller and Wallace, 1963). With medium thematic foregrounding, in addition to lemmatizing, we also remove entities (named entities, proper names, etc.) using the spaCy tagger (https://spacy.io/). Additionally, we perform part-of-speech tagging and remove all words that are not nouns, verbs, adjectives, or adverbs, the most content-bearing parts of speech. With strong thematic foregrounding, in addition to all the steps of medium foregrounding, we also apply lexical simplification. We simplify the vocabulary by replacing less frequent words with their more frequent synonyms—namely, we replace all words outside the 1000 MFWs with their more common semantic neighbors (out of the 10 closest neighbors), with the help of a pre-trained FastText model that includes 2 million words and is trained on English Wikipedia (Grave et al., 2018).
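For illustration, here is a minimal sketch of the medium level of thematic foregrounding in Python, assuming the spaCy pipeline en_core_web_sm; the helper names and settings are illustrative choices, and the exact pipeline used in the study may differ:

```python
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")   # assumed English pipeline with POS tags and NER
nlp.max_length = 2_000_000           # novels can exceed spaCy's default character limit

CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}   # the most content-bearing parts of speech

def build_stoplist(tokenized_corpus, n=100):
    """The n most frequent lemmas across the whole corpus (weak foregrounding removes these)."""
    counts = Counter(tok for doc in tokenized_corpus for tok in doc)
    return {w for w, _ in counts.most_common(n)}

def foreground_medium(text, stoplist):
    """Medium foregrounding: lemmatize, drop named entities, keep content parts of speech."""
    doc = nlp(text)
    lemmas = [tok.lemma_.lower() for tok in doc
              if tok.is_alpha
              and tok.pos_ in CONTENT_POS     # drop function words, numerals, etc.
              and tok.ent_type_ == ""]        # drop tokens that are part of named entities
    return [lem for lem in lemmas if lem not in stoplist]
```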

Then, we transform our pre-processed texts into lists of features (Step 1b in Fig. 1). We vary both the type of features and the length of the lists. We consider four types of features. The simplest features are the most frequent words used in the bag-of-words approach (1000, 5000, or 10,000 of them)—a common choice for thematic analysis in computational literary studies (Hughes et al., 2012; Underwood, 2019). The second type of feature is topic probabilities generated with the Latent Dirichlet Allocation (LDA) algorithm (Blei et al., 2003)—another common choice (Jockers, 2013; Liu et al., 2021). LDA has several parameters that can influence results, such as the predefined k of topics or the number of most frequent words used. Moreover, a long text like a novel is too large for meaningful LDA topic modeling, and the typical solution is to divide the text into smaller chunks; we use an arbitrary chunk size of 1000 words. The third type of feature is modules generated with weighted correlation network analysis, also known as weighted gene co-expression network analysis (WGCNA)—a method of dimensionality reduction that detects clusters (or “modules”) in networks (Langfelder and Horvath, 2008). WGCNA is widely used in genetics (Bailey et al., 2016; Ramírez-González et al., 2018) but has also shown promising results as a tool for topic modeling of fiction (Elliott, 2017). We use it with either 1000 or 5000 most frequent words. Typically, WGCNA is used without chunking the data, but, since chunking leads to better results with LDA, we decided to try WGCNA both with and without chunking, with a chunk size of 1000 words. All the parameters of WGCNA were kept at their defaults. Finally, as the fourth type of feature, we use document-level embeddings, doc2vec (Lau and Baldwin, 2016; Le and Mikolov, 2014), which directly position documents in a latent semantic space defined by a pre-trained distributional language model—FastText (Grave et al., 2018). Document representations in doc2vec depend on the features of the underlying model: in our study, each document is embedded in the 300 dimensions of the original model. Doc2vec and similar word-embedding methods are increasingly used for assessing the similarity of documents (Dynomant et al., 2019; Kim et al., 2019; Pranjic et al., 2020). As a result of Step 1b, we obtain a document-term matrix formed of texts (rows) and features (columns).
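As an illustration of the LDA feature step, a minimal sketch with gensim, assuming texts are already pre-processed into token lists; the chunk size, k of topics, and n of MFWs are the parameters varied in the study, while the training settings shown here are illustrative assumptions rather than the exact configuration used:

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def chunk(tokens, size=1000):
    """Split a long novel into consecutive chunks of roughly `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def lda_features(tokenized_novels, k=100, n_mfw=5000, chunk_size=1000):
    """Train LDA on 1000-word chunks and represent each novel as its mean topic distribution."""
    chunks, owners = [], []
    for i, tokens in enumerate(tokenized_novels):
        for ch in chunk(tokens, chunk_size):
            chunks.append(ch)
            owners.append(i)

    dictionary = Dictionary(chunks)
    dictionary.filter_extremes(no_below=1, no_above=1.0, keep_n=n_mfw)  # top n_mfw words only
    bows = [dictionary.doc2bow(ch) for ch in chunks]
    lda = LdaModel(bows, num_topics=k, id2word=dictionary, passes=5, random_state=0)

    doc_topics = np.zeros((len(tokenized_novels), k))
    n_chunks = np.zeros(len(tokenized_novels))
    for owner, bow in zip(owners, bows):
        for topic, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            doc_topics[owner, topic] += prob
        n_chunks[owner] += 1
    return doc_topics / n_chunks[:, None]   # each row is a probability distribution over k topics
```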

Finally, we must measure the similarity between the texts represented by the chosen lists of features, using some metric of distance (Step 1c in Fig. 1). A variety of metrics exist for this purpose: Euclidean, Manhattan, Delta, Cosine, and Cosine Delta distances, as well as Jensen–Shannon divergence (symmetrized Kullback–Leibler divergence) for features that are probability distributions (in our case, LDA topics and bag-of-words features).
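A minimal sketch of Step 1c with scipy, assuming a documents-by-features matrix X whose rows are probability distributions (topic mixtures or relative word frequencies); the Delta variant shown is Manhattan distance on z-scored frequencies, which matches Burrows' Delta up to a constant factor:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform, jensenshannon

def distance_matrix(X, metric="jensenshannon"):
    """Pairwise distances between rows of X (documents x features)."""
    if metric == "jensenshannon":
        # rows must sum to 1; scipy's jensenshannon returns the square root of the divergence
        D = pdist(X, metric=lambda p, q: jensenshannon(p, q) ** 2)
    elif metric == "delta":
        # Burrows' Delta: Manhattan distance on column-wise z-scored frequencies
        Z = (X - X.mean(axis=0)) / X.std(axis=0)
        D = pdist(Z, metric="cityblock")
    elif metric == "manhattan":
        D = pdist(X, metric="cityblock")
    elif metric == "cosine":
        D = pdist(X, metric="cosine")
    else:
        D = pdist(X, metric="euclidean")
    return squareform(D)    # square n_docs x n_docs matrix
```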

Variants of Steps 1a, 1b, and 1c can be assembled into numerous combinations. In our “race of algorithms”, each combination is a competitor—and a potential winner. Say, we could choose a combination of weak thematic foregrounding, LDA topics with 50 topics on the 5000 most frequent words, and Euclidean distance. Or medium thematic foregrounding, a simple bag-of-words with the 10,000 most frequent words, and Jensen–Shannon divergence. Some of these combinations are researchers’ favorites, while others are underdogs, used rarely or not at all. Our goal is to map out the space of possible combinations—to empirically test how each combination performs in the task of detecting the thematic signal. In total, there are 291 competing combinations.

Step 2. Sampling for robust results

A potential problem with our experiment is that some combinations might perform better or worse simply because they are more suitable to our specific corpus—for whatever reason. To reduce the impact of individual novels in our corpus, we use cross-validation: instead of analyzing the corpus as a whole, we analyze smaller samples from it multiple times. Each sample contains 120 novels: 30 books from each genre. Altogether, we perform the analysis for each combination on 100 samples. For each sample, all the models that require training—LDA, WGCNA, and doc2vec—are trained anew.
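A minimal sketch of the genre-stratified resampling, assuming the corpus metadata is stored in a pandas DataFrame with a genre column (the column and function names are illustrative):

```python
import pandas as pd

def stratified_samples(corpus: pd.DataFrame, n_samples=100, per_genre=30):
    """Yield n_samples random subsets of the corpus, with per_genre novels from each genre."""
    for seed in range(n_samples):
        sample = (corpus.groupby("genre", group_keys=False)
                        .apply(lambda g: g.sample(n=per_genre, random_state=seed)))
        yield sample.reset_index(drop=True)   # 4 genres x 30 novels = 120 novels per sample
```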

Step 3. Clustering

As a result of Step 2, we obtain a matrix of text distances. Then, we need to cluster the texts into groups—our automatically generated genre clusters, which we will later compare to the “true” clusters. For this, we could have used a variety of algorithms (e.g., k-means). We use hierarchical clustering with Ward’s linkage (Ward, 1963), which at each step merges the two clusters whose union produces the smallest increase in within-cluster variance. Although Ward’s algorithm was originally defined only for Euclidean distances, it has been shown empirically to outperform other linkage strategies in text-clustering tasks (Ochab et al., 2019). We assume that novels from the four defined genres should roughly form four distinct clusters (as the similarity of texts within a genre is greater than the similarity of texts across genres). To obtain the groupings, we cut the resulting tree at the level that yields the assumed number of clusters (four).
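A minimal sketch of Step 3 with scipy, assuming a precomputed square distance matrix D from Step 1c; note that applying Ward linkage to non-Euclidean distances is, strictly speaking, a heuristic:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ward_clusters(D, n_clusters=4):
    """Agglomerative clustering with Ward's linkage, cut into n_clusters groups."""
    condensed = squareform(D, checks=False)      # scipy expects a condensed distance vector
    tree = linkage(condensed, method="ward")     # heuristic when distances are non-Euclidean
    return fcluster(tree, t=n_clusters, criterion="maxclust")   # cluster label per novel
```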

Step 4. Cluster validation

How similar are our generated clusters to the “true” genre populations? To learn this, we compare the clusters generated by each chosen combination to the original genre labels. For this, we use a measure of cluster validation called the adjusted Rand index (ARI) (Hubert and Arabie, 1985). The ARI score of a particular combination shows how well this combination performs in the task of detecting genres—and thus, in picking the thematic signal. Steps 1–4 are performed for every combination so that every combination receives its ARI score. At the end of the analysis, we obtained a dataset of 29,100 rows (291 combinations, each tested on 100 random samples).
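A minimal sketch of Step 4 with scikit-learn, using toy labels for illustration:

```python
from sklearn.metrics import adjusted_rand_score

# Toy example: six novels, their "true" genres and the clusters assigned in Step 3
true_genres = ["detective", "detective", "fantasy", "fantasy", "romance", "romance"]
predicted = [1, 1, 2, 2, 2, 3]   # cluster ids; label names need not match

# ARI is 1 for a perfect match, close to 0 for chance-level agreement,
# and invariant to permutations of cluster labels.
print(adjusted_rand_score(true_genres, predicted))
```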

Results

Figure 2 shows the average performance of all the combinations of thematic foregrounding, features, and distance metrics. Our first observation: the average ARI of the best-performing algorithms ranges from 0.66 to 0.7, which is rather high for data as complicated and noisy as literary fiction. This gives additional support to the idea that unsupervised clustering of fiction genres is possible. Even a cursory look at the 10 best-performing combinations reveals several trends. First, none of the top combinations use weak thematic foregrounding. Second, the features in 6 out of the 10 best-performing combinations are LDA topics. Third, the distance metric in 8 out of the 10 is Jensen–Shannon divergence.

Fig. 2: Raw distributions of ARI scores for all the combinations of thematic foregrounding, feature type, and distance metric.

Boxplots are colored by feature type. Numbers on the horizontal axis correspond to the names of the combinations in the table to the right, which shows the 10 best-performing combinations (all combinations are listed in the Supplement, Table S7).

But how generalizable are these initial observations? How shall we learn the average “goodness” of a particular kind of thematic foregrounding, feature type, or distance metric? To learn this, we need to control for their influence on each other, as well as for additional parameters, such as the number of most frequent words and chunking. Hence, we have constructed five Bayesian linear regression models (see Supplement 5.1). They answer questions about the performance of various combinations of thematic foregrounding, features, and distances, helping us reach conclusions about the performance of individual steps of thematic analysis. All the results of this study are described in detail in Supplement 5.1. Below, we focus only on key findings.
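One possible way to fit such a model in Python is the bambi interface to PyMC; this is an illustrative sketch only (the file name and the exact set of predictors are assumptions), as the actual models are specified in Supplement 5.1:

```python
import bambi as bmb
import pandas as pd

# One row per run: the ARI score plus the combination that produced it
results = pd.read_csv("ari_scores.csv")   # hypothetical file with 29,100 rows

# ARI modeled as a function of foregrounding level and feature type, with their interaction
model = bmb.Model("ari ~ foregrounding * feature_type", results)
idata = model.fit(draws=2000, chains=4)   # posterior samples via MCMC
```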

Conclusion 1. Thematic foregrounding improves genre clustering

The goal of thematic foregrounding was to highlight the contentful parts of texts and to background the stylistic parts. So, does stronger thematic foregrounding improve genre clustering? As expected, we found that weak thematic foregrounding shows the worst performance across all four feature types (see Fig. 3). For LDA and bag-of-words features, it leads to drastically worse performance. At the same time, we do not see a large difference between the medium and strong levels of thematic foregrounding. The main addition at the strong level is lexical simplification; however, lexical simplification has not led to a noticeable improvement in genre recognition. The gains of strong thematic foregrounding for document embeddings, LDA, and bag-of-words features are marginal and inconsistent.

Fig. 3

The effect of thematic foregrounding (weak, medium, or strong) on clustering genres, stratified by feature type.

Conclusion 2. Various feature types show similarly good performance

Does the choice of feature type matter for the performance of genre clustering? We have found that almost all feature types can perform well. As shown in Fig. 2, three out of four feature types—doc2vec, LDA, and bags of words—when used in certain combinations, can lead to almost equally good results. But how good are they on average? Fig. 4 shows the posterior distributions of ARI for each type of feature—in each case, for a high level of thematic foregrounding.

Fig. 4

Posterior distributions of ARI scores for four feature types, at a high level of thematic foregrounding.

As we see, doc2vec shows the best average performance, but this study has not experimented extensively with the other parameters of this feature type: it might be that a different number of dimensions (e.g., 100 instead of 300) would worsen its performance. More research is needed to better understand the performance of doc2vec. LDA is the second-best approach—and, interestingly, varying its parameters (such as the k of topics or the n of MFWs) does not increase its variance compared to doc2vec. The bag-of-words approach, despite being the simplest, proves surprisingly good. It does not demonstrate the best performance, but it is not far behind LDA. At the same time, bags of words have a powerful advantage: simplicity. They are easier to use and require fewer computational resources, meaning that in many cases they can still be a suitable choice for thematic analysis. Finally, WGCNA shows the worst ARI scores on average.

Conclusion 3. The performance of LDA does not seem to depend on k of topics and n of most frequent words

LDA modeling depends on parameters, namely the k of topics and the n of most frequent words, which must be decided, somewhat arbitrarily, before modeling. There exist algorithms for estimating a “good” number of topics, which help assess how many topics are “too few” and how many are “too many” (Sbalchiero and Eder, 2020). In our study, however, we find no meaningful influence of either of these choices on learning the thematic signal (Fig. 5). The single factor with a massive influence on the effectiveness of thematic classification is thematic foregrounding. Weak thematic foregrounding (in our case, only lemmatizing words and removing the 100 most frequent ones) proves to be a terrible choice that noticeably reduces ARI scores. Our study points towards the need for further systematic comparisons of approaches to thematic foregrounding, as it plays a key role in the solid performance of LDA.

Fig. 5: Posterior probabilities of the effects of k of topics on ARI, stratified by the level of thematic foregrounding and n of most frequent words used in LDA.

Error bars show 95% credible intervals.

Conclusion 4. Bag-of-words approach requires a balance of thematic foregrounding and n of most frequent words

Bags of words are the simplest type of feature in thematic analysis but, as we have demonstrated, still an effective one. How does one maximize the chances that bags of words perform well? We varied two parameters in the bag-of-words approach: the level of thematic foregrounding and the number of MFWs used. Figure 6 illustrates our findings: both parameters influence performance. Using 5000 instead of 1000 MFWs drastically improves ARI scores. Similarly, using medium instead of weak thematic foregrounding makes a big difference. At the same time, pushing these two parameters further—using 10,000 MFWs and strong thematic foregrounding—brings only marginal, if any, improvement in ARI scores.
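A minimal sketch of building such a bag-of-words matrix with scikit-learn, restricted to the n most frequent words and normalized to relative frequencies (which also makes the rows usable with Jensen–Shannon divergence); the function name is illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def bag_of_words(preprocessed_texts, n_mfw=5000):
    """Document-term matrix over the n_mfw most frequent words, as relative frequencies."""
    vectorizer = CountVectorizer(max_features=n_mfw)   # keeps the n_mfw most frequent terms
    counts = vectorizer.fit_transform(preprocessed_texts).toarray().astype(float)
    return counts / counts.sum(axis=1, keepdims=True)  # each row sums to 1

# preprocessed_texts are strings of space-joined lemmas produced by the foregrounding step
```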

Fig. 6: The influence of the number of most frequent words, used as text features, on learning the thematic signal, measured with ARI.

There is a positive relationship between the n of words and ARI, as well as between the level of thematic foregrounding and ARI. However, the middle parameter values of both (5000 MFWs and medium foregrounding) should be enough for most analyses.

Conclusion 5. Jensen–Shannon divergence is the best distance metric for genre recognition, Euclidean—the worst

Choosing the right distance metric is crucial for genre clustering. Figure 7 shows the performance of various distances for each feature type (note that Jensen–Shannon divergence, which is defined for probability distributions, could not be applied to doc2vec dimensions or WGCNA module weights). For LDA and bag-of-words features, Jensen–Shannon divergence is the best distance, with Delta and Manhattan distances also highly suitable. For doc2vec, the choice of distance matters less. Interestingly, Euclidean distance is the worst-performing distance for LDA, bag-of-words, and WGCNA. This is good to know, as this distance is often used in text analysis, including in combination with LDA (Jockers, 2013; Schöch, 2017; Underwood et al., 2022), while our study suggests that it should be avoided in computational thematic analysis. Cosine distance is known to be useful for authorship attribution when combined with bag-of-words features. At the same time, cosine distance is sometimes used to measure distances between LDA topic probabilities, and our study shows that this is not the best combination.

Fig. 7: The influence of distance metrics on ARI scores, separately for each feature type.

Note that Jensen–Shannon divergence could not be combined with WGCNA and doc2vec.

Comparison of algorithms on a larger dataset

How well does this advice apply to clustering other corpora, beyond our corpus of 200 novels? A common problem in statistics and machine learning is overfitting: tailoring one’s methods to a particular “sandbox” dataset without making sure that these methods work “in the wild”. In our case, this means: would the same combinations of methods work well (or poorly) on genres and books other than those included in our analysis? One precaution we took against overfitting was sampling from our genre corpus: instead of analyzing the full corpus just once, we analyzed smaller samples from it. Additionally, it is useful to compare the best-performing and the worst-performing methods on a much larger corpus of texts.

For this purpose, we use a sample of 5000 books from the NovelTM dataset of fiction, built from the HathiTrust corpus (Underwood et al., 2020). Unlike our small corpus of four genres, these books do not have reliable genre tags, so we could not simply repeat our study on this corpus. Instead, we decided to inspect how a larger sample of our four genres (detective, fantasy, science fiction, and romance) would cluster within the HathiTrust corpus. For this, we included all the books in these four genres that we could easily identify and seeded them into a random sample of 5000 works of fiction. Then we clustered all these books using one of the best combinations of methods for identifying genres (medium thematic foregrounding, LDA on 1000 words with 100 topics, clustered with Delta distance). The result, visualized with a UMAP projection (McInnes et al., 2018), is shown in Fig. 8. This is just an illustration of a possible first step toward testing various algorithms of computational thematics “in the wild”. The assessment of the accuracy of clustering on this larger corpus, and its comparison with clustering based on one of the worst-performing combinations of methods, are given in the Supplement (section 5.2).
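A minimal sketch of this visualization step with umap-learn, assuming a precomputed pairwise distance matrix for the seeded corpus (the function and argument names are illustrative):

```python
import umap
import matplotlib.pyplot as plt

def plot_umap(D, labels, colors):
    """Project a precomputed distance matrix to 2-D with UMAP and color points by genre.

    D      : square pairwise distance matrix (n_books x n_books)
    labels : genre tag per book, e.g. "detective", or "unknown" for the random NovelTM sample
    colors : dict mapping each label to a plotting color
    """
    coords = umap.UMAP(metric="precomputed", random_state=0).fit_transform(D)
    plt.scatter(coords[:, 0], coords[:, 1], s=3, c=[colors[lab] for lab in labels])
    plt.axis("off")
    plt.show()
```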

Fig. 8: A UMAP projection of 5000 novels randomly sampled from the NovelTM HathiTrust corpus, together with all novels found in NovelTM by the authors included in the original corpus of four genres.

The figure is based on one of the best-performing combinations.

Discussion

This study aimed to answer the question: how good are various techniques of learning thematic similarities between works of fiction? In particular, how good are they at detecting genres—and are they good at all? For this, we tested various techniques of text mining belonging to three consecutive steps of analysis: pre-processing, extraction of features, and measuring distances between the lists of features. We used four common genres of fiction as our “ground truth” data, assembled into a tightly controlled corpus of books. Our main finding is that unsupervised learning can be effectively used for detecting thematic similarities, but algorithms differ in their performance. Interestingly, the algorithms that are good for computational stylometry (and its most common task, authorship attribution) are not the same as those good for computational thematics. To give an example, one common approach to authorship attribution—using limited pre-processing, a small number of most frequent words as features, and cosine distance—is one of the least accurate approaches for learning thematic similarities. How important are these differences in a real-world scenario, not limited to our small sample of books? To test this, we contrasted one of the worst-performing combinations of algorithms with one of the best-performing combinations, using a large sample of the HathiTrust corpus of books.

Systematic comparisons between various algorithms for computational thematic analysis will be key to a better understanding of which approaches work and which do not—a requirement for assuring reliable results in the growing research area which we suggest calling “computational thematics”. Using a reliable set of algorithms for thematic analysis would allow tackling several large problems that remain unsolved in the large-scale analysis of books. One such problem is creating better genre tags for systematizing large libraries of digitized texts. Manual genre tags in corpora such as HathiTrust are often missing or highly inconsistent, which has led to attempts to use supervised machine learning, trained on manually tagged texts, to automatically learn the genres of the books in the corpus overall. However, this approach, by design, can capture only the genres we already know about, not the “latent” genres we do not yet know exist. Unsupervised thematic analysis can be used for this task. Another important problem that unsupervised approaches to computational thematics may help to solve is the historical analysis of literary evolution. So far, we lack a comprehensive “map” of literary influences based on the similarity of books. Such a map would allow the creation of a computational model of literary macroevolution, similar to the phylogenetic trees (Bouckaert et al., 2012; Tehrani, 2013) or rooted phylogenetic networks (Neureiter et al., 2022; Youngblood et al., 2021) used in cultural-evolution research on languages, music, or technologies. Having reliable unsupervised algorithms for measuring thematic similarities would be crucial for any historical models of this sort. Also, measuring thematic similarities may prove useful for creating book recommendation systems. Currently, book recommendation algorithms are mostly based on the analysis of user behavior: ratings or other forms of interaction (Duchen, 2022). Such methods are highly effective when user-generated data is abundant, as with songs or short videos. However, for longer content types, which take more time to consume, the amount of user-generated data is much smaller. Improving the tools for content-based similarity detection in books would allow recommending books based on their content—as is already happening with songs: projects such as Spotify’s Every Noise at Once (https://everynoise.com/) combine user behavior data with the acoustic features of the songs themselves to learn the similarity between songs and recommend them to listeners.

This study is a preliminary attempt at systematizing various approaches to computational thematics. More work is needed to further test the findings of this paper and to overcome its limitations. First, any clustering approach, be it a dendrogram or a projection, is a simplification of the relationships described by a distance matrix, which, in turn, is based on aggregating information from imperfect proxies (e.g., words and topics). This poses two major limitations on the unsupervised analysis of fiction: (1) the limited ability to generalize beyond an observed clustering that arises from a very specific set of choices and parameters, and (2) the lack of control over textual features and the representation of novels. Clustering limitations are often addressed by using bootstrap approaches (Eder, 2017) or meta-analysis of clusters (“clustering of clusters”) for the detection of stable, previously unidentifiable subgroups (Calvo Tello, 2021); in recent years, there have also been major advances in Bayesian clustering methods that provide non-parametric solutions to data partitioning and allow uncertainty to be associated with the membership of data points in clusters, or with the number of clusters in a dataset (Wade and Ghahramani, 2018). These appear particularly suitable for cultural data, yet they come with their own challenges in presentation and reporting.

Another limitation of our study is its inability to distinguish between thematic and formal elements of texts. A narrative can be told from a first-person perspective or from the standpoint of an omniscient narrator; events in a story can follow chronological order or may be re-ordered, with flashbacks, flashforwards, and various other temporal patterns. Formal structures like these are not really “thematic” and require separate examination. Ideally, one would extract them as a separate set of textual features and use them in the analysis alongside themes. Unfortunately, we only have a limited set of tools for finding such formal patterns—though novel computational methods offer a promise of capturing some of them (Langlais, 2023; Piper and Toubia, 2023). Also, one may argue that the methods of feature extraction used in this paper capture not only topics as such but also some of the formal variation in texts, even though it is not clear where the line between forms and themes should be drawn. For example, the narrative perspective is partly captured by the distribution of pronouns and verb forms, pushing first-person and third-person books to form different groups.

Yet another limitation is the concept of “ground truth” genres. It may be noted—rightly—that there are no “true” genres and that genre tags overall may not be the best approach for testing thematic similarities. As a further step, we see the use of large-scale user-generated tags from Goodreads and similar websites as a proxy for “ground truth” similarity.

Finally, this study has certainly not exhausted all the possible techniques of text analysis that can be used for computational thematics. For example, much wider testing of vector models, such as doc2vec, but also BERTopic (Grootendorst, 2022) or Top2Vec (Angelov, 2020), is an obvious next step, as is testing other network-based methods for community detection (Gerlach et al., 2018). Text simplification may have large potential for thematic analysis and should be explored further. Possibly, the most straightforward way to test our findings would be to attempt to replicate them on other genre corpora containing more books or other genres. Testing these methods on books in other languages is also critical. The approach taken in this paper offers a simple analytical pipeline—and we encourage other researchers to use it to test other computational approaches. Such a communal effort will be key to assuring robust results in the area of computational thematics.