1 Introduction

Massive Open Online Course (MOOC) platforms continue to grow dramatically; in recent years, Coursera, edX and FutureLearn have emerged as popular platforms (Joseph 2020). They support a wide variety of efficiently delivered courses with easy access, which open numerous learning opportunities to anyone wanting to learn a specific topic or obtain new information. As the popularity of MOOCs continues to grow, ever more people with a broad diversity of background knowledge and goals are enrolling on these platforms from around the world. For over a decade now, many research communities have contributed to the development of these platforms and proposed specific solutions to the challenges and barriers they face, such as learner engagement (Anderson et al. 2014), learner motivation (Durksen et al. 2016) and learner performance (Jiang et al. 2014). One way to lift some of these barriers is via accurate and on-time instructor intervention on MOOC discussion forums (Alrajhi et al. 2021).

Discussion forums enable online learners to express their ideas, ask questions, and seek help (Crossley et al. 2015). In addition, they create social connections and facilitate communication among learners and instructors (Stump et al. 2013). In this context, instructors play an important role in monitoring comments, especially to provide the help and support learners need: an accurate on-time intervention may make the difference between a learner continuing on the course and dropping out; indeed, (Alrajhi et al. 2021) showed that learners are less likely to finish the course (about 13%) if they frequently make comments that require intervention. However, continuously monitoring such huge numbers of comments is a time-consuming and sometimes overwhelming task for instructors: hundreds or even thousands of comments are posted during courses, sometimes for each course step, and identifying those which require urgent intervention can be almost impossible. This is exacerbated by the high ratio of learners to instructors (Almatrafi and Johri 2018).

As MOOCs generate huge amounts of textual data, another way of addressing this issue is via Natural Language Processing (NLP). The work presented in this paper uses NLP to help instructors to address urgent comments and enable them to decide when to react, by creating an automatic text-classification model.

Another core problem in this area is the intrinsically imbalanced nature of the data; such datasets are characterised by a highly skewed class distribution due to the (naturally) small number of ‘urgent comment’ instances. In text classification tasks, performance often depends on the quality of the data (Wei and Zou 2019). Therefore, to tackle the imbalanced data problem and improve the size and quality of the training data, we manipulate the dataset in three ways: text augmentation; text augmentation combined with undersampling; and undersampling alone.

To illustrate the usage of the fine-grained learner models in adaptive support for instructor intervention, we describe an adaptation case where instructors can decrease their workload by using one of our models. We also showcase an expanded model that uses more extensive learner knowledge (based on the number of comments per learner), to discuss how such adaptation models can be further expanded.

1.1 Contributions

The main contributions of our research are, to the best of our knowledge:

  • Creating the first learner, instructor and adaptation models to support instructors to deal with urgent comments in MOOCs.

  • For the first time in the literature, applying data balancing techniques for shallow and deep machine learning to identify instances when urgent instructor intervention is required on MOOCs. These techniques include text augmentation, text augmentation with undersampling, and undersampling to overcome the imbalanced data problem and improve performance. This is achieved by ‘forcing’ the algorithm to increase the weight of the minority class.

  • Creating the first gold standard corpus MOOC Urgent iNstructor InTErvention (UNITE) for instructor intervention in MOOC environments (the FutureLearn platform), which has been annotated by carefully selected experts in the field. This will be made available (after ethical cleansing) to the research community.

  • Proposing several new pipelines (3X and 9X) to generate more data for text augmentation by incorporating different NLP augmenters and providing a range of approaches.

  • Showcasing the challenges and difficulties involved in instructor-intervention decisions in MOOC environments, by manually inspecting and analysing the (relatively small) set of errors generated by the best classifier, along with the best data balancing and text augmentation solutions.

2 Literature review

Today, the instructor intervention problem is one of the most challenging in MOOC environments. Separately, a related, even less explored area of research has emerged, identifying the difficult area of urgent post detection (Almatrafi et al. 2018; Guo et al. 2019; Alrajhi et al. 2020; Khodeir 2021). However, an obvious omission is that, for urgent posts, imbalance is a characteristic of the data itself (as there are normally fewer urgent comments than non-urgent ones). This fact has been overlooked in urgent post detection. The closest research to this (Almatrafi et al. 2018; Khodeir 2021) considered some standard techniques: splitting the data, training the model, and selecting the evaluation metrics, but did not address the data imbalance itself. In addition, while available intervention models for urgent comments concentrated on classifying posts, they paid no attention to the behaviours of learners, nor did they design adaptive instructor intervention models based on learner (or instructor) models. Therefore, this section reviews the literature areas closest to our proposal: (1) the important area of instructor intervention in MOOCs, focusing on urgent posts, (2) the area of text augmentation, specifically for balancing data, and (3) adaptive models in MOOCs.

2.1 Instructor intervention in MOOC forums

In 2012, the first MOOC-like discussion forums were developed and immediately aroused researchers’ interest. According to (Almatrafi and Johri 2018), 234 researchers inspected discussion forums from 2013 to 2018. It was not until 2014 that (Chaturvedi et al. 2014) first investigated the intervention problem, by building numerous models to predict which forum discussion thread instructors should intervene on. They utilised course information, forum structure, and post content; importantly, they also considered information on whether the next post to be written was by an instructor, thus capturing characteristics of real instructor behaviour. Similarly, (Chandrasekaran et al. 2015b) built a classifier that considered prior knowledge of the forum type. These researchers used the Coursera platform and trained on historical instructor interventions. This approach, we argue, is inadequate, since (historical) instructor intervention likely resulted from a subjective decision to offer support. Moreover, it is arguably based on decisions on a subset of posts, because instructors may not have had sufficient time to read all the posts related to a particular course, to decide which were urgent (Chandrasekaran et al. 2015b).

The first research to use the Stanford MOOC Post dataset (Bakharia 2016) proposed a generalizable transfer-learning-based model to identify urgency as one of three forum-post classifications (confusion, urgency, and sentiment), by applying a cross-domain approach. Whilst the model failed to obtain adequate results, the author recommended transfer-learning as worthy of further research. Wei et al. (2017) followed the same cross-domain technique but applied a deep neural network element; this increased performance.

Almatrafi et al. (2018) also utilised the Stanford MOOC Post dataset to classify urgent posts, by training different shallow classifiers and proposing the best features for them. Sun et al. (2019) instead used deep learning, via improved recurrent convolutional neural networks (RCNN), achieving higher performance in identifying urgent posts compared to other models (naïve Bayes, SVM (RBF), random forest, CNN, RNN, LSTM, GRU, and RCNN). Another work by (Guo et al. 2019) proposed a hybrid neural network based on the attention mechanism to recognise urgent posts. With a similar goal, (Alrajhi et al. 2020) used a multidimensional model to determine urgent posts requiring intervention, comparing two different models: (i) text-only, and (ii) text and numerical data. The findings highlight that the combined, multi-dimensional-features model is more effective than the text-only (NLP) analysis. Clavié and Gal (2019) created EduBERT, a contextualised word-embedding technique, which represents the current state-of-the-art performance on classifying urgent posts using EduDistilBERT (0.835 recall for the minority class). Khodeir (2021) built an urgency classification model based on fine-tuned BERT as an embedding layer, feeding into a multi-layer bi-directional GRU, and reported recall values of 0.815, 0.847 and 0.831 for three groups, which is close to the state of the art.

Another work that used the Stanford MOOC Post dataset for the intervention task (Capuano and Caballé 2019) proposed a text categorisation tool for a multi-attribute categorisation of MOOC forum posts; one of these attributes is the level of urgency, with preliminary results intended for use in intervention, or as input for conversational agents (chatbots). Their follow-up study (Capuano et al. 2021) improves this tool, using attention-based hierarchical recurrent neural networks. However, their work classified urgency into three categories (low, medium and high) and reported an average recall (R), instead of the per-class R. In addition, (Rossi et al. 2021) detected which type of pedagogical intervention is required, based on a conversational agent, using an ontology and a set of semantic rules.

Another study (Toti et al. 2020) built on the approach in (Capuano and Caballé 2019), creating a methodology to detect engagement in e-learning platforms and to help instructors with the timeliness of their interventions, based on different aspects, one of these being urgency, detected as a classification task; however, their work lacked an implementation.

The vast majority of recently published research on urgent post classification uses the Stanford MOOC Post dataset as the data source. However, even though this dataset is an excellent resource, it still represents just one platform; hence, research has to expand to others, to represent the current wide range of real-life MOOC environments, as different platforms have different structures and (acceptable) numbers of words per post. To address this research gap and investigate other data sources, the present paper provides an analysis of the FutureLearn platform (which required additional effort to complete the manual annotation). However, in common with the Stanford MOOC Post dataset, our dataset also suffers from similar disadvantages: identifying when instructor intervention is required from the massive number of posts in MOOC discussion forums is challenging for classifiers, due to the extremely limited number of urgent cases, which causes a highly imbalanced dataset. Moreover, as explained, correctly identifying the minority class (urgent comments) is the most important task. To date, and to the best of the researchers’ knowledge, no research has targeted the problem of dealing with highly imbalanced data in the context of intervention in MOOCs.

2.2 Text augmentation

The other branch of prior research relevant to our paper is the use of text augmentation in NLP. The aim of text augmentation is to expand data (Liu et al. 2020), by providing and applying a set of techniques that create synthetic data from an existing dataset (Shorten et al. 2021). Text augmentation can enhance the performance of model predictions on a number of NLP tasks and helps prevent overfitting (Li et al. 2022). It is used to alleviate the issue of limited or scarce labelled training data (Anaby-Tavor et al. 2020), which leads to low accuracy and recall for the minority class (Liu et al. 2020).

The existing literature shows that previous researchers utilised NLP augmentation approaches; for example, (Wang and Yang 2015) applied text augmentation by performing synonym replacement and identified similar words based on lexical and semantic embedding. Another study by (Kobayashi 2018) proposed a new word-based approach for text augmentation based on contextual augmentation; they applied synonym replacement, by using a bi-directional predictive language model. Next, (Wei and Zou 2019) explored straightforward text editing techniques for augmentation, using one of four simple techniques (synonym replacement, random insertion, random swap and random deletion). Recent work by (Xiang et al. 2020) proposed a part-of-speech-focused lexical substitution for data augmentation (PLSDA) to generate more instances via word substitution. Another augmentation work is applied in translation: (Yu et al. 2018) generated new data to enhance their training data using back-translation with two translation models: the first translates sentences from English-to-French, while the second translates from French-to-English.

Some researchers tackled augmentation by using text augmentation libraries (NLPAug) for specific tasks. Jungiewicz and Smywiński-Pohl (2020) used a range of augmentation techniques for sentiment analysis, including (NLPAug) based on BERT and WordNet. More recently, (Pereira et al. 2021) used the same BERT-based library and contextual word-embedding augmenter to generate more programming problem statements on a training dataset.

In our current paper, we also augment the text data based on the (NLPAug) library. Unlike prior research, which usually focuses on the word level for augmented data, we use several different levels (character, word, sentence). We apply different techniques: word embedding with word2vec (using words as a target); contextual word embedding with BERT, DistilBERT, RoBERTa, and XLNet (using words or sentences as a target); and OCR engine error simulation (using characters as a target). In addition, we create various pipelines based on sequential flow. We construct three different approaches because, in textual augmentation, the best approach depends on the dataset: an approach that improves performance on one dataset may be detrimental on another (Qiu et al. 2020).

2.3 Adaptive models in MOOCs

In this section, we present related works on adaptation and adaptive models implemented in MOOCs. As MOOCs are a rather recent addition, with the term ‘MOOC’ coined in 2008 (Stracke and Bozkurt 2008), and the ‘year of the MOOCs’ only launching them in 2012 (Jordan and Goshtasbpour 2022), adaptation has been slow to be introduced to them, with most of them still being designed via the ‘one-size-fits-all’ paradigm (Shimabukuro 2016; Rizvi et al. 2022), to some extent in spite of the decades of research in adaptive educational hypermedia (Ahmadaliev et al. 2019), intelligent tutoring systems (Mousavinasab et al. 2021; Hodgson et al. 2021), and the like. Nevertheless, a few researchers have started proposing adaptation in MOOCs. For instance, (Alzetta et al. 2018) designed a customised learning path in an interactive and mobile learning environment and in MOOCs using a Question/Answering (QA) system. Another work on adaptive models in MOOCs (Lallé and Conati 2020) created a framework for user modelling and adaptation (FUMA), as adaptive support for learners during video usage; video watching and interaction behaviours were used as features, to reveal inactive learners. Another very recent work proposes an optimal learning path, to prevent MOOC learners from dropping out (Smaili et al. 2022); it provides each learner with an appropriate adaptive path, based on interaction with the environment, using particle swarm optimisation (PSO).

In this research, unlike in previous research, we enable building adaptive models based on learner comments, with the aim of improving communication with instructors.

3 Methodology

This study aims to automatically classify if a MOOC learner’s comment is urgent and so requires flagging for instructor intervention. This means modelling learner data (their comments) to recommend an action to the instructor (here, reply). We call this a fine-grained learner model, as each learner is represented by the set of their comments. More formally, we can write that for learner l1, their learner model L between time points t1 and t2 is given by:

$$L(l_{1}, t_{1}, t_{2}) = \{\, F_{t(c) \in [t_{1}, t_{2}]}\big(\mathrm{urgency}(l_{1}, c)\big) \,\}$$

where F(.) can be any function aggregating the urgency for a given interval (e.g. a sum of urgency values), and urgency(l1,c) represents the fine-grained learner information at the level of a single comment c of learner l1, made during the given time interval [t1,t2]. This learner model L(.) is used to drive the recommendation to instructors (see Sect. 3.3). To achieve this objective, we manually annotate a FutureLearn corpus; we additionally use the widely used benchmark Stanford dataset to validate our best model, thus demonstrating the generalisability of our approach, and its applicability across courses and domains. More information on these datasets can be found in Sect. 3.1.
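To make the learner model concrete, the following minimal sketch (ours, not the authors' implementation; the function and variable names are illustrative) aggregates per-comment urgency scores for one learner over a time window, with summation as the default aggregator F:

```python
from datetime import datetime
from typing import Callable, Iterable, Tuple

def learner_model(comments: Iterable[Tuple[datetime, str]],
                  t1: datetime, t2: datetime,
                  urgency: Callable[[str], float],
                  aggregate: Callable = sum) -> float:
    """Aggregate fine-grained (per-comment) urgency for one learner in [t1, t2].

    `urgency` is any per-comment scorer, e.g. the predicted probability of the
    'urgent' class from the classifiers described later in this section.
    """
    scores = [urgency(text) for posted_at, text in comments if t1 <= posted_at <= t2]
    return aggregate(scores) if scores else 0.0
```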

To determine the appropriate method, we use NLP techniques to construct a diverse predictive model for text classification. We employ two main types of supervised classifiers:

  1. 1.

    A traditional machine learning approach, with handcrafted features as a baseline model;

  2. 2.

    A fine-tuned version of BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al. 2018), representing the latest advance in NLP at the time of writing, as a powerful supervised deep learning model.

To tackle the imbalance problem, several different techniques were employed (see Sect. 3.2.2). One technique that we consider is text augmentation; here, we rely on different approaches (see Text Augmentation in Sect. 3.2.2) and augment the minority-class data with various multipliers (such as 3X and 9X). The reason for using text augmentation is that it helps prevent overfitting; it is considered a crucial regularisation technique (Coulombe 2018).

3.1 Datasets

This research was conducted on two MOOC platform datasets: FutureLearn and the Stanford MOOC Post dataset.

3.1.1 Building UNITE: a Futurelearn-based dataset

FutureLearn, a European MOOC learning platform, is based on a ‘discussion in context’ approach; comments are attached at each course step in the discussion area, excluding steps for quizzes and exercises (Chua et al. 2017). We collected comments written and posted by learners on a Big Data course, as a case study. This course was provided by Warwick University, United Kingdom. We selected this course due to its richness in comments, popularity, and the novelty of the subject, which would likely include an adequate number of urgent comments. Then, we prepared the data and manually annotated the dataset with the help of human experts, to create the Gold standard MOOC Urgency Corpus, a hand-labelled dataset. This task proved to be quite challenging even for the human experts, confirming the findings of previous researchers: (Chandrasekaran et al. 2015a) noted that it is difficult for humans to create such a gold standard dataset via manual labelling of individual cases requiring instructor intervention.

3.1.1.1 Creating the gold standard dataset (UNITE)

The corpus consists of 8263 comments (textual data in English) from the discussion forum of the above-mentioned course, extracted over a 9-week period. Our research objective is to classify urgent comments in discussion forums from the first half of the course, as our previous research indicated that most learners who dropped out were likely to do so in the early stages (Cristea et al. 2018; Alamri et al. 2019), so intervention, if required, would likely be needed early on. In this regard, the following steps were taken to select suitable instances from the original data and prepare them for the annotation process. Learners’ comments from the first half (weeks 1 to 5) of the course were extracted, representing approximately half of the 9-week course. Next, all instructor comments were excluded. This resulted in a total of 5790 comments.

The annotation process was performed independently and manually by four computer science experts: three working as instructors in the Department of Computer Science at a different university to the authors (Kwara State University, Nigeria); additionally, the first author of the present paper was involved in labelling. In creating the Gold standard MOOC Urgency Corpus, we took a similar approach to that used for creating the Stanford dataset, as on (Agrawal and Paepcke)’s website (https://datastage.stanford.edu/StanfordMoocPosts/) and in their research (Agrawal et al. 2015). Specifically, a Likert scale from 1–7 was used to classify the urgency of the comments: a value of 1 indicates that no reason exists for the instructor to read the post, while a value of 7 indicates extreme urgency (as shown in Fig. 1); for more information, see Sect. 3.1.2.

Fig. 1
figure 1

The scale of urgency applied (1–7)

First, the data were pre-processed to exclude all comments with meaningless labels (such as ‘`’, ‘44’, ‘0’, or empty); this left a total of 5786 comments.

Then, we validated and evaluated the quality of the manually labelled comments, by using the weighted Krippendorff's α (Antoine et al. 2014). The resulting agreement between annotators was low (α = 0.33); notably, the Stanford dataset partially suffered from similar issues: the agreement between the optimal coder combination for the Likert variable (1–7) varies considerably per domain (Education: 0.14; Humanities/Sciences: 0.52; Medicine: 0.63).
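As a minimal sketch of this agreement check (assuming the open-source `krippendorff` Python package and illustrative ratings; the original analysis may have used different tooling), ordinal weighting approximates the weighted α on the 1–7 Likert scale:

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows = annotators, columns = comments; values are 1-7 urgency ratings,
# with np.nan for missing ratings. The numbers below are illustrative only.
ratings = np.array([
    [1, 4, 7, 2, np.nan],
    [1, 5, 6, 2, 3],
    [2, 4, 7, 1, 3],
    [1, 4, 6, 2, 3],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```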

Therefore, we first converted the (1–7) scale into a simplified (1–3) scale, as per Fig. 2. This meant, e.g. mapping 1, 2 and 3 together into (1), as they are all non-actionable (non-urgent). When recalculating the agreement, it remained, however, low (α = 0.31).

Fig. 2
figure 2

Dimensionality reduction: converting the (1–7) scale into a (1–3) scale

Thus, to be able to use the data reliably, we decided to identify a dependable sub-set; this sub-set was selected by including only comments that have a level of agreement between annotators of > 75%; in other words, at least 3 annotators (out of 4) must have agreed on the comment’s label. Thus, we used a voting method, which is considered the most appropriate way to integrate different opinions about the same task (Troyano et al. 2004). In this case, only 4622 reliable comments could be included in the gold standard dataset (approximately 80% of the original data).

As we aimed to obtain as many potentially urgent comments as possible, we framed the problem as a binary classification problem, with outputs Urgent and Non-urgent, by converting and ranking the gold standard labels as:

  • Scale = 2 or 3 → Urgent.

  • Scale = 1 → Non-urgent.

Figure 3 depicts the final gold standard labels generated for this research. Please note that we erred on the side of caution in this final step by including neutral comments (urgency = 4) as urgent. This is because, for the Stanford data (Sect. 3.1.2), while some researchers considered that urgency ≥ 4 represents Urgent comments (Almatrafi et al. 2018), others regard urgency > 4 as Urgent (Guo et al. 2019). As here we were only working with integer values for labels, we considered that a value of 4 and above signifies Urgent. This is also in line with our protocol of favouring recall and tolerating false positives (FP).

Fig. 3
figure 3

Final gold standard labels for the UNITE corpus

Therefore, we define urgent comments as comments that need a response from instructors. In general, urgent comments can be about a specific problem encountered, or other latent causes, such as frustration, lack of knowledge, and change in circumstances.
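A simplified sketch of the labelling procedure is given below (ours, with illustrative data); note that, for brevity, it applies the ≥ 3-out-of-4 agreement filter directly to the binarised labels, whereas the paper filters on the intermediate (1–3) scale before binarising:

```python
import pandas as pd

def gold_label(row, annotator_cols, min_agree=3):
    """Binarise per-annotator ratings (>= 4 -> urgent) and keep the comment only
    if at least `min_agree` annotators give the same binary label."""
    binary = [1 if r >= 4 else 0 for r in row[annotator_cols]]
    ones = sum(binary)
    zeros = len(binary) - ones
    if ones >= min_agree:
        return 1            # Urgent
    if zeros >= min_agree:
        return 0            # Non-urgent
    return None             # insufficient agreement: excluded from the gold standard

annotators = ["a1", "a2", "a3", "a4"]            # illustrative column names
df = pd.DataFrame({"comment": ["help, the video will not load", "great week, thanks"],
                   "a1": [6, 2], "a2": [5, 1], "a3": [7, 1], "a4": [3, 2]})
df["label"] = df.apply(gold_label, axis=1, annotator_cols=annotators)
unite = df.dropna(subset=["label"])              # the reliable sub-set used as the gold standard
```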

Unsurprisingly, for our UNITE dataset, this division still resulted in a very high proportion of the comments being categorised as non-urgent (93%, i.e. 4292 comments), with only 330 urgent comments (7%), showing a high degree of imbalance.

3.1.2 Stanford MOOC post dataset

The Stanford dataset (Agrawal et al. 2015) is a gold standard dataset available to academic researchers on request. It contains a large number of English learner forum posts (29,604 in total), commenting on 11 Stanford University MOOCs across three different domains (Humanities/Sciences, Medicine, and Education). Humanities/Sciences include six courses, Medicine four courses, and Education one course. The forum-post annotation was performed by three independent human coders: each post was manually labelled on six dimensions (confusion, sentiment, urgency, question, answer and opinion). Confusion, sentiment and urgency were scored from 1 (low) to 7 (high), while the other items, question, answer and opinion, were labelled on a binary scale (0 or 1). For more information, see the (Agrawal and Paepcke) website (https://datastage.stanford.edu/StanfordMoocPosts/).

Similar to UNITE, for the Stanford dataset, the data were pre-processed by removing unmeaningful comments. This resulted in a total of 29,597 comments.

In the Stanford dataset, the urgency score (i.e. how urgent is it that an instructor reads the post) ranged from 1 = non-urgent to 7 = very urgent as shown in Fig. 1. However, in the current paper, we followed the classification detailed in Sect. 3.1.1: we framed the problem of detecting urgency as a binary classification task, by converting urgency into a binary value:

  • Scale > 4 → Urgent.

  • Remainder → Non-urgent.

We set our threshold to > 4 because, in the Stanford dataset, the label-calculating method does not always produce an integer (possible values: 1/1.5/2/2.5/3/3.5/4/4.5/5/5.5/6/6.5/7). This is further supported by our previous findings (Alrajhi et al. 2020), where we found a correlation between specific values (4 and 4.5) for the sentiment and confusion scales.

Ultimately, across the whole dataset, non-urgent cases represented 81% (23,991 comments) and urgent cases represented 19% (5606 comments, varying between 3.2% and 37.6% across the 11 courses), with urgent posts being those with urgency > 4.

3.2 Experiments for imbalanced data

To achieve a comprehensive understanding of the best way of automatically identifying the urgency of comments on MOOCs, we use, as mentioned, two common supervised machine learning strategies (traditional shallow ML and deep learning with BERT) to automatically classify the comments. Additionally, as urgency detection is a typically imbalanced data problem (hence, any MOOC provider would need to take imbalance into account), we experiment with various techniques to deal with the input data, as per Fig. 4.

Fig. 4
figure 4

Our proposed pre-processing (data balancing) and ML pipeline combinations

First, we apply several training models to the original data of the gold standard UNITE corpus. Then, to improve performance, we design and develop three solutions to handle imbalanced data: (i) text augmentation; (ii) text augmentation + undersampling; and (iii) undersampling (see details in Sect. 3.2.2). Text augmentation involves performing a range of approaches in different combinations, to augment the minority class in the training data. In undersampling, we randomly select instances from the majority class. Text augmentation + undersampling is a combination of the two previous techniques. All the experiments were conducted using a stratified four-fold cross-validation approach, to ensure representative results. The general architecture of the proposed classification model is shown in Fig. 5; all the experiments are explained in detail in Sect. 3.2.
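For concreteness, a minimal sketch of the stratified four-fold protocol is shown below (assuming scikit-learn; the data and the balancing step are placeholders):

```python
from sklearn.model_selection import StratifiedKFold

# Toy stand-ins for the UNITE comments and their binary labels (1 = urgent).
texts = [f"comment {i}" for i in range(20)]
labels = [1 if i < 4 else 0 for i in range(20)]

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(texts, labels), start=1):
    X_train = [texts[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    X_test = [texts[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    # Balancing (augmentation and/or undersampling) is applied to the training
    # fold only; the test fold keeps the original class distribution.
```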

Fig. 5
figure 5

The general architecture of the classification model

3.2.1 Classifiers

As said, we compare two major classification model types to classify the comments: (i) shallow machine learning (baseline models built with classical machine learning algorithms), and (ii) BERT (one of the most popular transformer models, as further explained).

3.2.1.1 Shallow machine learning

We apply several machine learning models (see Fig. 6) to the classification task, each with different fundamental mechanisms for feature engineering, to capture the most effective features. This includes count vectors and term frequency–inverse document frequency (TF-IDF), to find an adequate classifier to predict urgent comments. We extract different feature sets via four different classical methods: (i) count vector; (ii) TF-IDF vector (word level); (iii) TF-IDF vector (n-gram word level); and (iv) TF-IDF vector (n-gram character level). Then, we build different popular classifiers on each of these feature sets (naive Bayes, logistic regression, support vector machine, random forest, and a boosting model, extreme gradient boosting (XGBoost)), as displayed in Fig. 6.

Fig. 6
figure 6

Our framework of the shallow ML classifiers using different features

We represent each comment with a specific vector; the count vector counts the frequency of every given word in every comment. TF-IDF calculates a numerical statistic evaluating the relatedness between a particular word and a specific comment within a collection of comments; it thus measures how important a word is in that collection. Three different levels of TF-IDF tokens were considered (word, n-gram word with range (2,3), and n-gram character with range (2,3)), with maximum features = 5000.
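A sketch of this feature/classifier grid, assuming scikit-learn and XGBoost (any hyper-parameter beyond those stated above is an assumption), could look as follows:

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

# The four classical feature sets described above (n-gram ranges of (2, 3),
# max_features = 5000, as in the text).
vectorisers = {
    "count": CountVectorizer(max_features=5000),
    "tfidf_word": TfidfVectorizer(analyzer="word", max_features=5000),
    "tfidf_ngram_word": TfidfVectorizer(analyzer="word", ngram_range=(2, 3), max_features=5000),
    "tfidf_ngram_char": TfidfVectorizer(analyzer="char", ngram_range=(2, 3), max_features=5000),
}
classifiers = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(probability=True),
    "random_forest": RandomForestClassifier(),
    "xgboost": XGBClassifier(eval_metric="logloss"),
}

# One pipeline per (feature set, classifier) combination; clone() gives each
# pipeline its own unfitted copy, to be fitted on every (balanced) training fold.
models = {(v_name, c_name): make_pipeline(clone(vec), clone(clf))
          for v_name, vec in vectorisers.items()
          for c_name, clf in classifiers.items()}
```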

3.2.1.2 BERT

For deep learning, we employ the currently most popular and competitive approach in text classification tasks: BERT. Using BERT enabled us to avoid feature engineering, as is typical for deep learning. We fine-tune a pre-trained ‘BERT-Base, Uncased’ (L = 12, H = 768, A = 12, Total Parameters = 110M) version of the BERT classifier, which is the smaller of the two available models and was selected due to its shorter training time, with one additional layer for classification. For the BERT input, which is a sequence of tokens, we limit each comment to its final 128 tokens. This decision on the final tokens and size is based on various pre-experiment trials (final/first tokens; different sizes) that rendered this number (128 tokens) the most suitable, encompassing most comments, with truncation only affecting 8% of UNITE and 10% of the Stanford data. We use the Adam optimizer to tune BERT over 4 iterations.
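A minimal fine-tuning sketch is given below, assuming the Hugging Face transformers library (the learning rate and the absence of batching/validation handling are simplifying assumptions; the paper's exact training script may differ):

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer.truncation_side = "left"   # keep the *final* 128 tokens (recent transformers versions)
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # adds one classification layer on top of BERT-Base

texts = ["I cannot open the dataset in step 2.3, please help", "Loved this week!"]  # placeholders
labels = torch.tensor([1, 0])

enc = tokenizer(texts, truncation=True, max_length=128,
                padding="max_length", return_tensors="pt")
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # learning rate is an assumption

model.train()
for epoch in range(4):               # 4 iterations, as stated above
    optimizer.zero_grad()
    out = model(**enc, labels=labels)
    out.loss.backward()
    optimizer.step()
```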

3.2.2 Text balancing techniques

We developed several classifier models based on different techniques for manipulating the data. First, each of our models was run using the original gold standard corpus. Then, to tackle the imbalance problem, we independently applied the following approaches: (i) text augmentation; (ii) combined text augmentation followed by undersampling; and (iii) resampling using undersampling.

3.2.2.1 Original data usage (gold standard corpus)

As an initial experiment, we implemented all our models directly with the original UNITE data. We split the dataset into four groups using stratified k-fold cross-validation, choosing a value of k = 4 (4 folds). We chose the k-fold cross-validation approach because it allowed us to obtain results with less bias towards specific data (Berrar 2019). We use stratification in the dataset: the selection of data led to an equal distribution of every class in every set. Thus, every fold contained the same percentage of samples from each class (see Fig. 7), as follows: training fold 3466 or 3467 samples (3219 as class 0, i.e. non-urgent, and 247 or 248 as class 1); testing fold 1156 or 1155 samples (1073 as 0 and 83 or 82 as 1) in each iteration (see Table 1). Please note that we did not use the more frequently encountered ten-fold validation, as, due to the very low number of urgent cases, this would have resulted in a too-low value per stratum for efficient stratification.

Fig. 7
figure 7

Splitting the data using k-fold cross-validation and stratification

Table 1 Number of cases for every class in (training, test) sets in each iteration: original data

For the training with BERT, we further divide the training data into 90% training (0 = 2897, 1 = 222 or 223) and 10% validation (0 = 322, 1 = 25), again using stratification.

However, we found the results unsatisfactory for the various classifiers (see Sect. 4). We considered this to be due to class imbalance. To overcome this issue and enhance prediction performance, we next employ alternative techniques, as described in the following sections.

3.2.2.2 Text augmentation

To manage the class imbalance and boost performance, we instead pre-process the data using artificial resampling (augmentation) to generate more minority-class cases for the training set of each fold, towards an almost balanced dataset. We augment every instance in the minority class into three and nine instances, respectively. We chose these values based on the literature, which reports that, for some datasets, a low number of repetitions might not be sufficient to decrease the bias of the model towards indiscriminately predicting the majority class, whereas a higher repetition value might render the data non-representative (Haixiang et al. 2017; Madabushi et al. 2020; Fonseca et al. 2020), so experimentation is necessary. Thus, in our work, we experiment with both; the number of items in the training and test sets at every iteration, for 3X and 9X augmentation, is shown in Table 2.

Table 2 Number of cases for every class in (training, test) sets in each iteration: text augmentation (3X–9X)

To achieve the augmentation goal, we apply common, easy-to-implement techniques for text augmentation, using the public NLPAug library, a Python library dedicated to augmentation (Raghu and Schmidt 2020); we accessed it via the Makcedward GitHub repository (Makcedward 2020). We use 3 different hybrid approaches: (i) word-level with the same type (BERT); (ii) word-level with different types; and (iii) different levels (character, word, sentence), as shown in Table 3.

Table 3 The approaches using different augmenters

In the first approach, we apply a hybrid method consisting of three different actions (3X) with the ContextualWordEmbsAug augmenter: inserting with BERT, substituting with BERT, and substituting with DistilBERT, to discover the most appropriate words for augmentation, as shown in Table 4.

Table 4 An example of different augmenters for 3X in the first approach on a comment in UNITE
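A sketch of this first approach, assuming the NLPAug augmenter classes named above (recent NLPAug versions return a list of strings from augment(); the example comment is illustrative), is:

```python
import nlpaug.augmenter.word as naw

comment = "I still cannot run the Hadoop example from week 2, can anyone help?"  # illustrative

# The three augmenters of the first approach (3X): contextual word embeddings only.
augmenters = [
    naw.ContextualWordEmbsAug(model_path="bert-base-uncased", action="insert"),
    naw.ContextualWordEmbsAug(model_path="bert-base-uncased", action="substitute"),
    naw.ContextualWordEmbsAug(model_path="distilbert-base-uncased", action="substitute"),
]

# Each urgent training comment yields three synthetic urgent comments.
augmented_3x = [aug.augment(comment) for aug in augmenters]
```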

Then, we build upon the (3X) method and increase the number of instances to (9X), by generating an additional 3X more instances for every instance. This is achieved by constructing six sequential pipelines, each representing a multi-augmenter (bi- or tri-augmenter), as shown in Table 5. Table 6 provides examples of 9X augmentation.

Table 5 Different pipelines to generate (9X) in the first approach
Table 6 An example of different augmenters for 9X in the first approach

Next, we conduct the second approach, another augmentation procedure, by mixing several augmenter functions based on word-level (see Table 3): WordEmbsAug (substitute word2vec) and ContextualWordEmbsAug (substitute BERT and substitute RoBERTa).

Last, as per Table 3, we construct the third approach, which is based on three different levels of augmenter (character, word and sentence). For character level, we used OcrAug (character substitution simulating OCR errors). For word level, we used ContextualWordEmbsAug (substitution with BERT). For sentence level, we used ContextualWordEmbsForSentenceAug (sentence insertion with XLNet).
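The third approach, and the sequential-flow idea behind the multi-augmenter (9X) pipelines, can be sketched as follows (again assuming the NLPAug API; the specific pipeline combination shown is only one possible example):

```python
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as naf

comment = "The code in step 3.4 throws an error, please advise."  # illustrative

# Third approach: one augmenter per level (character, word, sentence).
char_aug = nac.OcrAug()                                                          # OCR-style character noise
word_aug = naw.ContextualWordEmbsAug(model_path="bert-base-uncased",
                                     action="substitute")                        # BERT word substitution
sent_aug = nas.ContextualWordEmbsForSentenceAug(model_path="xlnet-base-cased")   # XLNet sentence insertion

three_x = [aug.augment(comment) for aug in (char_aug, word_aug, sent_aug)]

# For 9X, additional instances come from chaining augmenters into sequential
# pipelines (bi- or tri-augmenters), e.g. character noise followed by BERT substitution.
pipeline = naf.Sequential([char_aug, word_aug])
extra = pipeline.augment(comment)
```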

Then, we apply the shallow machine learning and BERT models, as explained in Sect. 3.2.1, based on 3X and 9X augmentations.

3.2.2.3 Text augmentation + undersampling

By creating nine new artificial instances per minority-class instance in the training set, we obtain an almost-balanced dataset, albeit with a concern about its non-representativity. By creating only three new instances, we moderately increase the data variation but make a smaller move towards balancing the dataset; the model may then still minimise its errors by frequently predicting the majority class, achieving high accuracy yet low recall and precision for the minority class. We deal with these two concerns by applying a hybrid resampling method, combining the augmentation technique with undersampling.

In these experiments, we aim to balance the datasets by combining both the text augmentation and undersampling methods, as follows. First, we increase the minority class to 3X or 9X via augmentation. Second, we undersample by randomly reducing the number of elements in the majority class to equal the (augmented) minority class in every fold. The resulting number of samples per class for each pipeline was therefore approximately 990 for 3X and 2475 for 9X.
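Random undersampling itself is straightforward; a minimal sketch (ours, with illustrative names) that cuts the majority class down to the size of the, possibly already augmented, minority class is:

```python
import random

def undersample(texts, labels, seed=42):
    """Randomly drop majority-class (non-urgent, 0) items until both classes are equal."""
    rng = random.Random(seed)
    urgent = [(t, y) for t, y in zip(texts, labels) if y == 1]
    non_urgent = [(t, y) for t, y in zip(texts, labels) if y == 0]
    non_urgent = rng.sample(non_urgent, k=len(urgent))   # match the minority-class size
    balanced = urgent + non_urgent
    rng.shuffle(balanced)
    return [t for t, _ in balanced], [y for _, y in balanced]

# Applied per training fold: the minority class is first grown 3X or 9X by
# augmentation, then the majority class is reduced to the same size.
```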

3.2.2.4 Undersampling (random)

To balance the class distribution in the original data, we performed an alternative popular method—the undersampling technique for imbalanced data classification—by randomly removing instances in the majority class. Thus, in this case, the numbers of samples for each class were 247 or 248.

3.2.2.5 FutureLearn and Stanford datasets

As explained, the distribution of the urgent class in the FutureLearn dataset (7%) differs from that in the Stanford dataset (19%). Therefore, these techniques for handling imbalanced data were expected to affect the performance results of the two datasets differently. Figures 8 and 9 show the distribution of every class in every fold for every method, for UNITE and for the Stanford dataset (3X), respectively.

Fig. 8
figure 8

The distribution of every class in every fold in every method for UNITE: our FutureLearn dataset

Fig. 9
figure 9

The distribution of every class in every fold for every method for the Stanford dataset

3.3 Illustration of adaptive intervention models

In this section, we introduce the design of illustrative adaptive intervention models for instructor interaction, based on our automatic urgency detection. These models showcase how the user model parameters proposed by this study can fit into simpler or, gradually, more complex user models; ‘users’ here means instructors, as primary target users, and learners, as potential secondary target users. Specifically, we provide two practical scenarios for semi-automatic instructor intervention: (1) semi-automatic intervention that tackles unbalanced data with a classification model; and (2) comment filtering that improves instructor intervention, by filtering the results based on learners, their number of comments and the time of posting of the comment.

3.3.1 Semi-automatic instructor intervention: basic scenario

The first scenario introduces an artificial supporting instructor, as a pipeline incorporating the classification model, representing the learner model, using additional information on the instructor (the instructor model), as shown in Fig. 10.

Fig. 10
figure 10

The adaptive intervention model based on learners' comments; note how our predicted urgency becomes a (derived, fine-grained) learner model variable, together with the comments per learner

A basic instructor model would minimally contain variables such as the instructor time available for a specific session and the reading time per comment, or, alternatively, a maximum number of comments to read in that session (hence, a simple 2-variable user model for the instructor). The learner model contains 2 variables as well: the learners’ comments and the urgency of comments at post level (fine-grained). Based on this information, the adaptive intervention model can automatically retrieve the topmost urgent comments, depending on their ranking (e.g. from a probability score given by a classification model), thus reducing the overload on the instructor.

For example, instructor Laura has answered all of yesterday's comments from learners. She wishes to know if there are any urgent comments today, as she has only 30 min, after which she needs to go and teach another course. All this information represents the instructor model. The MOOC webpage for today has 3 items, and each has acquired, on average, 150 comments. She thinks that she would be able to answer about 10–15 comments at most (and adds this information to her instructor model). Thus, the artificial support instructor recommends that Laura answer the top 5 most urgent comments for each of the 3 items from today’s class. This recommendation represents the adaptive model: the combination of the classification model and our technique for dealing with imbalanced data, which automatically classifies posts and detects urgent comments, adapting to the instructor’s needs, helping Laura avoid reading all the comments, and improving her interaction with the learners.
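Retrieving the top-k urgent comments from a fitted classifier is then a simple ranking step; a sketch (ours; `predict_proba` assumes a scikit-learn-style pipeline such as the naive Bayes model sketched earlier) is:

```python
import numpy as np

def top_k_urgent(comments, model, k=5):
    """Return the k comments with the highest predicted probability of class 1 (urgent)."""
    proba_urgent = model.predict_proba(comments)[:, 1]
    top_idx = np.argsort(proba_urgent)[::-1][:k]
    return [(comments[i], float(proba_urgent[i])) for i in top_idx]

# e.g. show Laura the 5 most urgent comments for each of today's 3 course items:
# for item_comments in todays_items:          # hypothetical list of comment lists
#     shortlist = top_k_urgent(item_comments, fitted_pipeline, k=5)
```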

3.3.2 Semi-adaptive instructor intervention: expanded scenario based on coarse granularity and expanded learner models

The first scenario deals with the recommended urgent comments, as per our pipeline proposed in this paper. However, this model can be further improved. Next, we show how comments can be grouped, to further refine the learner model, and to deal with urgency at the (coarser-grained) learner level, instead of the comment level. This may show whether a learner is generally in trouble and needs support, which may make dealing with that learner more pressing. This is consistent with the findings of (Alrajhi et al. 2021), which showed that learners write more comments overall when they require urgent intervention.

For example, instructor John wishes to use Laura’s system for classifying comments, but has noticed that learners tend either to send many urgent comments when they are in trouble, or to be generally happy and thus send fewer comments. He would like his load reduced and hence not to answer seemingly urgent comments coming from learners with very few comments. Thus, he wishes learners to be grouped into urgent and non-urgent learners, as shown in Fig. 11. John will now be able to answer the urgent learners first, even if some of the non-urgent learners may have posted comments that sound urgent but are less likely to need intervention.

Fig. 11
figure 11

Refining the learning modelling of urgency, based on two groups (non-urgent/urgent)

An extension to the learner model would be to add this coarse-grained, learner-level classification, together with the number of comments per learner, and then to further cluster learners based on it. We compute the Pearson correlation between the number of comments written per learner and the number of those comments that need urgent intervention. We then apply Silhouette analysis, to check the number of clusters, followed by the Fisher–Jenks algorithm (as we only work on one-dimensional data), to perform the clustering. These clusters are then merged into 2 groups, differentiating learners with a high number of comments from those with a low number (urgent/non-urgent learners).
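A sketch of this clustering step is shown below, assuming scipy/scikit-learn for the correlation and Silhouette analysis and the `jenkspy` package for the Fisher–Jenks breaks (the learner counts are illustrative; jenkspy's keyword for the number of classes has changed across versions, so it is passed positionally):

```python
import numpy as np
import jenkspy                      # pip install jenkspy
from scipy.stats import pearsonr
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# One value per learner (illustrative data).
comments_per_learner = np.array([1, 2, 2, 3, 5, 8, 12, 15, 20, 34])
urgent_per_learner = np.array([0, 0, 1, 0, 1, 2, 4, 5, 7, 11])

r, _ = pearsonr(comments_per_learner, urgent_per_learner)   # reported as 0.65 in the paper

# Silhouette analysis to choose the number of clusters on the 1-D comment counts.
X = comments_per_learner.reshape(-1, 1)
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10).fit_predict(X))
          for k in range(2, 6)}
n_clusters = max(scores, key=scores.get)

# Fisher-Jenks natural breaks (positional args: values, number of classes),
# then assign each learner to the cluster defined by the inner break points.
breaks = jenkspy.jenks_breaks(comments_per_learner.tolist(), n_clusters)
cluster = np.digitize(comments_per_learner, breaks[1:-1], right=True)

# Merge the upper clusters into one 'urgent learner' group (high number of comments).
is_urgent_learner = cluster >= 1
```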

In addition, we further adapt the intervention based on the time stamp of the comments of each of these learners, to provide John with comments of the urgent learners, ordered first-come-first-served (FCFS). Thus, number of comments and time stamps are variables added to the extended learner model in this example. The overall adaptive model is summarised in Fig. 12, using the same instructor model as previously, but an expanded, three-variable learner model: coarse-grained learner-level urgency; with fine-grained post-level urgency; and learner comments.

Fig. 12
figure 12

The adaptive intervention model based on coarse-grained, expanded learner modelling, with two learner groups based on number of comments (low/high); here, the instructor model is the same as in Fig. 10, but the learner model has been expanded with an additional variable

4 Results

This section provides a discussion of our experimental results for the two main types of classifiers (shallow machine learning and deep learning), as well as results related to our example adaptive intervention models.

4.1 Experiments for imbalanced data

4.1.1 Shallow machine learning on the UNITE dataset

In shallow machine learning, five different classifiers were tested with different types of feature engineering and different augmentation approaches (Approach #1: word level, with the same type (BERT); Approach #2: word level with different types (Word2vec, BERT and RoBERTa); and Approach #3: different levels (character, word, sentence) with different types (OCR, BERT and XLNet)), as discussed in Sect. 3.2.2 under the text augmentation heading. Table 7 shows the accuracy (ACC) of the basic classifier (naive Bayes) with count vector features under each of these settings. Despite some of these models obtaining around 90% accuracy (see Table 7), this does not mean that they are good models; they could be biased towards the majority class of the imbalanced dataset. Thus, to achieve more accurate results, we used other metrics to measure performance, such as Precision (P), Recall (R) and F1 measure (F1), derived from the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) of the confusion matrix.

Table 7 Results of the Naive Bayes model with count-vector feature engineering with original data, with 3 approaches to augmentation (see Table 3) using 3X and 9X (Table 2) with and without undersampling and with undersampling without augmentation. Underline: best R for class 1 (Urgent), Bold: best performance, balancing between class 1 (Urgent) and class 0 (Non-urgent) in the UNITE dataset

This research project aims to correctly classify urgent cases, which is captured by Recall (R). We propose to use Recall as the main evaluation metric for urgent comments, as Recall (the correct identification of most of the urgent cases, preferably all, allowing for false positives) ensures that all urgent cases take precedence; this is more important than Precision (the correct identification of only urgent cases, but possibly missing some, allowing for false negatives). Specifically, we try to improve R for the positive class. In addition, we separately show how we can add a filtering process, to retrieve the most urgent comments as prioritised by their probability under the classification models. Thus, we potentially reduce the instructor effort required to review and read many comments.

Table 7 shows the count-vector feature as a case study. The R for class 1 (urgent), based on the original data, was very low (0.05). We were able to improve performance by applying different approaches to enhance the data and address the imbalance problem. The best R was obtained using undersampling (Under), a substantial improvement (0.82; this improvement is statistically significant: Mann–Whitney U test: p < 0.05), but the result for class 0 dramatically decreased to 0.49 (from 0.99). In contrast, the data manipulated with 3X augmentation + undersampling achieved the best overall performance, balancing between class 1 and class 0. Most of the three augmentation approaches follow the same pattern; there are some exceptions, which will be discussed later in this section.

Our aim is to find the best technique to deal with the imbalanced data across the three different augmentation approaches (not to find the best feature engineering approach). The reason for using different features is to confirm which imbalanced-data technique performs better across all feature sets and to make our experiments more general. Therefore, we can generalise the findings to (a) all approaches on specific features, (b) all features on a specific classifier, and (c) all classifiers, since the effectiveness of the proposed methods of data manipulation was similar for most classifiers (as shown in Appendix B). We decided to report and discuss only one of these classifiers (naive Bayes) with one feature (count vector), for conciseness; the results of the other types of classifiers are provided in Appendix B. The exceptions are discussed in the next paragraph.

Whilst most of the findings are the same, there are a few exceptions; for example: (1) The strongest predictors for recall (R) were mostly those with undersampling (Under); however, some models (Random Forest and Boosting) with TF-IDF vectors (n-gram word level) perform better with other text augmentation approaches than with undersampling (Table 8). (2) The best performance was often obtained from the data with 3X augmentation + undersampling, achieving a balance between class 1 and class 0; but for some models, 9X augmentation + undersampling outperforms 3X augmentation + undersampling, as shown in Table 9. (3) In terms of approaches, the goal of building more than one approach was to generalise the results of the technique used in data manipulation. The results for the different approaches reveal that no single approach can be considered the best. However, interestingly, approach 3, which is based on different levels (character, word, sentence), always obtains the best R if we use TF-IDF vectors (n-gram character level) as a feature, across all experiments (as shown in Appendix B).

Table 8 Cases in which the results of the text augmentation techniques are higher than the results of undersampling technique
Table 9 Cases in which the results of the 9X augmentation + undersampling are higher than the results of 3X augmentation + undersampling

4.1.2 BERT on the UNITE dataset

When using BERT, Table 10 shows the prediction performance for the different methods of manipulating the data. As mentioned, only augmentation was performed; no feature engineering was necessary. The R for class 1 in BERT with the original data was not too low in comparison with the shallow machine learning results. It rose from 0.52 to 0.82 with the undersampling technique; this difference is statistically significant (Mann–Whitney U test: p < 0.05). However, for the negative class, recall decreased from 0.98 to 0.86. To achieve more balance between the two classes, we used 3X augmentation + undersampling (see Table 10).

Table 10 Results of the BERT model with original data, with 3 approaches to augmentation (see Table 3) using 3X and 9X (Table 2) with and without undersampling and with undersampling without augmentation. Underline: best R for class 1 (Urgent), Bold: best performance, balancing between class 1 (Urgent) and class 0 (Non-urgent) in the UNITE dataset

Hence, the best classifier performance on the UNITE dataset is BERT with the ‘approach 3 with 3X augmentation + undersampling’.

To verify the effectiveness of the different data manipulation techniques to deal with the imbalanced data problem, we utilised the same methods on the Stanford dataset. In these experiments, we limited augmentation to 3X only, since 9X would have generated more instances in the minority class than in the majority class. We also applied only ‘approach 3’ (see Table 3), which provided the best performance for the (3X augmentation + undersampling) technique on the UNITE dataset.

4.1.3 BERT on the Stanford dataset

Table 11 shows the results of BERT on the Stanford dataset. We obtained similar results to those for the UNITE data; the only difference lies in the performance of the two techniques with 3X augmentation, with and without undersampling. This is possibly because the distribution of non-urgent cases differs between the two datasets (see Figs. 8, 9): as shown in Fig. 9, for the Stanford dataset the distribution of non-urgent cases for 3X is almost the same as for (3X + undersampling).

Table 11 Results of the BERT model with original data, with 3 approaches to augmentation (see Table 3) using 3X (Table 2) with and without undersampling and with undersampling without augmentation. Underline: best R for class 1 (Urgent), Bold: best performance, balancing between class 1 (Urgent) and class 0 (Non-urgent) in the Stanford dataset

4.2 Adaptive intervention models

4.2.1 Basic adaptation scenario

In this scenario, depending on the ranking of urgent comments (the probability score given by the classification model), the aim is for the adaptive intervention model to automatically retrieve the most important urgent comments and reduce the number of comments that are read by the instructor. In this case, we used naïve Bayes with count vector, with approach #1 for 3X augmentation with undersampling (the best performance among the different approaches for naïve Bayes with count vector), as a case study. For example, if the time available is limited to reading 5 comments, then the model will retrieve only 5 comments. Table 12 presents the comparison between the basic model (all comments) and the adaptive model that selects just the top 5 urgent comments for the urgent class (1); the latter clearly outperforms the basic model on all evaluation criteria.

Table 12 Results of the Naive Bayes model with count vector as a feature engineering with first approaches to augmentation (see Table 3) using 3X with undersampling. First row: basic model with all data, second row: filtering model with top urgent 5 comments for the urgent class ‘1’ in the UNITE dataset

4.2.2 Expanded adaptation scenario

For the second scenario, we proposed an adaptation filtering model based on the number of learner comments. We use Pearson's correlation to calculate the correlation between the number of written comments per learner and the number of comments from those that require immediate attention. This process resulted in a strong correlation (0.65).

The results of the Fisher–Jenks algorithm for clustering learners are shown in Table 13. To obtain the two groups (urgent/non-urgent learners), we then merge clusters 1 and 2, to reflect the learners with a high number of comments, as these are significantly more communicative than learners in cluster 0.

Table 13 Clustering learners based on their number of comments

We remove the comments of the low-number-of-comments group (non-urgent learners) from each fold (using stratified four-fold cross-validation). The number of comments in the test set is shown in Table 14 for both the basic model, which contains all learners, and the filtering model, which only contains learners with a high number of comments (urgent learners). Hence, the number of comments in the filtering model is much lower than for the basic model; for example, in fold 1 it dropped from 1156 (basic) to 533 (filtering), reducing the number of comments the instructor needs to read. Thus, whilst the overall recall is somewhat reduced (by 11%), the load of the instructor is significantly (p << 0.5) reduced as well.

Table 14 Number of comments on Test Set; First row: basic models, second row: filtering models in the UNITE dataset

5 Error analysis

We conducted an in-depth re-analysis of our model to understand the reasons for the errors obtained in the test set in every fold. For this purpose, we manually inspected the mistakes that our best algorithm (BERT with text augmentation + undersampling) made on the UNITE data. Specifically, false negatives (FN), which the model categorised as non-urgent (although they are labelled as urgent), were considered to be the most critical errors, as our aim was to capture all urgent cases. To put the results, and especially the errors, in context, we compared the mispredictions of the classifier with human-level performance for the different folds (using stratified k-fold cross-validation, choosing a value of k = 4 (4 folds), as explained in the methodology, Sect. 3.2.2). The results are shown in Table 15.

Table 15 FN results for the best algorithm versus disagreement between human annotators

From the results, we found that most of the FN cases were also mirrored in the disagreement between annotators (e.g. for fold 1, the human annotators also disagreed on 19 of the 23 false negatives misclassified by our classifier; see Table 15). This further supports the notion that decision-making among annotators is difficult, and that the more difficult cases are hard for both humans and classifiers to categorise; examples from each fold are shown in Table 16.

Table 16 Anonymised examples of FN results and disagreement between human annotators on UNITE data

From Table 16, we can better understand why humans and machine learning models struggle in certain cases. For example, in fold 1, the learner does not understand the diagram, but s/he is happy to provide a meaning and context for the words used in the analysis. Some annotators considered this comment non-urgent, because the learner did not request assistance; however, another annotator may find that the learner has difficulty understanding the concept. Such clashes may explain why the model was unable to detect the above-mentioned urgent cases.

6 Discussion, limitations and future work

In MOOC environments, detecting urgent cases is a critical issue. Due to the nature of MOOCs, urgent cases are rare compared to non-urgent ones, which leads to imbalanced data. A further issue is that intervention in past research (Almatrafi et al. 2018; Guo et al. 2019) follows a one-size-fits-all approach, without any personalisation based on the learners, in spite of long-standing personalisation research in education.

In this paper, our aim is to propose a solution for imbalanced MOOC data, and to adapt and improve system interaction by automating urgency detection, enabling an instructor to decide when to react and thus adapting the timing of interventions to the urgency detected in the learner. The potential beneficiaries are MOOC providers, then instructors on MOOCs, and then the learners.

The ultimate goal of this work is the last step, where we personalise (by automatic adaptation) the identification of urgent comments in MOOCs for instructors, our primary users. This means we are building ‘interactive computer systems that can be adapted or adapt themselves to their current users’, adapting to the needs of the instructors to manage their workload, as well as, indirectly, to the needs of our secondary users, the learners, to have their urgent messages identified (and, ultimately, answered).

Urgency of intervention is an interesting area, raising the question of how dependent urgency is on the learner: for instance, it is possible that another learner has already dealt with the urgent question, so showing the instructor the full thread to inspect is also useful. Here, we look first at fine-grained learner modelling, where we consider each comment as a feature of a learner that, if urgent, needs to be dealt with on its own. Next, as research has revealed correlations between urgency and number of comments, showing that learners posting urgent comments are likely to post many of them and can hence be classified, at the macro-scale, as ‘urgent learners’ (Alrajhi et al. 2021), we also propose coarse-grained learner modelling, where learners are grouped as either urgent or non-urgent learners. Urgent learners would need to be treated with priority by the instructors.

Modelling learners based on comments only is a simplification, as any model is a simplification of the world. However, we believe that learners' comments can provide insight into some of their characteristics and needs. For instance, the language of a comment can show anxiety, a certain level of background knowledge, or impatience, thus covering various learner model variables. It is, however, possible that some learners are missed this way; moreover, learners who refuse to engage with comments will not be identified via these methods.

Learner models can contain several parameters and be simpler or richer; they can reflect various aspects of a learner, often including parameters such as the current level of confusion, motivation and understanding. Interventions to reduce learner drop-out from the MOOC could include changing the difficulty or type of problems, referring the learner to modules covering missing prerequisite knowledge, peer referrals, encouraging communication, etc. In this paper, we add to this rich tapestry of user model dimensions by extracting urgency directly from user comments, which has not been considered before in user modelling. Importantly, we consider comment-based user modelling rich, in the sense that comments may reflect various aspects of a learner (boredom, interest, knowledge, fluency, etc.). Our learner model can be used by itself, or in conjunction with other user parameters (if known), thus further enriching the user model; this, however, does not detract from the merit of the parameters we introduce with our approach.

The main limitation is that the automatic classification in general, and the potential solution for imbalanced data in particular, may not generalise to all online courses and platforms, as it has been applied to only one specific course on FutureLearn. However, as shown in Sect. 3, we have further validated our best solution on the highly popular and well-used Stanford dataset, thus strengthening the case for the generalisability of our approach and its applicability across courses and domains.

There are numerous opportunities for future work, such as exploiting other features, like the number of posts in a thread; while this may not directly tell us whether an individual post is urgent, we could analyse the numbers per topic, per learner, etc. Such a multitude of posts may, however, only reflect a very popular topic or a very prolific learner. Interestingly, analysing the correlation between FutureLearn ‘likes’ of posts and their urgency showed no relation between them; moreover, as not all posts have ‘likes’, our current approach is more generalisable. A lexicon-based urgency method, based on identifying key terms (keywords or n-grams) that could indicate urgency, may also be considered; however, current deep learning methods are known to outperform lexicon-based ones. Another interesting approach would be to increase the priority of urgent cases based on the number of learners involved, i.e. if an issue is raised by many learners, it could be considered of higher urgency. In addition, peer reactions could be taken into account, both in terms of declaring a problem solved (i.e. a peer has answered it) and, where they generate a flurry of responses, in flagging an issue as very urgent (as many in the class may struggle with the same problem).

Additionally, detecting learner affective states may allow (artificial) instructors to adapt their support to those states. Furthermore, labelling data based on sentiment analysis, especially from a confusion/frustration perspective, as a supervised or, more interestingly, an unsupervised method, may be a cheaper way to detect urgency; however, it may raise challenges in terms of accuracy. Our data were already labelled for urgency, regardless of the original cause (frustration, lack of knowledge, change in circumstances, etc.). Thus, we believe our approach to be more generic, as it encompasses all these reasons as well as other latent causes. Finally, please note that urgency is a relative concept; we have addressed some of these aspects in this paper within its definition (Sect. 3.1.1), but further work could look into refining or specialising that definition.

Finally, whilst our results are very specific to MOOC comment analysis, our techniques may serve as a template for other similar NLP classification tasks using machine learning with severely skewed datasets.

7 Conclusion

On MOOC platforms, deciding the right moment for instructor intervention is an important challenge to be overcome to better support learners and lower drop-out rates. Building an automated model to detect comments that require urgent intervention represents a promising solution to this problem. However, the available comment datasets naturally contain only a few urgent cases, leading to imbalanced data, which explains the difficulty in creating models to detect such cases accurately. In this work, we analysed and compared three techniques (text augmentation, text augmentation + undersampling, and undersampling) to improve the quality of such data. We also provided several new pipelines incorporating different text augmenters. Our results show that an increase in model performance can be obtained via undersampling, and that a combination of text augmentation + undersampling achieves the best performance in balancing the two classes.

These results help in retrieving the most urgent comments for instructors. To show how this can be applied, we have illustrated it with two adaptive models, based on two types of user models: (1) personalised instructor intervention based on a fine-grained learner model and (2) filtering results based on a coarser-grained learner model.

We further inspected wrongly classified urgent instances and found that the problem does not simply lie with the classifier: it also stems from the data, which humans also find difficult to annotate. This indicates that the difficulties faced by human annotators in classifying such comments are also faced by the models.

Additionally, whilst the majority of previous work on instructor intervention was based on the Stanford corpus, in this research we used the FutureLearn platform, with a total of 5790 comments annotated by human experts, to form the new UNITE corpus.