1 Introduction

Massive Open Online Course (MOOC) platforms continue to grow dramatically; in recent years, Coursera, edX and FutureLearn have emerged as popular platforms (Joseph 2020). They support a wide variety of efficiently delivered courses with easy access, which open numerous learning opportunities to anyone wanting to learn a specific topic or obtain new information. As the popularity of MOOCs continues to grow, ever more people with a broad diversity of background knowledge and goals are enrolling on these platforms from around the world. For over a decade now, many research communities have contributed to the development of these platforms and proposed specific solutions to the challenges and barriers they face, such as learner engagement (Anderson et al. 2014), learner motivation (Durksen et al. 2016) and learner performance (Jiang et al. 2014). One way to lift some of these barriers is via accurate and on-time instructor intervention on MOOC discussion forums (Alrajhi et al. 2021).

Discussion forums enable online learners to express their ideas, ask questions, and seek help (Crossley et al. 2015). In addition, they create social connections and facilitate communication among learners and instructors (Stump et al. 2013). In this context, instructors play an important role in monitoring comments, especially to provide the help and support learners need: an accurate on-time intervention may make the difference between a learner continuing on the course and dropping out; indeed, (Alrajhi et al. 2021) showed that learners are less likely to finish the course (about 13%) if they frequently make comments that require intervention. However, continuously monitoring such huge numbers of comments is a time-consuming and sometimes overwhelming task for instructors: hundreds or even thousands of comments are posted during courses, sometimes for each course step, and identifying those which require urgent intervention can be almost impossible. This is exacerbated by the high ratio of learners to instructors (Almatrafi and Johri 2018).

As MOOCs generate huge amounts of textual data, another way of addressing this issue is via Natural Language Processing (NLP). The work presented in this paper uses NLP to help instructors to address urgent comments and enable them to decide when to react, by creating an automatic text-classification model.

Another core problem in this area is the intrinsically imbalanced nature of the data; such datasets are characterised by a highly skewed class distribution due to the (naturally) small number of ‘urgent comment’ instances. In text classification tasks, performance often depends on the quality of the data (Wei and Zou 2019). Therefore, to tackle the imbalanced data problem and improve the size and quality of the training data, we manipulate the dataset in three ways: text augmentation; text augmentation combined with undersampling; and undersampling alone.

To illustrate the usage of the fine-grained learner models in adaptive support for instructor intervention, we describe an adaptation case where instructors can decrease their workload by using one of our models. We also showcase an expanded model that uses more extensive learner knowledge (based on the number of comments per learner), to discuss how such adaptation models can be further expanded.

1.1 Contributions

The main contributions of our research are, to the best of our knowledge:

  • Creating the first learner, instructor and adaptation models to support instructors to deal with urgent comments in MOOCs.

  • For the first time in the literature, applying data balancing techniques for shallow and deep machine learning to identify instances when urgent instructor intervention is required on MOOCs. These techniques include text augmentation, text augmentation with undersampling, and undersampling to overcome the imbalanced data problem and improve performance. This is achieved by ‘forcing’ the algorithm to increase the weight of the minority class.

  • Creating the first gold standard corpus MOOC Urgent iNstructor InTErvention (UNITE) for instructor intervention in MOOC environments (the FutureLearn platform), which has been annotated by carefully selected experts in the field. This will be made available (after ethical cleansing) to the research community.

  • Proposing several new pipelines (3X and 9X) to generate more data for text augmentation by incorporating different NLP augmenters and providing a range of approaches.

  • Showcasing the challenges and difficulties involved in instructor-intervention decisions in MOOC environments, by manually inspecting and analysing the (relatively small) set of errors generated by the best classifier, along with the best data balancing and text augmentation solutions.

2 Literature review

Today, the instructor intervention problem is one of the most challenging in MOOC environments. Separately, a related, even less explored area of research has emerged, identifying the difficult area of urgent post detection (Almatrafi et al. 2018; Guo et al. 2019; Alrajhi et al. 2020; Khodeir 2021). However, an obvious omission is that, for urgent posts, imbalance is a characteristic of the data itself (as there are normally fewer urgent comments than non-urgent ones). This fact has been overlooked in urgent post detection. The closest research to this (Almatrafi et al. 2018; Khodeir 2021) considered some standard techniques: splitting the data, training the model, and selecting the evaluation metrics, but did not address the data imbalance itself. In addition, while available intervention models for urgent comments concentrated on classifying posts, they paid no attention to the behaviours of learners, nor did they design adaptive instructor intervention models based on learner (or instructor) models. Therefore, this section reviews the literature areas closest to our proposal: (1) the important area of instructor intervention in MOOCs, focusing on urgent posts, (2) the area of text augmentation, specifically for balancing data, and (3) adaptive models in MOOCs.

2.1 Instructor intervention in MOOC forums

In 2012, the first MOOC-like discussion forums were developed and immediately aroused researchers’ interest. According to (Almatrafi and Johri 2018), 234 researchers inspected discussion forums from 2013 to 2018. It was not until 2014 that (Chaturvedi et al. 2014) first investigated the intervention problem, by building numerous models to predict which forum discussion thread instructors should intervene on. They utilised course information, forum structure, and post content; importantly, they also considered information on whether the next post to be written was by an instructor, thus capturing characteristics of real instructor behaviour. Similarly, (Chandrasekaran et al. 2015b) built a classifier that considered prior knowledge of the forum type. These researchers used the Coursera platform and trained on historical instructor interventions. This approach, we argue, is inadequate, since (historical) instructor intervention likely resulted from a subjective decision to offer support. Moreover, it is arguably based on decisions on a subset of posts, because instructors may not have had sufficient time to read all the posts related to a particular course, to decide which were urgent (Chandrasekaran et al. 2015b).

The first research to use the Stanford MOOC Post dataset (Bakharia 2016) proposed a generalizable transfer-learning-based model to identify urgency as one of three forum-post classifications (confusion, urgency, and sentiment), by applying a cross-domain approach. Whilst the model failed to obtain adequate results, the author recommended transfer-learning as worthy of further research. Wei et al. (2017) followed the same cross-domain technique but applied a deep neural network element; this increased performance.

Almatrafi et al. (2018) also utilised the Stanford MOOC Post dataset to classify urgent posts, by training different shallow classifiers and proposing the best features for them. Sun et al. (2019) instead used deep learning, via improved recurrent convolutional neural networks (RCNN), achieving higher performance in identifying urgent posts compared to other models (naïve Bayes, SVM (RBF), random forest, CNN, RNN, LSTM, GRU, and RCNN). Another work by (Guo et al. 2019) proposed a hybrid neural network based on the attention mechanism to recognise urgent posts. With a similar goal, (Alrajhi et al. 2020) used a multidimensional model to determine urgent posts requiring intervention, comparing two different models: (i) text-only, and (ii) text and numerical data. The findings highlight that the combined, multi-dimensional-features model is more effective than the text-only (NLP) analysis. Clavié and Gal (2019) created EduBERT, a contextualised word-embedding technique, which represents the current state-of-the-art performance on classifying urgent posts using EduDistilBERT (0.835 recall for the minority class). Khodeir (2021) built an urgency classification model based on fine-tuned BERT as an embedding layer, feeding into a multi-layer bi-directional GRU, and reported recall values of 0.815, 0.847 and 0.831 for three groups, which is close to the state of the art.

Another work that used the Stanford MOOC Post dataset for the intervention task (Capuano and Caballé 2019) proposed a text categorisation tool for a multi-attribute categorisation of MOOC forum posts; one of these attributes is the level of urgency, with preliminary results intended for use in intervention, or as input for conversational agents (chatbots). Their follow-up study (Capuano et al. 2021) improves this tool, using attention-based hierarchical recurrent neural networks. However, their work classified urgency into three categories (low, medium and high) and reported an average recall (R), instead of the per-class R. In addition, (Rossi et al. 2021) detected which type of pedagogical intervention is required, based on a conversational agent, using an ontology and a set of semantic rules.

Another study (Toti et al. 2020) built on the approach in (Capuano and Caballé 2019), creating a methodology to detect engagement in e-learning platforms and to help instructors with the timeliness of their interventions, based on different aspects, one of these being urgency, detected as a classification task; however, their work lacked an implementation.

The vast majority of recently published research on urgent post classification uses the Stanford MOOC Post dataset as the data source. However, even though this dataset is an excellent resource, it still represents just one platform; hence, research has to expand to others, to represent the current wide range of real-life MOOC environments, as different platforms have different structures and (acceptable) numbers of words per post. To address this research gap and investigate other data sources, the present paper provides an analysis of the FutureLearn platform (which required additional effort to complete the manual annotation). However, in common with the Stanford MOOC Post dataset, our dataset also suffers from similar disadvantages: identifying when instructor intervention is required from the massive number of posts in MOOC discussion forums is challenging for classifiers, due to the extremely limited number of urgent cases, which causes a highly imbalanced dataset. Moreover, as explained, correctly identifying the minority class (urgent comments) is the most important task. To date, and to the best of the researchers’ knowledge, no research has targeted the problem of dealing with highly imbalanced data in the context of intervention in MOOCs.

2.2 Text augmentation

The other branch of prior research relevant to our paper is the use of text augmentation in NLP. The aim of text augmentation is to expand data (Liu et al. 2020), by providing and applying a set of techniques that create synthetic data from an existing dataset (Shorten et al. 2021). Text augmentation can enhance the performance of model predictions on a number of NLP tasks and helps prevent overfitting (Li et al. 2022). It is used to alleviate the issue of limited or scarce labelled training data (Anaby-Tavor et al. 2020), which leads to low accuracy and recall for the minority class (Liu et al. 2020).

The existing literature shows that previous researchers utilised NLP augmentation approaches; for example, (Wang and Yang 2015) applied text augmentation by performing synonym replacement and identified similar words based on lexical and semantic embedding. Another study by (Kobayashi 2018) proposed a new word-based approach for text augmentation based on contextual augmentation; they applied synonym replacement, by using a bi-directional predictive language model. Next, (Wei and Zou 2019) explored straightforward text editing techniques for augmentation, using one of four simple techniques (synonym replacement, random insertion, random swap and random deletion). Recent work by (Xiang et al. 2020) proposed a part-of-speech-focused lexical substitution for data augmentation (PLSDA) to generate more instances via word substitution. Another augmentation work is applied in translation: (Yu et al. 2018) generated new data to enhance their training data using back-translation with two translation models: the first translates sentences from English-to-French, while the second translates from French-to-English.

Some researchers tackled augmentation by using text augmentation libraries (NLPAug) for specific tasks. Jungiewicz and Smywiński-Pohl (2020) used a range of augmentation techniques for sentiment analysis, including (NLPAug) based on BERT and WordNet. More recently, (Pereira et al. 2021) used the same BERT-based library and contextual word-embedding augmenter to generate more programming problem statements on a training dataset.

In our current paper, we also augment the text data based on the (NLPAug) library. Unlike prior research, which usually focuses on the word level for augmented data, we use several different levels (character, word, sentence). We apply different techniques: word embedding with word2vec (using words as a target); contextual word embedding with BERT, DistilBERT, RoBERTa, and XLNet (using words or sentences as a target); and OCR engine error simulation (using characters as a target). In addition, we create various pipelines based on sequential flow. We construct three different approaches because, in textual augmentation, the best approach depends on the dataset: an approach that improves performance on one dataset may be detrimental on another (Qiu et al. 2020).

2.3 Adaptive models in MOOCs

In this section, we present related works on adaptation and adaptive models implemented in MOOCs. As MOOCs are a rather recent addition, with the term ‘MOOC’ coined in 2008 (Stracke and Bozkurt 2008), and the ‘year of the MOOCs’ only launching them in 2012 (Jordan and Goshtasbpour 2022), adaptation has been slow to be introduced to them, with most of them still being designed via the ‘one-size-fits-all’ paradigm (Shimabukuro 2016; Rizvi et al. 2022), to some extent in spite of the decades of research in adaptive educational hypermedia (Ahmadaliev et al. 2019), intelligent tutoring systems (Mousavinasab et al. 2021; Hodgson et al. 2021), and the like. Nevertheless, a few researchers have started proposing adaptation in MOOCs. For instance, (Alzetta et al. 2018) designed a customised learning path in an interactive and mobile learning environment and in MOOCs using a Question/Answering (QA) system. Another work on adaptive models in MOOCs (Lallé and Conati 2020) created a framework for user modelling and adaptation (FUMA), as adaptive support for learners during video usage; video watching and interaction behaviours were used as features, to reveal inactive learners. Another very recent work proposes an optimal learning path, to prevent MOOC learners from dropping out (Smaili et al. 2022); it provides each learner with an appropriate adaptive path, based on interaction with the environment, using particle swarm optimisation (PSO).

In this research, unlike in previous research, we enable building adaptive models based on learner comments, with the aim of improving communication with instructors.

3 Methodology

This study aims to automatically classify if a MOOC learner’s comment is urgent and so requires flagging for instructor intervention. This means modelling learner data (their comments) to recommend an action to the instructor (here, reply). We call this a fine-grained learner model, as each learner is represented by the set of their comments. More formally, we can write that for learner l1, their learner model L between time points t1 and t2 is given by:

$$L(l_{1}, t_{1}, t_{2}) = \{\, F_{t(c) \in [t_{1}, t_{2}]}\big(\mathrm{urgency}(l_{1}, c)\big) \,\}$$

where F(.) can be any function aggregating the urgency for a given interval (e.g. a sum of urgency values), and urgency(l1,c) represents the fine-grained learner information at the level of a single comment c of learner l1, made during the given time interval [t1,t2]. This learner model L(.) is used to drive the recommendation to instructors (see Sect. 3.3). To achieve this objective, we manually annotate a FutureLearn corpus; we additionally use the widely used benchmark Stanford dataset to validate our best model, thus demonstrating the generalisability of our approach, and its applicability across courses and domains. More information on these datasets can be found in Sect. 3.1.
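To make the learner model concrete, the following minimal sketch (ours, not the authors' implementation; the function and variable names are illustrative) aggregates per-comment urgency scores for one learner over a time window, with summation as the default aggregator F:

```python
from datetime import datetime
from typing import Callable, Iterable, Tuple

def learner_model(comments: Iterable[Tuple[datetime, str]],
                  t1: datetime, t2: datetime,
                  urgency: Callable[[str], float],
                  aggregate: Callable = sum) -> float:
    """Aggregate fine-grained (per-comment) urgency for one learner in [t1, t2].

    `urgency` is any per-comment scorer, e.g. the predicted probability of the
    'urgent' class from the classifiers described later in this section.
    """
    scores = [urgency(text) for posted_at, text in comments if t1 <= posted_at <= t2]
    return aggregate(scores) if scores else 0.0
```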

To determine the appropriate method, we use NLP techniques to construct a diverse predictive model for text classification. We employ two main types of supervised classifiers:

  1. 1.

    A traditional machine learning approach, with handcrafted features as a baseline model;

  2. 2.

    A fine-tuned version of BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al. 2018), representing the latest advance in NLP at the time of writing, as a powerful supervised deep learning model.

To tackle the imbalance problem, several different techniques were employed (see Sect. 3.2.2). One technique that we consider is text augmentation; here, we rely on different approaches (see Text Augmentation in Sect. 3.2.2) and augment the minority-class data with various multipliers (such as 3X and 9X). The reason for using text augmentation is that it helps prevent overfitting; it is considered a crucial regularisation technique (Coulombe 2018).

3.1 Datasets

This research was conducted on two MOOC platform datasets: FutureLearn and the Stanford MOOC Post dataset.

3.1.1 Building UNITE: a Futurelearn-based dataset

FutureLearn, a European MOOC learning platform, is based on a ‘discussion in context’ approach; comments are attached at each course step in the discussion area, excluding steps for quizzes and exercises (Chua et al. 2017). We collected comments written and posted by learners on a Big Data course, as a case study. This course was provided by Warwick University, United Kingdom. We selected this course due to its richness in comments, popularity, and the novelty of the subject, which would likely include an adequate number of urgent comments. Then, we prepared the data and manually annotated the dataset with the help of human experts, to create the Gold standard MOOC Urgency Corpus, a hand-labelled dataset. This task proved to be quite challenging even for the human experts, confirming the findings of previous researchers: (Chandrasekaran et al. 2015a) noted that it is difficult for humans to create such a gold standard dataset via manual labelling of individual cases requiring instructor intervention.

3.1.1.1 Creating the gold standard dataset (UNITE)

The corpus consists of 8263 comments (textual data in English) from the discussion forum of the above-mentioned course, extracted over a 9-week period. Our research objective is to classify urgent comments in discussion forums from the first half of the course, as our previous research indicated that most learners who dropped out were likely to do so in the early stages (Cristea et al. 2018; Alamri et al. 2019), so intervention, if required, would likely be needed early on. In this regard, the following steps were taken to select suitable instances from the original data and prepare them for the annotation process. Learners’ comments from the first half (weeks 1 to 5) of the course were extracted, representing approximately half of the 9-week course. Next, all instructor comments were excluded. This resulted in a total of 5790 comments.

The annotation process was performed independently and manually by four computer science experts: three working as instructors in the Department of Computer Science at a different university to the authors (Kwara State University, Nigeria); additionally, the first author of the present paper was involved in labelling. In creating the Gold standard MOOC Urgency Corpus, we took a similar approach to that used for creating the Stanford dataset, as on (Agrawal and Paepcke)’s website (https://datastage.stanford.edu/StanfordMoocPosts/) and in their research (Agrawal et al. 2015). Specifically, a Likert scale from 1–7 was used to classify the urgency of the comments: a value of 1 indicates that no reason exists for the instructor to read the post, while a value of 7 indicates extreme urgency (as shown in Fig. 1); for more information, see Sect. 3.1.2.

Fig. 1
figure 1

The scale of urgency applied (1–7)

First, the data were pre-processed to exclude all comments with meaningless labels (such as ‘`’, ‘44’, ‘0’, or empty); this left a total of 5786 comments.

Then, we validated and evaluated the quality of the manually labelled comments, by using the weighted Krippendorff's α (Antoine et al. 2014). The resulting agreement between annotators was low (α = 0.33); notably, the Stanford dataset partially suffered from similar issues: the agreement between the optimal coder combination for the Likert variable (1–7) varies considerably per domain (Education: 0.14; Humanities/Sciences: 0.52; Medicine: 0.63).
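As a minimal sketch of this agreement check (assuming the open-source `krippendorff` Python package and illustrative ratings; the original analysis may have used different tooling), ordinal weighting approximates the weighted α on the 1–7 Likert scale:

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows = annotators, columns = comments; values are 1-7 urgency ratings,
# with np.nan for missing ratings. The numbers below are illustrative only.
ratings = np.array([
    [1, 4, 7, 2, np.nan],
    [1, 5, 6, 2, 3],
    [2, 4, 7, 1, 3],
    [1, 4, 6, 2, 3],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```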

Therefore, we first converted the (1–7) scale into a simplified (1–3) scale, as per Fig. 2. This meant, e.g. mapping 1, 2 and 3 together into (1), as they are all non-actionable (non-urgent). When recalculating the agreement, it remained, however, low (α = 0.31).

Fig. 2
figure 2

Dimensionality reduction: converting the (1–7) scale into a (1–3) scale

Thus, to be able to use the data reliably, we decided to identify a dependable sub-set; this sub-set was selected by including only comments that have a level of agreement between annotators of > 75%; in other words, at least 3 annotators (out of 4) must have agreed on the comment’s label. Thus, we used a voting method, which is considered the most appropriate way to integrate different opinions about the same task (Troyano et al. 2004). In this case, only 4622 reliable comments could be included in the gold standard dataset (approximately 80% of the original data).

As we aimed to obtain as many potentially urgent comments as possible, we framed the problem as a binary classification problem, with outputs Urgent and Non-urgent, by converting and ranking the gold standard labels as:

  • Scale = 2 or 3 → Urgent.

  • Scale = 1 → Non-urgent.

Figure 3 depicts the final gold standard labels generated for this research. Please note that we erred on the side of caution in this final step by including neutral comments (urgency = 4) as urgent. This is because, for the Stanford data (Sect. 3.1.2), while some researchers considered that urgency ≥ 4 represents Urgent comments (Almatrafi et al. 2018), others regard urgency > 4 as Urgent (Guo et al. 2019). As here we were only working with integer values for labels, we considered that a value of 4 and above signifies Urgent. This is also in line with our protocol of favouring recall and tolerating false positives (FP).

Fig. 3
figure 3

Final gold standard labels for the UNITE corpus

Therefore, we define urgent comments as comments that need a response from instructors. In general, urgent comments can be about a specific problem encountered, or other latent causes, such as frustration, lack of knowledge, and change in circumstances.
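A simplified sketch of the labelling procedure is given below (ours, with illustrative data); note that, for brevity, it applies the ≥ 3-out-of-4 agreement filter directly to the binarised labels, whereas the paper filters on the intermediate (1–3) scale before binarising:

```python
import pandas as pd

def gold_label(row, annotator_cols, min_agree=3):
    """Binarise per-annotator ratings (>= 4 -> urgent) and keep the comment only
    if at least `min_agree` annotators give the same binary label."""
    binary = [1 if r >= 4 else 0 for r in row[annotator_cols]]
    ones = sum(binary)
    zeros = len(binary) - ones
    if ones >= min_agree:
        return 1            # Urgent
    if zeros >= min_agree:
        return 0            # Non-urgent
    return None             # insufficient agreement: excluded from the gold standard

annotators = ["a1", "a2", "a3", "a4"]            # illustrative column names
df = pd.DataFrame({"comment": ["help, the video will not load", "great week, thanks"],
                   "a1": [6, 2], "a2": [5, 1], "a3": [7, 1], "a4": [3, 2]})
df["label"] = df.apply(gold_label, axis=1, annotator_cols=annotators)
unite = df.dropna(subset=["label"])              # the reliable sub-set used as the gold standard
```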

Unsurprisingly, for our UNITE dataset, this division still resulted in a very high proportion of the comments being categorised as non-urgent (93%, i.e. 4292 comments), with only 330 urgent comments (7%), showing a high degree of imbalance.

3.1.2 Stanford MOOC post dataset

The Stanford dataset (Agrawal et al. 2015) is a gold standard dataset available to academic researchers on request. It contains a large number of English learner forum posts (29,604 in total), commenting on 11 Stanford University MOOCs across three different domains (Humanities/Sciences, Medicine, and Education). Humanities/Sciences include six courses, Medicine four courses, and Education one course. The forum-post annotation was performed by three independent human coders: each post was manually labelled on six dimensions (confusion, sentiment, urgency, question, answer and opinion). Confusion, sentiment and urgency were scored from 1 (low) to 7 (high), while the other items, question, answer and opinion, were labelled on a binary scale (0 or 1). For more information, see the (Agrawal and Paepcke) website (https://datastage.stanford.edu/StanfordMoocPosts/).

Similar to UNITE, for the Stanford dataset, the data were pre-processed by removing unmeaningful comments. This resulted in a total of 29,597 comments.

In the Stanford dataset, the urgency score (i.e. how urgent is it that an instructor reads the post) ranged from 1 = non-urgent to 7 = very urgent as shown in Fig. 1. However, in the current paper, we followed the classification detailed in Sect. 3.1.1: we framed the problem of detecting urgency as a binary classification task, by converting urgency into a binary value:

  • Scale > 4 → Urgent.

  • Remainder → Non-urgent.

We set our threshold to > 4 because, in the Stanford dataset, the label-calculating method does not always produce an integer (possible values: 1/1.5/2/2.5/3/3.5/4/4.5/5/5.5/6/6.5/7). This is further supported by our previous findings (Alrajhi et al. 2020), where we found a correlation between specific values (4 and 4.5) for the sentiment and confusion scales.

Ultimately, across the whole dataset, non-urgent cases represented 81% (23,991 comments) and urgent cases represented 19% (5606 comments, varying between 3.2% and 37.6% across the 11 courses), with urgent posts being those with urgency > 4.

3.2 Experiments for imbalanced data

To achieve a comprehensive understanding of the best way of automatically identifying the urgency of comments on MOOCs, we use, as mentioned, two common supervised machine learning strategies (traditional shallow ML and deep learning with BERT) to automatically classify the comments. Additionally, as urgency detection is a typically imbalanced data problem (hence, any MOOC provider would need to take imbalance into account), we experiment with various techniques to deal with the input data, as per Fig. 4.

Fig. 4
figure 4

Our proposed pre-processing (data balancing) and ML pipeline combinations

First, we apply several training models to the original data of the gold standard UNITE corpus. Then, to improve performance, we design and develop three solutions to handle imbalanced data: (i) text augmentation; (ii) text augmentation + undersampling; and (iii) undersampling (see details in Sect. 3.2.2). Text augmentation involves performing a range of approaches in different combinations, to augment the minority class in the training data. In undersampling, we randomly select instances from the majority class. Text augmentation + undersampling is a combination of the two previous techniques. All the experiments were conducted using a stratified four-fold cross-validation approach, to ensure representative results. The general architecture of the proposed classification model is shown in Fig. 5; all the experiments are explained in detail in Sect. 3.2.
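For concreteness, a minimal sketch of the stratified four-fold protocol is shown below (assuming scikit-learn; the data and the balancing step are placeholders):

```python
from sklearn.model_selection import StratifiedKFold

# Toy stand-ins for the UNITE comments and their binary labels (1 = urgent).
texts = [f"comment {i}" for i in range(20)]
labels = [1 if i < 4 else 0 for i in range(20)]

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(texts, labels), start=1):
    X_train = [texts[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    X_test = [texts[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    # Balancing (augmentation and/or undersampling) is applied to the training
    # fold only; the test fold keeps the original class distribution.
```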

Fig. 5
figure 5

The general architecture of the classification model

3.2.1 Classifiers

As said, we compare two major classification model types to classify the comments: (i) shallow machine learning (baseline models built with classical machine learning algorithms), and (ii) BERT (one of the most popular transformer models, as further explained).

3.2.1.1 Shallow machine learning

We apply several machine learning models (see Fig. 6) to the classification task, each with different fundamental mechanisms for feature engineering, to capture the most effective features. This includes count vectors and term frequency–inverse document frequency (TF-IDF), to find an adequate classifier to predict urgent comments. We extract different feature sets via four different classical methods: (i) count vector; (ii) TF-IDF vector (word level); (iii) TF-IDF vector (n-gram word level); and (iv) TF-IDF vector (n-gram character level). Then, we build different popular classifiers on each of these feature sets (naive Bayes, logistic regression, support vector machine, random forest, and a boosting model, extreme gradient boosting (XGBoost)), as displayed in Fig. 6.

Fig. 6
figure 6

Our framework of the shallow ML classifiers using different features

We represent each comment with a specific vector; the count vector counts the frequency of every given word in every comment. TF-IDF calculates a numerical statistic evaluating the relatedness between a particular word and a specific comment within a collection of comments; it thus measures how important a word is in that collection. Three different levels of TF-IDF tokens were considered (word, n-gram word with range (2,3), and n-gram character with range (2,3)), with maximum features = 5000.
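A sketch of this feature/classifier grid, assuming scikit-learn and XGBoost (any hyper-parameter beyond those stated above is an assumption), could look as follows:

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

# The four classical feature sets described above (n-gram ranges of (2, 3),
# max_features = 5000, as in the text).
vectorisers = {
    "count": CountVectorizer(max_features=5000),
    "tfidf_word": TfidfVectorizer(analyzer="word", max_features=5000),
    "tfidf_ngram_word": TfidfVectorizer(analyzer="word", ngram_range=(2, 3), max_features=5000),
    "tfidf_ngram_char": TfidfVectorizer(analyzer="char", ngram_range=(2, 3), max_features=5000),
}
classifiers = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(probability=True),
    "random_forest": RandomForestClassifier(),
    "xgboost": XGBClassifier(eval_metric="logloss"),
}

# One pipeline per (feature set, classifier) combination; clone() gives each
# pipeline its own unfitted copy, to be fitted on every (balanced) training fold.
models = {(v_name, c_name): make_pipeline(clone(vec), clone(clf))
          for v_name, vec in vectorisers.items()
          for c_name, clf in classifiers.items()}
```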

3.2.1.2 BERT

For deep learning, we employ the currently most popular and competitive approach in text classification tasks: BERT. Using BERT enabled us to avoid feature engineering, as is typical for deep learning. We fine-tune a pre-trained ‘BERT-Base, Uncased’ (L = 12, H = 768, A = 12, Total Parameters = 110M) version of the BERT classifier, which is the smaller of the two available models and was selected due to its shorter training time, with one additional layer for classification. For the BERT input, which is a sequence of tokens, we limit each comment to its final 128 tokens. This decision on the final tokens and size is based on various pre-experiment trials (final/first tokens; different sizes) that rendered this number (128 tokens) the most suitable, encompassing most comments, with truncation only affecting 8% of UNITE and 10% of the Stanford data. We use the Adam optimizer to tune BERT over 4 iterations.
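A minimal fine-tuning sketch is given below, assuming the Hugging Face transformers library (the learning rate and the absence of batching/validation handling are simplifying assumptions; the paper's exact training script may differ):

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer.truncation_side = "left"   # keep the *final* 128 tokens (recent transformers versions)
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # adds one classification layer on top of BERT-Base

texts = ["I cannot open the dataset in step 2.3, please help", "Loved this week!"]  # placeholders
labels = torch.tensor([1, 0])

enc = tokenizer(texts, truncation=True, max_length=128,
                padding="max_length", return_tensors="pt")
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # learning rate is an assumption

model.train()
for epoch in range(4):               # 4 iterations, as stated above
    optimizer.zero_grad()
    out = model(**enc, labels=labels)
    out.loss.backward()
    optimizer.step()
```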

3.2.2 Text balancing techniques

We developed several classifier models based on different techniques for manipulating the data. First, each of our models was run using the original gold standard corpus. Then, to tackle the imbalance problem, we independently applied the following approaches: (i) text augmentation; (ii) combined text augmentation followed by undersampling; and (iii) resampling using undersampling.

3.2.2.1 Original data usage (gold standard corpus)

As an initial experiment, we implemented all our models directly with the original UNITE data. We split the dataset into four groups using stratified k-fold cross-validation, choosing a value of k = 4 (4 folds). We chose the k-fold cross-validation approach because it allowed us to obtain results with less bias towards specific data (Berrar 2019). We use stratification in the dataset: the selection of data led to an equal distribution of every class in every set. Thus, every fold contained the same percentage of samples from each class (see Fig. 7), as follows: training fold 3466 or 3467 samples (3219 as class 0, i.e. non-urgent, and 247 or 248 as class 1); testing fold 1156 or 1155 samples (1073 as 0 and 83 or 82 as 1) in each iteration (see Table 1). Please note that we did not use the more frequently encountered ten-fold validation, as, due to the very low number of urgent cases, this would have resulted in a too-low value per stratum for efficient stratification.

Fig. 7
figure 7

Splitting the data using k-fold cross-validation and stratification

Table 1 Number of cases for every class in (training, test) sets in each iteration: original data

For the training with BERT, we further divide the training data into 90% training (0 = 2897, 1 = 222 or 223) and 10% validation (0 = 322, 1 = 25), again using stratification.

However, we found the results unsatisfactory for the various classifiers (see Sect. 4). We considered this to be due to class imbalance. To overcome this issue and enhance prediction performance, we next employ alternative techniques, as described in the following sections.

3.2.2.2 Text augmentation

To manage the class imbalance and boost performance, we instead pre-process the data using artificial resampling (augmentation) to generate more minority-class cases for the training set of each fold, towards an almost balanced dataset. We augment every instance in the minority class into three and nine instances, respectively. We chose these values based on the literature, which reports that, for some datasets, a low number of repetitions might not be sufficient to decrease the bias of the model towards indiscriminately predicting the majority class, whereas a higher repetition value might render the data non-representative (Haixiang et al. 2017; Madabushi et al. 2020; Fonseca et al. 2020), so experimentation is necessary. Thus, in our work, we experiment with both; the number of items in the training and test sets at every iteration, for 3X and 9X augmentation, is shown in Table 2.

Table 2 Number of cases for every class in (training, test) sets in each iteration: text augmentation (3X–9X)

To achieve the augmentation goal, we apply common, easy-to-implement techniques for text augmentation, using the public NLPAug library, a Python library dedicated to augmentation (Raghu and Schmidt 2020); we accessed it via the Makcedward GitHub repository (Makcedward 2020). We use 3 different hybrid approaches: (i) word-level with the same type (BERT); (ii) word-level with different types; and (iii) different levels (character, word, sentence), as shown in Table 3.

Table 3 The approaches using different augmenters

In the first approach, we apply a hybrid method consisting of three different actions (3X) with the ContextualWordEmbsAug augmenter: inserting with BERT, substituting with BERT, and substituting with DistilBERT, to discover the most appropriate words for augmentation, as shown in Table 4.

Table 4 An example of different augmenters for 3X in the first approach on a comment in UNITE
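A sketch of this first approach, assuming the NLPAug augmenter classes named above (recent NLPAug versions return a list of strings from augment(); the example comment is illustrative), is:

```python
import nlpaug.augmenter.word as naw

comment = "I still cannot run the Hadoop example from week 2, can anyone help?"  # illustrative

# The three augmenters of the first approach (3X): contextual word embeddings only.
augmenters = [
    naw.ContextualWordEmbsAug(model_path="bert-base-uncased", action="insert"),
    naw.ContextualWordEmbsAug(model_path="bert-base-uncased", action="substitute"),
    naw.ContextualWordEmbsAug(model_path="distilbert-base-uncased", action="substitute"),
]

# Each urgent training comment yields three synthetic urgent comments.
augmented_3x = [aug.augment(comment) for aug in augmenters]
```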

Then, we build upon the (3X) method and increase the number of instances to (9X), by generating an additional 3X more instances for every instance. This is achieved by constructing six sequential pipelines, each representing a multi-augmenter (bi- or tri-augmenter), as shown in Table 5. Table 6 provides examples of 9X augmentation.

Table 5 Different pipelines to generate (9X) in the first approach
Table 6 An example of different augmenters for 9X in the first approach

Next, we conduct the second approach, another augmentation procedure, by mixing several augmenter functions based on word-level (see Table 3): WordEmbsAug (substitute word2vec) and ContextualWordEmbsAug (substitute BERT and substitute RoBERTa).

Last, as per Table 3, we construct the third approach, which is based on three different levels of augmenter (character, word and sentence). For character level, we used OcrAug (character substitution simulating OCR errors). For word level, we used ContextualWordEmbsAug (substitution with BERT). For sentence level, we used ContextualWordEmbsForSentenceAug (sentence insertion with XLNet).
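The third approach, and the sequential-flow idea behind the multi-augmenter (9X) pipelines, can be sketched as follows (again assuming the NLPAug API; the specific pipeline combination shown is only one possible example):

```python
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as naf

comment = "The code in step 3.4 throws an error, please advise."  # illustrative

# Third approach: one augmenter per level (character, word, sentence).
char_aug = nac.OcrAug()                                                          # OCR-style character noise
word_aug = naw.ContextualWordEmbsAug(model_path="bert-base-uncased",
                                     action="substitute")                        # BERT word substitution
sent_aug = nas.ContextualWordEmbsForSentenceAug(model_path="xlnet-base-cased")   # XLNet sentence insertion

three_x = [aug.augment(comment) for aug in (char_aug, word_aug, sent_aug)]

# For 9X, additional instances come from chaining augmenters into sequential
# pipelines (bi- or tri-augmenters), e.g. character noise followed by BERT substitution.
pipeline = naf.Sequential([char_aug, word_aug])
extra = pipeline.augment(comment)
```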

Then, we apply the shallow machine learning and BERT models, as explained in Sect. 3.2.1, based on 3X and 9X augmentations.

3.2.2.3 Text augmentation + undersampling

By creating nine new artificial instances per minority-class instance in the training set, we obtain an almost-balanced dataset, albeit with a concern about its non-representativity. By creating only three new instances, we moderately increase the data variation but make a smaller move towards balancing the dataset; the model may then still minimise its errors by frequently predicting the majority class, achieving high accuracy yet low recall and precision for the minority class. We deal with these two concerns by applying a hybrid resampling method, combining the augmentation technique with undersampling.

In these experiments, we aim to balance the datasets by combining both the text augmentation and undersampling methods, as follows. First, we increase the minority class to 3X or 9X via augmentation. Second, we undersample by randomly reducing the number of elements in the majority class to equal the (augmented) minority class in every fold. The resulting number of samples per class for each pipeline was therefore approximately 990 for 3X and 2475 for 9X.
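Random undersampling itself is straightforward; a minimal sketch (ours, with illustrative names) that cuts the majority class down to the size of the, possibly already augmented, minority class is:

```python
import random

def undersample(texts, labels, seed=42):
    """Randomly drop majority-class (non-urgent, 0) items until both classes are equal."""
    rng = random.Random(seed)
    urgent = [(t, y) for t, y in zip(texts, labels) if y == 1]
    non_urgent = [(t, y) for t, y in zip(texts, labels) if y == 0]
    non_urgent = rng.sample(non_urgent, k=len(urgent))   # match the minority-class size
    balanced = urgent + non_urgent
    rng.shuffle(balanced)
    return [t for t, _ in balanced], [y for _, y in balanced]

# Applied per training fold: the minority class is first grown 3X or 9X by
# augmentation, then the majority class is reduced to the same size.
```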

3.2.2.4 Undersampling (random)

To balance the class distribution in the original data, we performed an alternative popular method—the undersampling technique for imbalanced data classification—by randomly removing instances in the majority class. Thus, in this case, the numbers of samples for each class were 247 or 248.

3.2.2.5 FutureLearn and Stanford datasets

As explained, the distribution of the urgent class in the FutureLearn dataset (7%) differs from that in the Stanford dataset (19%). Therefore, these techniques for handling imbalanced data were expected to affect the performance results of the two datasets differently. Figures 8 and 9 show the distribution of every class in every fold for every method, for UNITE and for the Stanford dataset (3X), respectively.

Fig. 8
figure 8

The distribution of every class in every fold in every method for UNITE: our FutureLearn dataset

Fig. 9
figure 9

The distribution of every class in every fold for every method for the Stanford dataset

3.3 Illustration of adaptive intervention models

In this section, we introduce the design of illustrative adaptive intervention models for instructor interaction, based on our automatic urgency detection. These models showcase how the user model parameters proposed by this study can fit into simpler or, gradually, more complex user models; ‘users’ here means instructors, as primary target users, and learners, as potential secondary target users. Specifically, we provide two practical scenarios for semi-automatic instructor intervention: (1) semi-automatic intervention that tackles unbalanced data with a classification model; and (2) comment filtering that improves instructor intervention, by filtering the results based on learners, their number of comments and the time of posting of the comment.

3.3.1 Semi-automatic instructor intervention: basic scenario

The first scenario introduces an artificial supporting instructor, as a pipeline incorporating the classification model, representing the learner model, using additional information on the instructor (the instructor model), as shown in Fig. 10.

Fig. 10
figure 10

The adaptive intervention model based on learners' comments; note how our predicted urgency becomes a (derived, fine-grained) learner model variable, together with the comments per learner

A basic instructor model would minimally contain variables such as the instructor time available for a specific session and the reading time per comment, or, alternatively, a maximum number of comments to read in that session (hence, a simple 2-variable user model for the instructor). The learner model contains 2 variables as well: the learners’ comments and the urgency of comments at post level (fine-grained). Based on this information, the adaptive intervention model can automatically retrieve the topmost urgent comments, depending on their ranking (e.g. from a probability score given by a classification model), thus reducing the overload on the instructor.

For example, instructor Laura has answered all of yesterday's comments from learners. She wishes to know if there are any urgent comments today, as she has only 30 min, after which she needs to go and teach another course. All this information represents the instructor model. The MOOC webpage for today has 3 items, and each has acquired, on average, 150 comments. She thinks that she would be able to answer about 10–15 comments at most (and adds this information to her instructor model). Thus, the artificial support instructor recommends that Laura answer the top 5 most urgent comments for each of the 3 items from today’s class. This recommendation represents the adaptive model: the combination of the classification model and our technique for dealing with imbalanced data, which automatically classifies posts and detects urgent comments, adapting to the instructor’s needs, helping Laura avoid reading all the comments, and improving her interaction with the learners.
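Retrieving the top-k urgent comments from a fitted classifier is then a simple ranking step; a sketch (ours; `predict_proba` assumes a scikit-learn-style pipeline such as the naive Bayes model sketched earlier) is:

```python
import numpy as np

def top_k_urgent(comments, model, k=5):
    """Return the k comments with the highest predicted probability of class 1 (urgent)."""
    proba_urgent = model.predict_proba(comments)[:, 1]
    top_idx = np.argsort(proba_urgent)[::-1][:k]
    return [(comments[i], float(proba_urgent[i])) for i in top_idx]

# e.g. show Laura the 5 most urgent comments for each of today's 3 course items:
# for item_comments in todays_items:          # hypothetical list of comment lists
#     shortlist = top_k_urgent(item_comments, fitted_pipeline, k=5)
```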

3.3.2 Semi-adaptive instructor intervention: expanded scenario based on coarse granularity and expanded learner models

The first scenario deals with the recommended urgent comments, as per our pipeline proposed in this paper. However, this model can be further improved. Next, we show how comments can be grouped, to further refine the learner model, and to deal with urgency at the (coarser-grained) learner level, instead of the comment level. This may show whether a learner is generally in trouble and needs support, which may make dealing with that learner more pressing. This is consistent with the findings of (Alrajhi et al. 2021), which showed that learners write more comments overall when they require urgent intervention.

For example, instructor John wishes to use Laura’s system for classifying comments, but has noticed that learners tend either to send many urgent comments when they are in trouble, or to be generally happy and thus send fewer comments. He would like his load reduced and hence not to answer seemingly urgent comments coming from learners with very few comments. Thus, he wishes learners to be grouped into urgent and non-urgent learners, as shown in Fig. 11. John will now be able to answer the urgent learners first, even if some of the non-urgent learners may have posted comments that sound urgent but are less likely to need intervention.

Fig. 11
figure 11

Refining the learning modelling of urgency, based on two groups (non-urgent/urgent)

An extension to the learner model would be to add this coarse-grained, learner-level classification, together with the number of comments per learner, and then to further cluster learners based on it. We compute the Pearson correlation between the number of comments written per learner and the number of those comments that need urgent intervention. We then apply Silhouette analysis, to check the number of clusters, followed by the Fisher–Jenks algorithm (as we only work on one-dimensional data), to perform the clustering. These clusters are then merged into 2 groups, differentiating learners with a high number of comments from those with a low number (urgent/non-urgent learners).
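A sketch of this clustering step is shown below, assuming scipy/scikit-learn for the correlation and Silhouette analysis and the `jenkspy` package for the Fisher–Jenks breaks (the learner counts are illustrative; jenkspy's keyword for the number of classes has changed across versions, so it is passed positionally):

```python
import numpy as np
import jenkspy                      # pip install jenkspy
from scipy.stats import pearsonr
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# One value per learner (illustrative data).
comments_per_learner = np.array([1, 2, 2, 3, 5, 8, 12, 15, 20, 34])
urgent_per_learner = np.array([0, 0, 1, 0, 1, 2, 4, 5, 7, 11])

r, _ = pearsonr(comments_per_learner, urgent_per_learner)   # reported as 0.65 in the paper

# Silhouette analysis to choose the number of clusters on the 1-D comment counts.
X = comments_per_learner.reshape(-1, 1)
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10).fit_predict(X))
          for k in range(2, 6)}
n_clusters = max(scores, key=scores.get)

# Fisher-Jenks natural breaks (positional args: values, number of classes),
# then assign each learner to the cluster defined by the inner break points.
breaks = jenkspy.jenks_breaks(comments_per_learner.tolist(), n_clusters)
cluster = np.digitize(comments_per_learner, breaks[1:-1], right=True)

# Merge the upper clusters into one 'urgent learner' group (high number of comments).
is_urgent_learner = cluster >= 1
```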

In addition, we further adapt the intervention based on the time stamp of the comments of each of these learners, to provide John with comments of the urgent learners, ordered first-come-first-served (FCFS). Thus, number of comments and time stamps are variables added to the extended learner model in this example. The overall adaptive model is summarised in Fig. 12, using the same instructor model as previously, but an expanded, three-variable learner model: coarse-grained learner-level urgency; with fine-grained post-level urgency; and learner comments.

Fig. 12
figure 12

The adaptive intervention model based on coarse-grained, expanded learner modelling, with two learner groups based on number of comments (low/high); here, the instructor model is the same as in Fig. 10, but the learner model has been expanded with an additional variable

4 Results

This section provides a discussion of our experimental results for the two main types of classifiers (shallow machine learning and deep learning), as well as results related to our example adaptive intervention models.

4.1 Experiments for imbalanced data

4.1.1 Shallow machine learning on the UNITE dataset

In shallow machine learning, five different classifiers were tested with different types of feature engineering and different augmentation approaches (Approach #1: word level, with the same type (BERT); Approach #2: word level with different types (Word2vec, BERT and RoBERTa); and Approach #3: different levels (character, word, sentence) with different types (OCR, BERT and XLNet)), as discussed in Sect. 3.2.2 under the text augmentation heading. Table 7 shows the accuracy (ACC) of the basic classifier (naive Bayes) with count vector features under each of these settings. Despite some of these models obtaining around 90% accuracy (see Table 7), this does not mean that they are good models; they could be biased towards the majority class of the imbalanced dataset. Thus, to achieve more accurate results, we used other metrics to measure performance, such as Precision (P), Recall (R) and F1 measure (F1), derived from the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) of the confusion matrix.

Table 7 Results of the Naive Bayes model with count-vector feature engineering with original data, with 3 approaches to augmentation (see Table 3) using 3X and 9X (Table 2) with and without undersampling and with undersampling without augmentation. Underline: best R for class 1 (Urgent), Bold: best performance, balancing between class 1 (Urgent) and class 0 (Non-urgent) in the UNITE dataset

This research project aims to correctly classify urgent cases, which is captured by Recall (R). We propose to use Recall as the main evaluation metric for urgent comments, as Recall (the correct identification of most of the urgent cases, preferably all, allowing for false positives) ensures that all urgent cases take precedence; this is more important than Precision (the correct identification of only urgent cases, but possibly missing some, allowing for false negatives). Specifically, we try to improve R for the positive class. In addition, we separately show how we can add a filtering process, to retrieve the most urgent comments as prioritised by their probability under the classification models. Thus, we potentially reduce the instructor effort required to review and read many comments.

Table 7 shows the count-vector feature as a case study. The R for class 1 (urgent), based on the original data, was very low (0.05). We were able to improve performance by applying different approaches to enhance the data and address the imbalance problem. The best R was obtained using undersampling (Under), a substantial improvement (0.82; this improvement is statistically significant: Mann–Whitney U test: p < 0.05), but the result for class 0 dramatically decreased to 0.49 (from 0.99). In contrast, the data manipulated with 3X augmentation + undersampling achieved the best overall performance, balancing between class 1 and class 0. Most of the three augmentation approaches follow the same pattern; there are some exceptions, which will be discussed later in this section.

Our aim is to find the best technique to deal with the imbalanced data across the three different augmentation approaches (not to find the best feature engineering approach). The reason for using different features is to confirm which imbalanced-data technique performs better across all feature sets and to make our experiments more general. Therefore, we can generalise the findings to (a) all approaches on specific features, (b) all features on a specific classifier, and (c) all classifiers, since the effectiveness of the proposed methods of data manipulation was similar for most classifiers (as shown in Appendix B). We decided to report and discuss only one of these classifiers (naive Bayes) with one feature (count vector), for conciseness; the results of the other types of classifiers are provided in Appendix B. The exceptions are discussed in the next paragraph.

Whilst most of the findings are the same, there are a few exceptions; for example: (1) The strongest predictors for recall (R) were mostly those with undersampling (Under); however, some models (Random Forest and Boosting) with TF-IDF vectors (n-gram word level) perform better with other text augmentation approaches than with undersampling (Table 8). (2) The best performance was often obtained from the data with 3X augmentation + undersampling, achieving a balance between class 1 and class 0; but for some models, 9X augmentation + undersampling outperforms 3X augmentation + undersampling, as shown in Table 9. (3) In terms of approaches, the goal of building more than one approach was to generalise the results of the technique used in data manipulation. The results for the different approaches reveal that no single approach can be considered the best. However, interestingly, approach 3, which is based on different levels (character, word, sentence), always obtains the best R if we use TF-IDF vectors (n-gram character level) as a feature, across all experiments (as shown in Appendix B).

Table 8 Cases in which the results of the text augmentation techniques are higher than the results of undersampling technique
Table 9 Cases in which the results of the 9X augmentation + undersampling are higher than the results of 3X augmentation + undersampling

4.1.2 BERT on the UNITE dataset

When using BERT, Table 10 shows the prediction performance for the different methods of manipulating the data. As mentioned, only augmentation was performed; no feature engineering was necessary. The R for class 1 in BERT with the original data was not too low in comparison with the shallow machine learning results. It rose from 0.52 to 0.82 with the undersampling technique; this difference is statistically significant (Mann–Whitney U test: p < 0.05). However, for the negative class, recall decreased from 0.98 to 0.86. To achieve more balance between the two classes, we used 3X augmentation + undersampling (see Table 10).

Table 10 Results of the BERT model with original data, with 3 approaches to augmentation (see Table 3) using 3X and 9X (Table 2) with and without undersampling and with undersampling without augmentation. Underline: best R for class 1 (Urgent), Bold: best performance, balancing between class 1 (Urgent) and class 0 (Non-urgent) in the UNITE dataset

Hence, the best classifier performance on the UNITE dataset is BERT with the ‘approach 3 with 3X augmentation + undersampling’.

To verify the effectiveness of the different data manipulation techniques to deal with the imbalanced data problem, we utilised the same methods on the Stanford dataset. In these experiments, we limited augmentation to 3X only, since 9X would have generated more instances in the minority class than in the majority class. We also applied only ‘approach 3’ (see Table 3), which provided the best performance for the (3X augmentation + undersampling) technique on the UNITE dataset.

4.1.3 BERT on the Stanford dataset

Table 11 shows the results of BERT on the Stanford dataset. We obtained similar results to those for the UNITE data; the only difference lies in the performance of the two techniques with 3X augmentation, with and without undersampling. This is possibly because the distribution of non-urgent cases differs between the two datasets (see Figs. 8, 9): as shown in Fig. 9, for the Stanford dataset the distribution of non-urgent cases for 3X is almost the same as for (3X + undersampling).

Table 11 Results of the BERT model with original data, with 3 approaches to augmentation (see Table 3) using 3X (Table 2) with and without undersampling and with undersampling without augmentation. Underline: best R for class 1 (Urgent), Bold: best performance, balancing between class 1 (Urgent) and class 0 (Non-urgent) in the Stanford dataset

4.2 Adaptive intervention models

4.2.1 Basic adaptation scenario

In this scenario, depending on the ranking of urgent comments (the probability score given by the classification model), the aim is for the adaptive intervention model to automatically retrieve the most important urgent comments and reduce the number of comments that are read by the instructor. In this case, we used naïve Bayes with count vector, with approach #1 for 3X augmentation with undersampling (the best performance among the different approaches for naïve Bayes with count vector), as a case study. For example, if the time available is limited to reading 5 comments, then the model will retrieve only 5 comments. Table 12 presents the comparison between the basic model (all comments) and the adaptive model that selects just the top 5 urgent comments for the urgent class (1); the latter clearly outperforms the basic model on all evaluation criteria.

Table 12 Results of the Naive Bayes model with count vector as a feature engineering with first approaches to augmentation (see Table 3) using 3X with undersampling. First row: basic model with all data, second row: filtering model with top urgent 5 comments for the urgent class ‘1’ in the UNITE dataset

4.2.2 Expanded adaptation scenario

For the second scenario, we proposed an adaptation filtering model based on the number of learner comments. We use Pearson's correlation to calculate the correlation between the number of written comments per learner and the number of comments from those that require immediate attention. This process resulted in a strong correlation (0.65).

The results of the Fisher–Jenks algorithm for clustering learners are shown in Table 13. To obtain the two groups (urgent/non-urgent learners), we then merge clusters 1 and 2, to reflect the learners with a high number of comments, as these are significantly more communicative than learners in cluster 0.

Table 13 Clustering learners based on their number of comments

We remove the comments of the low-number-of-comments group (non-urgent learners) from each fold (using stratified four-fold cross-validation). The number of comments in the test set is shown in Table 14 for both the basic model, which contains all learners, and the filtering model, which only contains learners with a high number of comments (urgent learners). Hence, the number of comments in the filtering model is much lower than for the basic model; for example, in fold 1 it dropped from 1156 (basic) to 533 (filtering), reducing the number of comments the instructor needs to read. Thus, whilst the overall recall is somewhat reduced (by 11%), the load of the instructor is significantly (p << 0.5) reduced as well.

Table 14 Number of comments on Test Set; First row: basic models, second row: filtering models in the UNITE dataset

5 Error analysis

We conducted an in-depth re-analysis of our model to understand the reasons for the errors obtained in the test set in every fold. For this purpose, we manually inspected the mistakes that our best algorithm (BERT with text augmentation + undersampling) made on the UNITE data. Specifically, false negatives (FN), which the model categorised as non-urgent (although they are labelled as urgent), were considered to be the most critical errors, as our aim was to capture all urgent cases. To put the results, and especially the errors, in context, we compared the mispredictions of the classifier with human-level performance for the different folds (using stratified k-fold cross-validation, choosing a value of k = 4 (4 folds), as explained in the methodology, Sect. 3.2.2). The results are shown in Table 15.

Table 15 FN results for the best algorithm versus disagreement between human annotators

From the results, we found that most of the FN cases were also mirrored in the disagreement between annotators (e.g. for fold 1, the human annotators also disagreed on 19 of the 23 false negatives misclassified by our classifier; see Table 15). This further supports the notion that decision-making among annotators is difficult, and that the more difficult cases are hard for both humans and classifiers to categorise; examples from each fold are shown in Table 16.

Table 16 Anonymised examples of FN results and disagreement between human annotators on UNITE data

From Table 16, we can better understand why humans and machine learning models struggle in certain cases. For example, in fold 1, the learner does not understand the diagram, but s/he is happy to provide a meaning and context for the words used in the analysis. Some annotators considered this comment non-urgent, because the learner did not request assistance; however, another annotator may find that the learner has difficulty understanding the concept. Such clashes may explain why the model was unable to detect the above-mentioned urgent cases.

6 Discussion, limitations and future work

In MOOC environments, detecting urgent cases is a critical issue. Due to the nature of MOOCs, urgent cases are rare compared to non-urgent ones, which leads to imbalanced data. A further issue is that intervention in past research (Almatrafi et al. 2018; Guo et al. 2019) follows a one-size-fits-all approach, without any personalisation based on the learners, in spite of long-standing personalisation research in education.

In this paper, our aim is to propose a solution for imbalanced MOOC data, and to adapt and improve system interaction by automating urgency detection, enabling an instructor to decide when to react and thus adapting the timing of interventions to the urgency detected in the learner. The potential beneficiaries are MOOC providers, then instructors on MOOCs, and then the learners.

The ultimate goal of this work is the last step, where we personalise (by automatic adaptation) the identification of urgent comments in MOOCs for instructors, our primary users. This means we are building ‘interactive computer systems that can be adapted or adapt themselves to their current users’, adapting to the needs of the instructors to manage their workload, as well as, indirectly, to the needs of our secondary users, the learners, to have their urgent messages identified (and, ultimately, answered).

Urgency of intervention is an interesting area, raising the question of how dependent urgency is on the learner: for instance, it is possible that another learner has already dealt with the urgent question, so showing the instructor the full thread to inspect is also useful. Here, we look first at fine-grained learner modelling, where we consider each comment as a feature of a learner that, if urgent, needs to be dealt with on its own. Next, as research has revealed correlations between urgency and number of comments, showing that learners posting urgent comments are likely to post many of them and can hence be classified, at the macro-scale, as ‘urgent learners’ (Alrajhi et al. 2021), we also propose coarse-grained learner modelling, where learners are grouped as either urgent or non-urgent learners. Urgent learners would need to be treated with priority by the instructors.

Modelling learners based on comments only is a simplification, as any model is a simplification of the world. However, we believe that learners' comments can provide insight into some of their characteristics and needs. For instance, the language of a comment can show anxiety, a certain level of background knowledge, or impatience, thus covering various learner model variables. It is, however, possible that some learners are missed this way; moreover, learners who refuse to engage with comments will not be identified via these methods.

Learner models can contain several parameters and be simpler or richer; they can reflect various aspects of a learner, often including parameters such as the current level of confusion, motivation and understanding. Interventions to reduce learner drop-out from the MOOC could include changing the difficulty or type of problems, referring the learner to modules covering missing prerequisite knowledge, peer referrals, encouraging communication, etc. In this paper, we add to this rich tapestry of user model dimensions by extracting urgency directly from user comments, which has not been considered before in user modelling. Importantly, we consider comment-based user modelling rich, in the sense that comments may reflect various aspects of a learner (boredom, interest, knowledge, fluency, etc.). Our learner model can be used by itself, or in conjunction with other user parameters (if known), thus further enriching the user model; this, however, does not detract from the merit of the parameters we introduce with our approach.

The main limitation is that the automatic classification in general, and the potential solution for imbalanced data in particular, may not generalise to all online courses and platforms, as it has been applied to only one specific course on FutureLearn. However, as shown in Sect. 3, we have further validated our best solution on the highly popular and well-used Stanford dataset, thus strengthening the case for the generalisability of our approach and its applicability across courses and domains.

There are numerous opportunities for future work, such as exploiting other features, like the number of posts in a thread; while this may not directly tell us whether an individual post is urgent, we could analyse the numbers per topic, per learner, etc. Such a multitude of posts may, however, only reflect a very popular topic or a very prolific learner. Interestingly, analysing the correlation between FutureLearn ‘likes’ of posts and their urgency showed no relation between them; moreover, as not all posts have ‘likes’, our current approach is more generalisable. A lexicon-based urgency method, based on identifying key terms (keywords or n-grams) that could indicate urgency, may also be considered; however, current deep learning methods are known to outperform lexicon-based ones. Another interesting approach would be to increase the priority of urgent cases based on the number of learners involved, i.e. if an issue is raised by many learners, it could be considered of higher urgency. In addition, peer reactions could be taken into account, both in terms of declaring a problem solved (i.e. a peer has answered it) and, where they generate a flurry of responses, in flagging an issue as very urgent (as many in the class may struggle with the same problem).

Additionally, detecting learner affective states may allow (artificial) instructors to adapt their support to those states. Furthermore, labelling data based on sentiment analysis, especially from a confusion/frustration perspective, as a supervised or, more interestingly, an unsupervised method, may be a cheaper way to detect urgency; however, it may raise challenges in terms of accuracy. Our data were already labelled for urgency, regardless of the original cause (frustration, lack of knowledge, change in circumstances, etc.). Thus, we believe our approach to be more generic, as it encompasses all these reasons as well as other latent causes. Finally, please note that urgency is a relative concept; we have addressed some of these aspects in this paper within its definition (Sect. 3.1.1), but further work could look into refining or specialising that definition.

Finally, whilst our results are very specific to MOOC comment analysis, our techniques may serve as a template for other similar NLP classification tasks using machine learning with severely skewed datasets.

7 Conclusion

On MOOC platforms, deciding the right moment for instructor intervention is an important challenge to be overcome to better support learners and lower drop-out rates. Building an automated model to detect comments that require urgent intervention represents a promising solution to this problem. However, the available comment datasets naturally contain only a few urgent cases, leading to imbalanced data, which explains the difficulty in creating models to detect such cases accurately. In this work, we analysed and compared three techniques (text augmentation, text augmentation + undersampling, and undersampling) to improve the quality of such data. We also provided several new pipelines incorporating different text augmenters. Our results show that an increase in model performance can be obtained via undersampling, and that a combination of text augmentation + undersampling achieves the best performance in balancing the two classes.

These results help in retrieving the most urgent comments for instructors. To show how this can be applied, we have illustrated it with two adaptive models, based on two types of user models: (1) personalised instructor intervention based on a fine-grained learner model and (2) filtering results based on a coarser-grained learner model.

We further inspected wrongly classified urgent instances and found that the problem does not simply lie with the classifier: it also stems from the data, which humans also find difficult to annotate. This indicates that the difficulties faced by human annotators in classifying such comments are also faced by the models.

Additionally, whilst the majority of previous work on instructor intervention was based on the Stanford corpus, in this research we used the FutureLearn platform, with a total of 5790 comments annotated by human experts, to form the new UNITE corpus.