
A framework to improve churn prediction performance in retail banking

Abstract

Managing customer retention is critical to a company’s profitability and firm value. However, predicting customer churn is challenging. The extant research on the topic mainly focuses on the type of model developed to predict churn, devoting little or no effort to data preparation methods. These methods directly impact the identification of patterns, increasing the model’s predictive performance. We addressed this problem by (1) employing feature engineering methods to generate a set of potential predictor features suitable for the banking industry and (2) preprocessing the majority and minority classes to improve the learning of the classification model pattern. The framework encompasses state-of-the-art data preprocessing methods: (1) feature engineering with recency, frequency, and monetary value concepts; (2) oversampling with the adaptive synthetic sampling (ADASYN) algorithm to address the imbalanced dataset issue; and (3) undersampling with the NEARMISS algorithm. After data preprocessing, we use XGBoost and elastic net methods for churn prediction. We validated the proposed framework with a dataset of more than 3 million customers and about 170 million transactions. The framework outperformed alternative methods reported in the literature in terms of precision-recall area under curve, accuracy, recall, and specificity. From a practical perspective, the framework provides managers with valuable information to predict customer churn and develop strategies for customer retention in the banking industry.

Introduction

The strategy of traditional banks has always been centered on their products and services, prioritizing internal practices and processes and considering customers as the commercial target (Lähteenmäki and Nätti 2013). However, the advancement of computing power and the reduction in its costs have led to a significant transformation in the industry by rapidly implementing new and highly competitive financial products and services (Feyen et al. 2021). Competitiveness has grown sharply as new financial technologies (e.g., FinTech) have risen (Murinde et al. 2022). Fintech is recognized as one of the most significant technological innovations in the financial sector and is characterized by rapid development (Kou et al. 2021a). Digital innovations coupled with abundant data have made it possible to devise new businesses and financial services, placing customers at the center of marketing decisions (Pousttchi and Dehnert 2018). In this scenario, managing customer retention is critical to avoid the defection of valuable customers (Mutanen et al. 2010). Reichheld and Sasser (1990) pointed out that reducing churn by 5% might increase a bank’s profits from those customers by 85%.

Churn prediction is key in churn management programs (Ascarza 2018). Thus, customer churn prevention is a strategic issue that allows companies to monitor customer engagement behavior and act when engagement decreases (Gordini and Veglio 2017). Additionally, as the costs of acquiring new customers are typically higher than retaining existing ones, developing reliable strategies for churn management is critical for the sustainability of companies (Lemmens and Gupta 2020). However, managing customer churn is a challenging task for marketing managers (Ascarza and Hardie 2013).

Therefore, predictive models tailored to identify potential churners are fundamental in supporting managerial decisions (Zhao et al. 2022). Several models have been proposed to predict churn (Benoit and Poel 2012; He et al. 2014; Gordini and Veglio 2017), estimate the churn probability of each customer, and assess the accuracy of these predictions in the holdout sample. However, most of these methods focus on the type and mathematical properties of the model used to predict churn. While this concern is legitimate and important, less effort is devoted to data preprocessing, which comprises a set of tasks performed before training the models to improve the predictive performance (Jassim and Abdulwahid 2021). Furthermore, one of the main challenges of customer churn prediction (CCP) is training binary classification models when the churn event is rare (Zhu et al. 2018), as is the case in the retail banking industry (see Mutanen et al. 2010 and Keramati et al. 2016). The low proportion of the rare event compared with the majority class (e.g., retained clients) can create pattern gaps, making it hard for predictive models to identify a customer who is likely to churn. In extreme cases, a classification model may classify all customers as retained, mispredicting churns (Sun et al. 2009; Megahed et al. 2021). For instance, suppose there is a company with 98% of retained customers and 2% of churned customers and we want to predict churners. In this case, if we directly apply a commonly used learning algorithm, such as random forest or logistic regression, the outcome will probably be a high overall accuracy driven by the majority class (high specificity) and poor recall for the rare class.

Therefore, proposing strategies to deal with the problem of highly imbalanced datasets is a relevant topic. According to Megahed et al. (2021), a challenging task is to increase the number of instances of the rare class by collecting additional samples from the real-world process, so rebalancing methods are important. These methods are part of the broad concept of data preprocessing, which is a set of strategies carried out on raw data before applying a learning algorithm. Recent studies aim to increase CCP performance by testing multiple alternative techniques along the predictive modeling process, including outliers treatment, missing values imputation (MVI), feature engineering (FE), imbalanced dataset treatment (IDT), feature selection (FS), hyperparameters optimization, and testing different classification models. Although such approaches address the challenges of modeling churn behavior, they do not deal with scenarios where the churn event is rare (Zhu et al. 2018). Therefore, data analysts should better understand the data preprocessing idea to ensure that data manipulation does not change the underlying data structure and favors the classification technique applied to such data (Megahed et al. 2021).

This study proposes a novel framework for CCP in the banking industry, where churn events are persistently rare (Gür Ali and Arıtürk 2014). Rarity tends to jeopardize the performance of traditional techniques aimed at binary classification. Our objectives are as follows:

(1) Propose and validate a data preprocessing phase that combines different approaches for data preprocessing (FE focused on the retail banking context, IDT oversampling (IDT-over), and IDT undersampling (IDT-under));

(2) Perform and validate a model training and testing phase with state-of-the-art classification techniques (XGBoost and elastic net) while using Bayesian approaches for hyperparameter optimization;

(3) Evaluate and discuss the impact of different data preprocessing approaches to improve the predictive power of classification models.

Our study makes a valuable contribution to the research on decision support systems. It sheds light on the importance of conducting data preprocessing steps in scenarios where a substantial class imbalance is observed. To address this issue, rigorous tests were applied, confirming the practical value of the model. The deliverables include the following:

(1) A sequence of steps applied to binary classification problems characterized by a highly imbalanced class;

(2) A set of features created with superior capacity to predict churn in the banking industry, using recency, frequency, and monetary value concepts (RFM) in the FE stage;

(3) A framework with high predictive performance to anticipate churn events;

(4) An evaluation of feature importance to assist managers in designing more effective measures to prevent churn.

The rest of the paper is organized as follows. “Literature review” section reviews the literature; “Proposed framework for CCP” section describes our framework; “Data and empirical context” section presents the tests with the proposed framework; “Results and discussion” section discusses the results; and, finally, “Conclusions” section concludes.

Literature review

In this section, we present a literature review on the fundamentals of the techniques employed in the proposed framework.

CCP

Financial institutions have been challenged to develop new methods for mining and interpreting customers’ data to help them understand their needs (Broby 2021). Broby (2021) also reported that the advancement of technology has reduced information asymmetry between banks and their clients, boosting competition and making customer retention management increasingly important. Similarly, Livne et al. (2011) examined the link between customer relationships and financial performance at the firm level in the context of US and Canadian wireless (cellular) firms. They found a positive association between customer retention and future revenues. They also suggested that customer retention and the level of service usage play an important mediating role in the relationship between investments in customer acquisition and financial performance. These results indicate the importance of generating and retaining customer loyalty to drive financial performance.

Given the relevance of retaining customers, churn prevention practices have gained significance. The development of churn prediction models is an effective alternative to potentiate customer retention efforts. According to Broby (2022), statistical and computational models of machine learning such as CCP models are increasingly being integrated into decision support systems in the financial area. Thus, many studies have been conducted to determine effective classification models for churn prediction, especially when using data from financial sources (Lahmiri et al. 2020). However, modeling financial data is complicated due to the presence of multiple latent factors that can evolve and are usually correlated and even autocorrelated (Li et al. 2022). Table 1 presents some studies that focused on the banking industry to exemplify the methods commonly used to preprocess the dataset and the classification models adopted. In the table, we also compare our study with the other studies.

Table 1 Previous research assessing churn prediction models (specifications of our study are presented in the bottom line for comparison purposes)

The extant studies typically focused on comparing the performance of predictive models using different fundamentals and the managerial benefits of churn prediction on customer management. However, data preprocessing methods and approaches for hyperparameter optimization are scarce in the CCP literature. Usually, authors that described data preprocessing adopted traditional practices already established in previous studies, including random undersampling (RU) (Benoit and Poel 2012; He et al. 2014; Gordini and Veglio 2017), MVI (Lemmens and Croux 2006), and outlier detection and elimination (Zhao and Dang 2008; Keramati et al. 2016). Most studies, except that by Geiler et al. (2022), have not analyzed how predictive performance can benefit from employing more robust data preprocessing techniques, especially when a substantial imbalance between classes is verified (a common issue in CCP).

There are noticeable differences between previous studies and our propositions. To the best of our knowledge, our study is the first to propose a complete framework for churn prediction that encompasses a sequence of preprocessing steps prior to the classification task. Additionally, our framework relies on more sophisticated algorithms for the IDT-over and IDT-under steps, enabling a performance gain due to the synergy between these steps. Our study also differs from those of Lemmens and Croux (2006), He et al. (2014), Farquad et al. (2014), and Geiler et al. (2022) in terms of the distribution of the features—both IDT-over and IDT-under did not modify the sample distribution significantly (as verified by KS tests) because the number of added artificial instances was controlled. This is desirable because it does not change the relationship between features. Another difference is that our framework does not devote a step to FS, as seen in the studies by Farquad et al. (2014) and Keramati et al. (2016), because the classification techniques employed perform this task through an embedded approach. Finally, a weak aspect of the proposed framework is that the FE step depends on managers’ ability to analyze the business context and create meaningful predictive features.

Data preprocessing methods

For Sammut and Webb (2010), data preprocessing aims at transforming raw data into useful information. The predictive performance of a modeling framework increases by adopting data preprocessing stages before proceeding to the model training. García et al. (2014) characterized this phase as performing data selection and cleaning tasks. Additionally, for Pyle (1999), this phase comprises more than 80% of the modeling effort, encompassing activities beyond cleaning and selection. In our study, we adopted three preprocessing methods—FE, IDT-over, and IDT-under—which are described in more detail in the following sections.

FE

FE is performed to transform the raw data into a new format that favors modeling and contributes to performance gains (Khoh et al. 2023). It involves the initial discovery of features and their stepwise improvement based on domain knowledge to adjust the data representation (Kuhn and Johnson 2019). Mathematical operations, interactions between attributes, matrix decomposition, discretization, and binarization are typical FE operations performed regardless of the business context. However, domain-specific knowledge, such as the business context and customer behavior, can also guide the creation of new features (Zheng and Casari 2018). According to Ascarza et al. (2018), customer retention involves the sustained continuity of customer transactions with a company. In this sense, RFM variables are built from transactional data and can enhance prediction performance by summarizing customer behavior, making them natural churn predictors (Fader et al. 2005). Similarly, Kou et al. (2021b) used transaction data to enhance bankruptcy prediction in the banking industry.

IDT-over

In binary classification modeling, it is common to have a rare class (Megahed et al. 2021). In extreme cases, class rarity prevents a sufficiently delimited decision boundary between the two groups (Weiss 2004). Classification problems such as bank fraud detection, infrequent medical diagnoses, and oil spill recognition in satellite images are well-known examples of rare classes (Galar et al. 2012). The imbalance between categories can make predictive classification modeling unfeasible. IDT-over methods try to circumvent this by generating synthetic instances that reinforce the boundary between classes (Triguero et al. 2012).

Researchers have used IDT-over methods in two ways—to create a new representation of the data (dismissing the original data completely) and to oversample (creating artificial instances of the rare class) (Sun et al. 2009). One popular oversampling algorithm is the synthetic minority oversampling technique (SMOTE), which assembles synthetic examples of the infrequent category (Fernandez et al. 2018). With a similar purpose, the adaptive synthetic sampling (ADASYN) algorithm searches for cases that are more difficult to discriminate between classes and generates synthetic instances to reinforce them (He et al. 2008). The algorithm uses a k-nearest neighbor (KNN) procedure to add new artificial examples of the minority class, making it more distinct from the majority class. The objective of ADASYN is to identify a weighted distribution of the rare class, considering its learning difficulty (He et al. 2008). It uses rare class instances very close to the majority class to generate other cases, reinforcing points in the boundary between the two groups. Recent studies have demonstrated that ADASYN outperforms SMOTE (Dey and Pratap 2023). These findings highlight ADASYN as a robust oversampling technique to handle imbalanced data scenarios.

IDT-under

IDT-under methods remove instances from the majority class to balance the groups (Fernandez et al. 2018). The techniques range from simple RU to more refined algorithms. However, procedures such as RU do not guarantee the retention of instances that maximize the identification of patterns, failing to select significant examples (He and Ma 2013). Therefore, Lin et al. (2017) proposed the CLUS algorithm to maximize the heterogeneity of the majority class instances by reducing redundancies. Similarly, Zhang and Mani (2003) proposed the NEARMISS algorithm.

The NEARMISS algorithm uses a KNN-based method to identify and remove instances with noisy patterns or those already represented in the data (redundant) (Zhang and Mani 2003). The procedure selects majority class instances with large average distances to the KNNs, keeping the majority class instances at the decision boundary (Bafna et al. 2023).

Classification algorithms

Researchers have developed several classification algorithms to predict a churn event. As a customer will either remain a customer or churn, binary classification models are a promising alternative to address this problem. We now present the fundamentals of the two state-of-the-art classification techniques we test in our framework—XGBoost and elastic net.

XGBoost

XGBoost is a tree-based ensemble method under the gradient-boosting decision tree framework, developed and implemented in the R programming language in the package “xgboost” (Chen et al. 2022). This method uses an ensemble of classification and regression trees (CARTs) to fit the training data samples, utilizing residuals to calibrate the previous model at each iteration toward optimizing the loss function with respect to a performance metric [e.g., precision-recall area under curve (PR-AUC)] (Chen and Guestrin 2016). At each iteration, XGBoost adds one CART that focuses on the subset of instances the current ensemble finds hardest to classify. To avoid overfitting in the calibration process, XGBoost adds a regularization term to the objective function, controlling the complexity of the model. It also combines first- and second-order gradient statistics to approximate the loss function before optimization (Chen and Guestrin 2016). Equation 1 presents the objective function of the XGBoost model at each iteration.

$$J\left({f}_{t}\right)\cong {\sum }_{i=1}^{n}\left[L\left({y}_{i}, {\widehat{{y}_{i}}}^{t-1}\right)+{g}_{i}{f}_{t}\left({\overrightarrow{x}}_{i}\right)+\frac{1}{2}{h}_{i}{{f}_{t}}^{2}\left({\overrightarrow{x}}_{i}\right)\right]+ \Omega \left({f}_{t}\right)$$
(1)

where \({\overrightarrow{x}}_{i}\) represents the ith instance in the training set (\({\overrightarrow{x}}_{i}\in {R}^{m}\), where \(m\) is the number of features); \({y}_{i}\) denotes the ith observed instance in the data set (\({y}_{i}\in R\)); \({\widehat{y}}_{i}^{t}\) symbolizes the prediction of the ith instance at the tth iteration; \({f}_{t}\) is the CART added in the tth iteration; \({g}_{i}\) and \({h}_{i}\) are the first and second derivatives of the loss function \(L\), respectively; and \(\Omega\) characterizes the regularization term described in Eq. 2 (Chen and Guestrin 2016):

$$\Omega \left({f}_{t}\right)=\gamma T+\frac{\lambda }{2}{\sum }_{j=1}^{T}{{\omega }_{j}}^{2}$$
(2)

where \(\gamma\) and \(\lambda\) are constants controlling the regularization process; \(T\) denotes the number of leaves in the tree; and \({\omega }_{j}\) is the score of each leaf. Representing the structure of a CART \(f\) as a function \(q:R^{m} \to \left\{ {1,2,3, \ldots ,T} \right\}\) mapping an observed data instance to the corresponding leaf index, we derive \(f\left({\overrightarrow{x}}_{i}\right)={\omega }_{q\left({\overrightarrow{x}}_{i}\right)}\), with \(\omega \in {R}^{T}\). Next, plugging Eq. 2 into Eq. 1, removing the constant terms \(L({y}_{i}, {\widehat{{y}_{i}}}^{t-1})\) that do not impact the optimization process, and defining \({I}_{j}=\left\{q\left({\overrightarrow{x}}_{i}\right)=j\right\}\) as the instance set of leaf \(j\), we can rewrite the objective function at each iteration as Eq. 3 (Chen and Guestrin 2016).

$$J\left({f}_{t}\right)\cong {\sum }_{j=1}^{T}{\omega }_{j}\left[{\sum }_{i\in {I}_{j}}{g}_{i}+\frac{{\omega }_{j}}{2}\left(\lambda +{\sum }_{i\in {I}_{j}}{h}_{i}\right)\right]+ \gamma T$$
(3)

For a fixed tree structure \(q\), the optimal weight of leaf \(j\) (\({{\omega }_{j}}^{*}\)) is obtained simply by deriving Eq. 3 with respect to \({\omega }_{j}\) and equating it to zero, leading to Eq. 4 (Chen and Guestrin 2016).

$${{\omega }_{j}}^{*}=-\frac{{\sum }_{i\in {I}_{j}}{g}_{i}}{\lambda +{\sum }_{i\in {I}_{j}}{h}_{i}}$$
(4)
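For completeness, Eq. 4 follows from the first-order condition of Eq. 3 with respect to \({\omega }_{j}\):

$$\frac{\partial J\left({f}_{t}\right)}{\partial {\omega }_{j}}={\sum }_{i\in {I}_{j}}{g}_{i}+{\omega }_{j}\left(\lambda +{\sum }_{i\in {I}_{j}}{h}_{i}\right)=0,$$

which, solved for \({\omega }_{j}\), yields the optimal leaf weight in Eq. 4.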

The optimal value of the objective function is Eq. 5 (Chen and Guestrin 2016).

$${J}^{*}\left({q}_{t}\right)=-\frac{1}{2}{\sum }_{j=1}^{T}\frac{{({\sum }_{i \in {I}_{j}}{g}_{i})}^{2}}{\lambda +{\sum }_{i \in {I}_{j}}{h}_{i}}+ \gamma T$$
(5)

Equation 6 is used as a scoring function to gauge the tree structure quality. High gain scores denote tree structures that better discriminate observations. As it is impossible to enumerate all possible tree structures, a greedy algorithm starting from a single leaf and iteratively adding branches to the tree is used. Letting \({I}_{L}\) and \({I}_{R}\) represent the instance sets of the left and right nodes, respectively, after a split (\(\mathrm{i}.\mathrm{e}., I={I}_{L}\cup {I}_{R}\) is the instance set of the split node before splitting it), the new tree has \(T+1\) leaves with the same structure, except around the split node.

$$Gain = \frac{1}{2} \left[ \frac{{\left({\sum }_{i\in {I}_{L}}{g}_{i}\right)}^{2}}{\lambda +{\sum }_{i\in {I}_{L}}{h}_{i}}+\frac{{\left({\sum }_{i\in {I}_{R}}{g}_{i}\right)}^{2}}{\lambda +{\sum }_{i\in {I}_{R}}{h}_{i}}-\frac{{\left({\sum }_{i\in I}{g}_{i}\right)}^{2}}{\lambda +{\sum }_{i\in I}{h}_{i}}\right]- \gamma$$
(6)

This process quantifies how a given node split compares to the previous structure. When the \(Gain\) of the best candidate split is negative, the algorithm stops growing that branch. The XGBoost algorithm prunes the features that do not contribute to the prediction (embedded FS) and generates a feature importance ranking (\(Gain\)) based on the relative contribution of each attribute to the model, as depicted in Eq. 6.
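For illustration, a minimal sketch of this workflow with the R package “xgboost” is shown below. The feature and label names are hypothetical, and the hyperparameter values are placeholders rather than the tuned values reported later in the paper.

library(xgboost)

# 'train' is assumed to hold the preprocessed features plus a 0/1 'churn' label
X <- as.matrix(train[, setdiff(names(train), "churn")])
dtrain <- xgb.DMatrix(data = X, label = train$churn)

params <- list(objective = "binary:logistic",  # two-class churn problem
               eval_metric = "aucpr",          # precision-recall AUC
               eta = 0.1,                      # learning rate (placeholder)
               max_depth = 6,                  # maximum tree depth (placeholder)
               gamma = 0,                      # minimum loss reduction to split
               subsample = 0.8)                # share of data per boosting round

bst <- xgb.train(params = params, data = dtrain, nrounds = 200)

# Gain-based importance (Eq. 6); features with no contribution are pruned
head(xgb.importance(model = bst), 10)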

Elastic net

Friedman et al. (2010) developed and implemented fast algorithms for fitting generalized linear models with elastic net penalties in the R programming language package “glmnet.” We adopted the two-class logistic regression. It considers a response variable \(Y\in \left\{-1,+1\right\}\) and a vector of predictors \(\overrightarrow{x}\in {R}^{m}\) (\(m\) is the number of features), representing class-conditional probabilities through a linear function of the predictors (see Eq. 7):

$$P\left(Y=-1|\overrightarrow{x}\right)=\frac{1}{1+{e}^{-\left({\beta }_{0}+{\overrightarrow{x}}{\prime}\beta \right)}}=1-P\left(Y=+1|\overrightarrow{x}\right),$$
(7)

which is equivalent to stating that

$$log \frac{P\left(Y=-1|\overrightarrow{x}\right)}{P\left(Y=+1|\overrightarrow{x}\right)} ={\beta }_{0}+{\overrightarrow{x}}{\prime}\beta$$
(8)
$$P\left(Y=-1|\overrightarrow{x}\right)+P\left(Y=+1|\overrightarrow{x}\right)=1$$
(9)

The model fit occurs by regularized maximum (binomial) likelihood. Let \(P\left({\overrightarrow{x}}_{i}\right)\) be the probability for the ith instance of the training set at particular values of the parameters \({\beta }_{0}\) and \(\beta\). \({\overrightarrow{x}}_{i}\in {R}^{m}\) represents the ith instance of the training set, and \({y}_{i}\) is the ith observed example in the data set (\({y}_{i}\in \left\{-1,+1\right\}\)). We maximize the penalized log-likelihood function as follows:

$$\frac{1}{n}{\sum }_{i=1}^{n}\left\{I\left({y}_{i}=-1\right)\log P\left({\overrightarrow{x}}_{i}\right) +I\left({y}_{i}=+1\right)\log \left[1-P\left({\overrightarrow{x}}_{i}\right)\right] \right\}-\lambda {PF}_{\alpha }\left(\beta \right)$$
(10)

where \(I\left(\cdot \right)\) is the indicator function (note that \(I\left({y}_{i}=-1\right)+I\left({y}_{i}=+1\right)=1\)), and \({PF}_{\alpha }\left(\beta \right)={\sum }_{k=1}^{m}\left[\frac{1}{2}\left(1-\alpha \right){{\beta }_{k}}^{2}+\alpha \left|{\beta }_{k}\right|\right]\) is the elastic net penalty factor (Zou and Hastie 2005), representing a compromise between the ridge regression penalty (\(\alpha =0\)) and the lasso penalty (\(\alpha =1\)). We can rewrite Eq. 10 more explicitly as follows:

$$\frac{1}{n}{\sum }_{i=1}^{n}\left\{I\left({y}_{i}=-1\right)\left({\beta }_{0}+{{\overrightarrow{x}}_{i}}{\prime}\beta \right)-\log \left[1+{e}^{{\beta }_{0}+{{\overrightarrow{x}}_{i}}{\prime}\beta }\right] \right\}-\lambda {\sum }_{k=1}^{m}\left[\frac{1}{2}\left(1-\alpha \right){{\beta }_{k}}^{2}+\alpha \left|{\beta }_{k}\right|\right]$$
(11)
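In practice, Eq. 11 is maximized by the coordinate descent routines in “glmnet”. The sketch below is illustrative only: the object names are hypothetical, and alpha is fixed here for brevity, whereas the framework tunes both alpha and lambda in Step 4.

library(glmnet)

x <- as.matrix(train[, setdiff(names(train), "churn")])  # predictor matrix
y <- factor(train$churn)                                  # two-class response

# Elastic net logistic regression: alpha mixes ridge (alpha = 0) and lasso (alpha = 1)
fit <- glmnet(x, y, family = "binomial", alpha = 0.5)

# Cross-validation over lambda for the fixed alpha
cv <- cv.glmnet(x, y, family = "binomial", alpha = 0.5, type.measure = "auc")

# Churn probabilities for new customers (x_new is hypothetical) at the selected penalty
p_churn <- predict(fit, newx = x_new, s = cv$lambda.min, type = "response")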

Bayesian hyperparameters optimization

In both algorithms, the hyperparameter setup is a critical stage to avoid overfitted models. The process may rely on optimization routines, such as Bayesian hyperparameter optimization (Victoria and Maragatham 2021). This method uses a probabilistic model to find hyperparameter values that maximize the adopted performance metric (e.g., PR-AUC). The process is iterative, meaning the algorithm learns to reach optimal or suboptimal hyperparameter values at each iteration (Snoek et al. 2012; Kuhn 2022).

Proposed framework for CCP

The proposed framework for CCP comprises two phases—data preprocessing and model training and testing (Fig. 1). We explain each one in the following subsections.

Fig. 1
figure 1

Framework for CCP

Phase 1: data preprocessing

In this phase, we apply the three data preprocessing techniques (see “Data preprocessing methods” section) to prepare the dataset for model training. Each preprocessing procedure corresponds to an operational step, as detailed below.

Step 1: FE preprocessing

Most churn models use features that are not readily available in the dataset. Moreover, the original feature set may not be sufficiently informative to predict customer churn. Therefore, creating new features through FE is critical to expanding the number of potential predictor candidates.

In the first step of the proposed framework, we create new features based on RFM, which summarize customers’ purchasing behavior (Fader et al. 2005; Zhang et al. 2015) and are strongly related to churn behavior in the banking industry. Recency is the time of the most recent purchase; frequency corresponds to the number of prior purchases; and monetary value depicts the average purchase amount per transaction (Fader et al. 2005; Zhang et al. 2015; Heldt et al. 2021).

In our propositions, we considered both the traditional RFM approach (Fader et al. 2005) and RFM disaggregated per product category (RFM/P) (Heldt et al. 2021). We used credit cards, overall credits, and investments as categories. We proposed new features based on the original data and aligned with the recency concept, comprising the overall recency, recency per product category, and recency per channel. We also proposed new features derived from frequency, overall or per product category. We registered the number of periods with at least one transaction. In each period, we recorded the total number of transactions, the overall binary indicator of purchase incidence, the binary indicator of purchase incidence per product category, and the binary indicator of using a specific channel. Finally, consistent with the monetary value concept, we proposed new features, such as the overall contribution margin, overall revenue, and overall value transacted per product category.
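To make the construction concrete, the sketch below derives the three RFM building blocks per customer and product category with “dplyr”. It assumes a hypothetical transaction log with columns customer_id, product_category, month_id, and amount; the features in Table 3 extend this same pattern.

library(dplyr)

reference_month <- max(transactions$month_id)   # end of the training window

rfm_per_product <- transactions %>%
  group_by(customer_id, product_category) %>%
  summarise(
    recency   = reference_month - max(month_id),  # months since the last purchase
    frequency = n_distinct(month_id),             # periods with at least one transaction
    monetary  = mean(amount),                     # average amount per transaction
    .groups = "drop"
  )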

Step 2: IDT-over preprocessing

Aimed at data balancing, in the second step of the framework, we generated synthetic churned customers that act on the gaps in the decision boundary between the classes using the ADASYN algorithm. Such artificial churned customers must not change the data distribution, so we checked for significant differences between the original rare class distribution and the augmented distribution using the Kolmogorov–Smirnov (KS) test. The KS test is a well-known nonparametric test that compares the distance between two empirical cumulative distribution functions. Its null hypothesis is that both samples come from the same distribution.
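This check maps directly onto the base-R ks.test function, applied feature by feature to the rare class before and after oversampling (the vector names below are hypothetical):

# recency of the original churned customers vs. original plus ADASYN-generated ones
ks.test(credit_recency_original, credit_recency_after_adasyn)

# a p value above 0.05 indicates no significant distributional change,
# so the synthetic instances can be kept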

Step 3: IDT-under preprocessing

After adding artificial churned customers, the third step removes noisy and redundant retained customers by applying the NEARMISS algorithm. We did this to equalize the number of remaining retained and churned customers. Once again, we used the KS test to ensure that data removal did not modify the data distribution.

The ADASYN algorithm inserts artificial instances of the rare class in regions where examples of this class are closer to the majority class, improving the distinction between classes. In turn, the NEARMISS algorithm removes majority class instances closer to the rare class (i.e., instances that are hard to classify) to reduce noise and redundancy. The proposed framework first employs the ADASYN algorithm to reinforce the frontier between classes. Applying ADASYN first provides NEARMISS with more precise information on the best candidate instances of the majority class to remove in the subsequent step. We used the package “themis” in the R programming language (Hvitfeldt 2022) to implement NEARMISS and ADASYN.
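A minimal sketch of Steps 2 and 3 with the “themis” recipe steps follows. It assumes the outcome churn is stored as a factor and the predictors are numeric; the over_ratio and under_ratio values are placeholders, since in the framework the amount of oversampling is calibrated with the KS test rather than fixed a priori.

library(recipes)
library(themis)

idt_recipe <- recipe(churn ~ ., data = train) %>%
  step_adasyn(churn, over_ratio = 0.25, neighbors = 5) %>%   # Step 2: IDT-over
  step_nearmiss(churn, under_ratio = 1, neighbors = 5)       # Step 3: IDT-under

balanced_train <- bake(prep(idt_recipe), new_data = NULL)
table(balanced_train$churn)   # roughly 50/50 after both steps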

Figure 2 illustrates the IDT preprocessing stages through a hypothetical example, with the original churned customers representing 20% of the data and the retained customers representing 80%. ADASYN added 5% of artificial churned customers, and NEARMISS removed 65% of the retained customers. The result is a balanced dataset comprising 50% churned and 50% retained customers.

Fig. 2
figure 2

Illustration of the IDT stages with a hypothetical example

Phase 2: model training and testing

In the model training and testing phase, we first carry out Step 4, using the R package “tune” for Bayesian hyperparameter optimization (Kuhn 2022) of the classification techniques. Using XGBoost, we optimized (1) the number of trees in the ensemble, (2) the minimum number of data points in a node required for further splitting, (3) the maximum depth of each tree, (4) the rate at which the boosting algorithm adapts from iteration to iteration, (5) the reduction in the loss function required for further splitting, and (6) the amount of data exposed to the fitting routine. Then, using the elastic net, we optimized alpha (the mixture parameter between a pure ridge model and a pure lasso model) and lambda (the regularization penalty). The optimization algorithm maximized the average PR-AUC on a tenfold cross-validation procedure.
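In the tidymodels interface to which “tune” belongs, the six XGBoost hyperparameters above correspond to the boost_tree() arguments trees, min_n, tree_depth, learn_rate, loss_reduction, and sample_size. The sketch below is a hedged outline of Step 4 with hypothetical object names, not the exact search settings used in the study.

library(tidymodels)

xgb_spec <- boost_tree(
  trees = tune(), min_n = tune(), tree_depth = tune(),
  learn_rate = tune(), loss_reduction = tune(), sample_size = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
# elastic net counterpart: logistic_reg(penalty = tune(), mixture = tune()) %>% set_engine("glmnet")

wf    <- workflow() %>% add_model(xgb_spec) %>% add_formula(churn ~ .)
folds <- vfold_cv(balanced_train, v = 10)   # tenfold cross-validation

bayes_res <- tune_bayes(wf, resamples = folds,
                        metrics = metric_set(pr_auc),   # maximize average PR-AUC
                        initial = 10, iter = 25)        # iteration budget is a placeholder
select_best(bayes_res, metric = "pr_auc")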

In the fifth step, we estimated the model parameters for the XGBoost and elastic net classification algorithms using the hyperparameters selected by the Bayesian optimization in Step 4. We conducted the procedure on the samples in the training set.

In the sixth step, we assessed the model’s predictive performance using the following metrics: (1) accuracy, which is the proportion of churned and retained customers correctly classified; (2) specificity, which depicts the proportion of correctly classified retained customers against the total retained customers; (3) PR-AUC, which represents the area under the curve of a plot with recall and precision on the axes; and (4) recall, which denotes the model’s capacity to correctly identify churned customers (Sofaer et al. 2019).
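Treating churn as the positive class, the threshold-based metrics can be computed directly from a confusion matrix, as in the base-R sketch below (truth and prediction are hypothetical factors); PR-AUC additionally requires the predicted churn probabilities and can be obtained, for example, with yardstick::pr_auc.

# truth and prediction are factors with levels c("churn", "retained")
cm <- table(truth = truth, prediction = prediction)

tp <- cm["churn", "churn"];    fn <- cm["churn", "retained"]
fp <- cm["retained", "churn"]; tn <- cm["retained", "retained"]

accuracy    <- (tp + tn) / sum(cm)   # churned and retained correctly classified
recall      <- tp / (tp + fn)        # churned customers correctly identified
specificity <- tn / (tn + fp)        # retained customers correctly classified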

Data and empirical context

We used data from a major bank operating in Rio Grande do Sul State, Brazil. We selected 36 months of customer transaction history with the bank (December 2018 to November 2021), retrieving about 170 million transactions and more than 3 million customers. The class imbalance is high—2.25% of the customers belong to the minority class (churned customers), and the remaining customers belong to the majority class (retained customers). For more details, an exploratory data analysis is provided in “Appendix A”.

In this study, we restricted the scope of analysis to business-to-consumer relationships. The bank offers several financial services, including credit, investment, insurance, and private pension services. It operates under a contractual context, meaning that a churn event occurs when a customer closes all accounts with the bank.

The relational database relies on 23 tables with customer transaction information, including sociodemographic features, full transaction logs, product categorization, and online channel usage. We present the 32 original features in Table 2.

Table 2 Original features

Results and discussion

In this section, we present the framework validation considering different configurations of preprocessing stages and classification techniques.

Framework testing

Combining the three preprocessing methods (see “Phase 1: data preprocessing” section), we derive eight combinations: FE (yes or no) × IDT-over (ADASYN or none) × IDT-under (NEARMISS or RU). We compare all the combinations against a ninth alternative taken as a baseline with no preprocessing phase (no FE, no IDT-over, and no IDT-under), using only the 32 original features. Moreover, we combined all nine preprocessing choices with the two classification techniques examined (XGBoost or elastic net), deriving 18 configurations for testing. The 18 configurations are presented in the first five columns of Table 5.

The ADASYN algorithm does not determine a priori the optimal number of churned customers to synthesize. We want to add enough churned customers to reinforce the boundary between classes without compromising the sample’s representativeness. We therefore performed experiments, adding as many artificial churned customers as possible without identifying any significant difference between the empirical cumulative distributions of the rare class before and after ADASYN.

We also performed experiments to define the KNN parameter, as both ADASYN and NEARMISS rely on it. Numerical experiments suggest that \(k=5\).

The implementations used the R programming language version 4.2.1 (R Core Team 2022) package “themis” (Hvitfeldt 2022) for NEARMISS and ADASYN implementations, package “tune” for Bayesian hyperparameter optimization (Kuhn 2022), package “xgboost” (Chen et al. 2022) for XGBoost, and package “glmnet” (Friedman et al. 2010) for elastic net. Next, we describe the process of creating new features in the FE stage.

New predictor features through FE

We use aggregated RFM and RFM disaggregated per product category to create new predictor features. Table 3 presents the set of 28 features produced through the FE stage. The codes of the new features start at 33, continuing the numbering of the original features. Eight of these new features follow the traditional RFM approach (Fader et al. 2005), aggregating RFM-related variables for each customer. The remaining 20 features follow the RFM per product approach (Heldt et al. 2021), disaggregating RFM-related variables per product category for each customer.

Table 3 Created features

Purchase frequency counts the number of consecutive months in which a customer purchased any product (this counting process restarts after any month with no purchase). Purchase recency sums the number of months since the last purchase. We treat the remaining features in two ways based on their type. For nominal and binary features, we take the most recent value in the training time frame, converting it to dummy features using the one-hot encoding procedure. For the continuous features, we take the median of the last six periods.
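The consecutive-month counting rule can be implemented compactly with run-length encoding. The sketch below is illustrative; it assumes a hypothetical logical vector, ordered by month, indicating whether the customer purchased in each month of the training window.

purchase_streak <- function(purchased) {
  # purchased: logical vector ordered by month, TRUE if at least one purchase that month
  runs <- rle(purchased)
  n <- length(runs$values)
  if (runs$values[n]) runs$lengths[n] else 0L   # the streak restarts after a gap
}

purchase_streak(c(TRUE, FALSE, TRUE, TRUE, TRUE))   # returns 3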

In the following subsection, we analyze the impact of preprocessing stages and classification models on churn prediction performance. We also examine the influence of the IDT-over and IDT-under algorithms on the majority and rare class distributions.

Impact of IDT-over and IDT-under techniques on data distributions

Table 4 presents the number and proportion of customers in the IDT preprocessing stages. The final ratio between classes is 50% each.

Table 4 The proportion between classes in the IDT preprocessing stages

After applying the ADASYN and NEARMISS algorithms, we checked whether the resulting distributions of all features for both the majority and rare classes are similar to their original distributions using the KS test. We found no statistically significant difference for any pair of variables (before and after IDT-over and IDT-under), with p values ranging from 0.9 to 1. Figure 3 depicts the empirical cumulative distributions (original rare class × after ADASYN and original majority class × after NEARMISS) of two features—credit purchase recency and investment purchase recency. These features are among the most important for predicting churn behavior according to the feature importance index (Gain) from XGBoost (discussed at the end of this section).

Fig. 3
figure 3

Comparison among cumulative distributions from the rare and majority classes after applying IDT preprocessing stages

Figure 3a, c show that the artificially added churned customers did not modify the original distribution (p value 1.0), even though ADASYN does not use distributional information when generating instances. Figure 3b, d show the same result for the NEARMISS algorithm—the distributions of the original retained clients and of the customers kept after NEARMISS are not significantly different (p value 0.9), even though NEARMISS likewise does not use distributional information when removing instances.

Performance results

The predictive performance of the proposed framework (configurations C1 and C2 in Table 5) was tested against 16 alternative configurations by (1) using or not using FE, (2) using or not using ADASYN for IDT-over, and (3) using NEARMISS or RU for IDT-under. We alternate the classification techniques (i.e., XGBoost and elastic net) in such configurations to assess their performance on churn prediction. We trained and tested each of these 18 configurations using 100-fold cross-validation. Configurations C17 and C18, which undergo no preprocessing stages, serve as references. Table 5 presents the average PR-AUC, accuracy, recall, and specificity sorted by decreasing PR-AUC. “Appendix B” depicts a confusion matrix of the average of the 100-fold cross-validation procedure for the recommended configurations C1 and C2.

Table 5 Average of the 100-fold cross validation predictive performance for the 18 assessed configurations

Table 5 reveals that, on average, configurations C1 and C2 (i.e., FE, ADASYN, and NEARMISS coupled with XGBoost and elastic net, respectively) yielded the highest predictive performances across all performance metrics. It highlights that the recommended ADASYN and NEARMISS strategies outperformed all the benchmark models tested, including configurations relying on RU—an IDT-under algorithm usually reported in the literature.

These results corroborate the hypothesis that inserting new features based on a solid conceptual background, such as RFM and RFM/P, increases the performance of churn predictions. ADASYN, which inserted artificial churned customers precisely in the minority class regions that needed reinforcement, strengthened the decision frontier between the majority and minority classes. Similarly, NEARMISS, applied to the majority class, selected the least contributive customers to discard (instead of choosing them randomly), further reinforcing the decision frontier and improving the classification performance. We also find that the XGBoost classification technique significantly outperformed elastic net. Finally, configurations C17 and C18 reveal that a strongly imbalanced dataset with no preprocessing drops the classifier performance for both XGBoost and elastic net. Neither classifier detected any pattern of churned customers (zero recall); both classified all customers as retained, which accounted for a perfect specificity. Although the accuracies of C17 and C18 are the highest among all configurations, they are misleading as these configurations did not classify a single churned customer correctly. Figure 4 depicts the boxplots for all 16 configurations and the four performance metrics (we omit configurations C17 and C18 due to the effects of data imbalance on the assessed performance metrics). The boxplots depict a similar variability of all performance metrics in the 100 replicates, suggesting high stability of the assessed metrics even when the performance levels are low.

Fig. 4
figure 4

Boxplot for 16 configurations and metrics

Next, we statistically assessed how the different preprocessing steps and classification techniques impacted the performance metrics. We first employed multivariate analysis of variance (MANOVA) to test whether the preprocessing stages (FE, IDT-over, and IDT-under) and the classification model significantly affect the four predictive performance metrics. Before applying the MANOVA test, we tested the normality of the performance metrics using the Shapiro–Wilk test; only three out of the 64 performance averages did not pass the test (α < 0.05). Next, we carried out the MANOVA test. Table 6 presents the results.

Table 6 MANOVA test

The results indicate that all groups within the stages are different regarding the performance metrics. The preprocessing steps are informative for the predictive performance across all the metrics tested. The Pillai results for each stage indicate that FE had the highest impact on predictive performance (Pillai = 0.8115), followed by the classification model (Pillai = 0.5646), IDT-over (Pillai = 0.2739), and IDT-under method (Pillai = 0.2255).

Based on the MANOVA results, we conducted individual univariate ANOVA tests for each performance metric, as presented in Table 7. This was conducted to check the significance of each factor in each performance metric.

Table 7 Univariate ANOVAs

The univariate ANOVAs confirmed the significant differences when considering each performance metric separately. They also supported the higher impact on all performance metrics obtained from using FE relative to the other stages studied. Additionally, we assessed the statistical differences of the best-ranked configurations using C1 configuration (e.g., FE, ADASYN, NEARMISS, and XGBoost) as a reference. A post hoc analysis using pairwise t-tests revealed that the pairs C1–C2, C1–C3, C1–C4, and C1–C5 are all significantly different (α < 0.05) for all performance metrics.
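The statistical workflow above relies only on base-R functions. A sketch on a hypothetical results data frame (one row per configuration replicate, with the four metric columns and factor columns fe, idt_over, idt_under, model, and configuration) is shown below; the p value adjustment method is an assumption, as the paper does not specify one.

# Normality check per metric (Shapiro-Wilk)
shapiro.test(results$pr_auc)

# MANOVA with Pillai's trace over the four performance metrics
fit <- manova(cbind(pr_auc, accuracy, recall, specificity) ~
                fe + idt_over + idt_under + model, data = results)
summary(fit, test = "Pillai")

# Univariate ANOVAs, one per metric
summary.aov(fit)

# Post hoc pairwise comparisons of configurations on PR-AUC
pairwise.t.test(results$pr_auc, results$configuration, p.adjust.method = "bonferroni")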

Given the impact of the FE stage in the proposed framework, we extracted the feature importance (Gain, Eq. 6) for the C1 configuration to compare the contribution of the most relevant original and created features. Table 8 presents the 10 most important features, which together comprise 80% of the accumulated feature importance (see Footnote 1).

Table 8 10 most important features

Credit purchase recency is the topmost important feature, accounting for 46.23% of feature importance when predicting customer churn. This finding reinforces the conceptual foundations of customer behavior in the banking industry and the method used to support the proposition of this new feature. When acquiring credit products, clients sign contracts that usually last several months, so it is not surprising that a customer remains active for a while. Once the credit contract has ceased and recency starts to increase, the likelihood of defection also increases significantly, most likely because of low switching costs.

According to the traditional RFM concept, recency is more influential than frequency in defining the likelihood of a customer being active (Fader et al. 2005). Regarding the derived RFM/P concept, the same rationale is valid, with the only difference being the disaggregation per product category. In the present study, the relevance of credit purchase recency for customer retention in the banking industry confirms the findings of the existing literature.

Besides credit purchase recency, usage of mobile channels and investment purchase recency are also relevant for churn prediction using the recommended C1 configuration. It confirms the importance of features based on the RFM/P concept, which is consistent with the rationale that recency is critical to defining whether a customer is likely to remain active or not in the future.

Apart from the credit-related features, deposit- and investment-related features were also influential for churn prediction. Customers’ withdrawals from checking and investment accounts indicate possible churn events in the future. As recency measures the time since the last withdrawal, a smaller recency suggests that a customer is more likely to churn. The recency of customer interactions through mobile channels contributes in a similar way, helping to identify customers who might be making transactions with a competitor.

Overall customer income is also deemed one of the most relevant features by the proposed framework as it contributes to increasing the retention rate. Low-income customers are less likely to purchase investment products and more likely to default. Additionally, due to their low credit limits, such customers often acquire credit from different financial institutions.

The number of distinct products purchased was also relevant to predicting customer churn. This feature correlates with the concept of purchase frequency as it indicates that a customer has purchased at least one financial service in a period. However, it also encompasses information on how many different financial services a customer has purchased. Therefore, it reflects the success of cross-selling efforts made by sales teams. The relevancy of this feature to predict churn indicates that the higher the number of financial services a customer purchases, the less likely she is to churn.

We analyzed the impact of customer churn on a firm’s financial performance in terms of the profit loss that would be avoided by employing the proposed framework. We observed that 97.25% of profitability could be maintained if successful action was taken to reverse the potential churn. When no action is taken, a firm would incur costs to acquire new customers in order to recover the loss in profitability.

Furthermore, the customers most likely to defect are those currently interacting with other banks, holding a median of 62% of their business at other financial institutions (“Appendix D”, Fig. 5). Based on the Mann–Whitney test, a significant difference (p value 0.00) was found in the share of wallet between the churn group and the group of retained customers. The median share of wallet for the churn group (\({\widehat{\mu }}_{median}\) = 0.38) was substantially lower than that of the nonchurn group (\({\widehat{\mu }}_{median}\) = 0.78). This suggests that the churn group has business with other financial institutions, and their defection represents a missed opportunity to capture that business. By retaining customers, a financial institution can capitalize on their business potential and generate long-term revenue streams.
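This comparison corresponds to the two-sample Wilcoxon rank-sum (Mann–Whitney) test in base R, sketched below with hypothetical vectors of share-of-wallet values.

wilcox.test(share_of_wallet_churned, share_of_wallet_retained)

median(share_of_wallet_churned)    # 0.38 reported for the churned group
median(share_of_wallet_retained)   # 0.78 reported for the retained group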

Regarding the bank’s products, the profitability loss appears to come mainly from credit. For example, rural credit was the most affected product, accounting for 35% of the total lost margin and losing 8% of its margin during the sample period. This information suggests the need for adjustments in the offering of that product.

Managerial implications

To deal with customer churn, marketing managers try to identify in advance the decrease in customer engagement with their company and plan marketing efforts to prevent customers from defecting (Benoit and Poel 2012; Gordini and Veglio 2017). Regarding the retail banking sector, adequate customer retention management has become increasingly relevant due to the growing competition. The entry of new players relying on digital and scalable innovative solutions tends to replace traditional services offered by incumbents and reduce customers’ switching costs (Lazari and Machado 2021).

Managers benefit from the framework proposed in this study in many ways. They can anticipate potentially churning customers three months in advance, even with highly imbalanced data. The framework not only allows managers to know the customers most likely to churn but also gives them a reasonable amount of time to plan marketing efforts proactively to prevent such churn from happening.

In addition, based on the FE step of the framework (Step 1) and using the concept of RFM, we sought to generate features related to churn behavior in the retail banking industry (see Tables 2, 3). This provides a tailored list of adequate predictors that managers can use to harness their existing databases, allowing classification algorithms to predict customer churn more precisely and thus increase the effectiveness of the customer management team to monitor and manage potential churners.

Finally, based on these proposed features and after estimating the classification algorithms, we also identify features with higher gain (Table 8). Analyzing such metrics contributes to understanding features that are the most relevant churn predictors in the retail banking context. For instance, credit purchase recency is the most pertinent feature. Although customers might transfer a credit contract to another banking institution anytime they want, this finding suggests that the existence of credit contracts significantly increases the likelihood of retaining a customer. Another key feature is the use of mobile channel recency. As the competition is becoming increasingly dependent on building digital relationships with customers, engaging them through the frequent use of mobile channels is critical for their loyalty to a company. In summary, knowing features that impact churn behavior is relevant for managers to drive well-designed marketing efforts and avoid customer defection. Thus, adopting the proposed framework has important managerial implications, providing managers with resources to enhance customer retention management and thrive in a digital market environment with growing competition and lower switching costs.

Conclusions

Our study makes a valuable contribution to the research on decision support systems by proposing a framework to model customer churn behavior in the context of highly imbalanced classes. We emphasize the importance of conducting data preprocessing strategies (i.e., FE, IDT-over, and IDT-under processes) to improve model performance. The FE process is fundamental for building new and informative features to better characterize customers’ behavior, directly impacting model effectiveness. The IDT-over and IDT-under strategies handle the class imbalance issue through the ADASYN and NEARMISS algorithms, respectively, providing the machine learning technique with a suitable number of instances at the modeling stage. In the IDT-over stage, the ADASYN algorithm reinforced the decision boundary toward the minority class (churned customers). In the IDT-under step, NEARMISS provided an efficient way to undersample the retained customers of the majority class such that the remaining instances contain less noise and redundancy, making it easier to identify patterns with the XGBoost and elastic net models.

Compared with alternative configurations, we demonstrated that the algorithms used in our proposed framework perform well in terms of PR-AUC, accuracy, recall (sensitivity), and specificity. Additionally, as we tested two classification models (XGBoost and elastic net), we demonstrated that adequate data preprocessing procedures improve the predictive power of classification models relying on different mathematical fundamentals.

The proposed framework also provided a methodological contribution to the literature on churn prediction by adequately dealing with highly imbalanced datasets. The thorough work on data preprocessing in the FE step and rebalancing classes, reinforcing the decision boundary between them, and reducing data redundancy and noise, combined with state-of-the-art classification techniques, led to a higher predictive performance of customer churn. Thus, our study substantially enhances customer retention practices, fostering the adoption of more effective marketing efforts to prevent churn. Preventing churn increases customer portfolio profitability.

These results confirm the effectiveness of conducting detailed data preprocessing before proceeding to model training. Given the lack of extant research on this topic, we encourage additional studies to investigate different data preprocessing methods. They should not only test diverse classification models to achieve gains in predictive performance but also test different combinations of data preprocessing techniques, because doing so has significant potential to provide even more accurate predictions.

Finally, we highlight some limitations of the proposed study that can become the subject of future research. One limitation is the use of a single dataset, mainly due to the challenge of obtaining real datasets from the retail banking industry. We encourage future studies to extend the proposed framework to other cases in this industry or even apply it to binary classification problems other than churn prediction. Another limitation relates to the classification techniques tested in our experiments (XGBoost and elastic net); different algorithms, such as artificial neural networks and support vector machines, can also be tested. The decision to use only two types of algorithms was due to the focus of our study on the data preprocessing phase. A further limitation of the study is the absence of features regarding customer transactions with other companies. Data indicating how customers’ share of wallet changes over time would probably increase predictive performance. For instance, a decrease in the number of transactions of a given customer with a focal company may correspond to an increase in the number of transactions of this customer with competitors. The only available information regarding such behavior was the binary feature indicating whether a customer took credit with other companies in the banking system. However, we found only a weak association of this binary feature with churn; therefore, it does not appear to be a good churn predictor. Given the lack of such features in our dataset, we had to apply the framework using data mainly related to the strict relationship between the focal company and a customer. Future studies can benefit from using features regarding customer transactions with other companies.

Availability of data and materials

The datasets generated and/or analyzed during the current study are not publicly available due to a data privacy agreement signed with the financial institution but are available from the corresponding author on reasonable request.

Notes

  1. The 80% cutoff follows the classic Pareto rule. Appendix C exhibits the complete list of the remaining features retained by XGBoost.

Abbreviations

C1: Configuration 1
C2: Configuration 2
C3: Configuration 3
C4: Configuration 4
C5: Configuration 5
C6: Configuration 6
C7: Configuration 7
C8: Configuration 8
C9: Configuration 9
C10: Configuration 10
C11: Configuration 11
C12: Configuration 12
C13: Configuration 13
C14: Configuration 14
C15: Configuration 15
C16: Configuration 16
C17: Configuration 17
C18: Configuration 18
CCP: Customer churn prediction
FE: Feature engineering
FS: Feature selection
IDT: Imbalanced dataset treatment
IDT-over: Imbalanced dataset treatment oversampling
IDT-under: Imbalanced dataset treatment undersampling
KNN: K-nearest neighbor
KS test: Kolmogorov–Smirnov test
MVI: Missing values imputation
PR-AUC: Precision-recall area under curve
OT: Outliers treatment
RFM: Recency, frequency, and monetary value
RFM/P: Recency, frequency, and monetary value per product
RU: Random undersampling
SMOTE: Synthetic minority oversampling technique

References


Acknowledgements

Not applicable.

Funding

Not applicable.

Author information


Contributions

JBGB designed the model and the computational framework, analyzed the data, and carried out the implementation. GBB, RH, JLB, and MJA verified the analytical methods and contributed to writing the manuscript with input from all authors. CSS, FBL, and MJA supervised the findings of this work and contributed to the final version of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to João B. G. Brito.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Exploratory data analysis

In this appendix, we present an exploratory data analysis to better describe the type of features used, how they are distributed, and how they are correlated with each other. Table 9 presents the number of unique classes in each categorical feature as well as the number of customers in the four most frequent classes of each categorical feature.

Table 9 Descriptive statistics of categorical features

Table 10 shows the number of customers in each class of the binary features, as well as each feature’s mean, to show how they are distributed. Among these features, we highlight that the purchase incidence feature has an average close to 1 (average = 0.98), showing that most customers are highly active. However, high activity does not necessarily mean that a customer will be retained, even though the churn event is rare, which indicates the challenge of modeling churn in this context.

Table 10 Descriptive statistics of binary features

Table 11 shows the following descriptive statistics for each real and integer feature: (1) average, (2) standard deviation, (3) minimum value, (4) median, and (5) maximum value. We highlight how the distributions of purchase recency per product (specifically, credit purchase recency and investment purchase recency) differ significantly from the distribution of the overall purchase recency feature. This indicates that the feature engineering process based on the recency, frequency, and monetary value (RFM) concept was important to generate features that uncover variability not observable using the overall purchase recency feature alone, providing a richer set of predictor features for the learning algorithms and increasing the likelihood of predicting churn more precisely.

Table 11 Descriptive statistics of real and integer features

Appendix B: Confusion matrix C1 and C2

Table 12 shows the average confusion matrix of the 100-fold cross-validation procedure for the recommended configurations C1 and C2.

Table 12 Average confusion matrix for the 100-fold cross-validation experiment

Appendix C: Remaining features resulting from XGBoost

Table 13 shows the complete list of original and engineered features retained by XGBoost, sorted by descending Gain.

Table 13 Complete list of remaining features resulting from XGBoost

Appendix D: Share of wallet between churned and retained customers (Fig. 5)

Fig. 5
figure 5

Mann–Whitney test to compare the share of wallet between churned and retained customers

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Brito, J.B.G., Bucco, G.B., Heldt, R. et al. A framework to improve churn prediction performance in retail banking. Financ Innov 10, 17 (2024). https://doi.org/10.1186/s40854-023-00558-3


  • DOI: https://doi.org/10.1186/s40854-023-00558-3
