1 Introduction

In censored regression, the outcomes are sometimes expressed as exact values and at other times as open intervals. For example, we might know the precise value of a substance concentration or just that it is above or below a given threshold. This kind of data is widespread in real-world applications. We know that the actual value is lower or higher than a threshold in the case of censorship. The opposite case is truncation, in which the value cannot be lower or higher than a given threshold. Data censoring and truncation can coexist, thus increasing the complexity of the problem. This may be the case in an actual application, especially when considering truncation from below by zero. For example, we might not be able to measure the concentration of a substance below a given threshold, but at the same time, we know that the concentration cannot take on a negative value. In this case, the censoring is from below, and the censored interval is also truncated at zero. The actual value will lie within a closed interval bounded by zero and the detection threshold. In this article, we focus on censoring and truncation from the perspective of deep learning.

The most popular model for handling censored data is the tobit model. It was pioneered in 1958 in the paper [1] by James Tobin [2]. The purpose of the analysis was to predict household expenditure on durable goods, and the challenge was to model the regression problem so that it took into account the fact that there cannot be negative expenditures [2]. Another application of censored regression from econometrics is [3], where the risk of failing to pay the loans of Swiss small and medium-sized enterprises is analyzed. Multiperiod financial returns are predicted in [4]. The authors try to construct the conditional probability distribution of the returns and note that the tails of the distribution are especially important for risk estimation tools [4]. The distribution estimation is also important for option pricing [5] and computing confidence intervals [6], as described in [4]. In [7], neurodegenerative clinical scores are predicted based on brain imaging (structural magnetic resonance imaging). The clinical scores have lower and upper bounds [7]. Precipitation prediction from meteorologic measurements taken from multiple stations across Germany was done in [8]. CATNAP [9] is a very large dataset containing experiments about the potency of various antibodies against HIV strains. However, some of the values measured in CATNAP are censored, and [10] used the tobit model [1, 2, 11] to process this data. The landscape of censored regression is vast. At the time of this writing, a search in the Web of Science database resulted in more than 7000 articles containing the words tobit or censored regression in their titles, abstracts, or keywords. Based on the research areas displayed by Web of Science, this topic is most prevalent in business economics (3690 articles), mathematics (1713 articles), environmental sciences ecology (1490 articles), agriculture (966 articles), health care sciences services (854 articles), and engineering (812 articles).

When we specify censoring or truncation, we always refer to the dependent variables and never to the independent variables. We can formalize the concepts of truncation and censoring as shown in Eqs. (1) and (2). It can be seen that \(y^*\), the true value or latent variable, is not directly observed; instead, the dataset will contain pairs of x and y variables, where y is the censored version of \(y^*\). This notation is also used in [2]. Truncation applies to the true value \(y^*\) and implicitly to y, given that y is an image of \(y^*\). Truncation means that the true value \(y^*\) cannot take values outside of the truncation bounds \(t_l\) and \(t_u\), while censoring means that the true value \(y^*\) does take values outside of the censoring thresholds \(c_l\) and \(c_u\) but that these values are unknown.

$$\begin{aligned} y^*& = f(x); \ t_l<y^*<t_u \end{aligned}$$
(1)
$$\begin{aligned} y& = {\left\{ \begin{array}{ll} c_l, & \quad \text {if}\ y^*\le c_l\\ y^*, & \quad \text {if}\ c_l<y^*<c_u\\ c_u, & \quad \text {if}\ y^*\ge c_u \end{array}\right. } \end{aligned}$$
(2)

where:

  • x is the independent variable, also known as input;

  • f(x) is a function that takes as input the independent variable and outputs the dependent variable \(y^*\);

  • \(y^*\) is the true value; in practice, this value is not always known because of the censoring;

  • y is the censored value of \(y^*\);

  • \(t_l\) and \(t_u\) are the lower and upper truncation thresholds;

  • \(c_l\) and \(c_u\) are the lower and upper censoring thresholds.

In our case, the function f(x) represents a neural network. Throughout this paper, we refer to the interval \((c_l, c_u)\) as the uncensored interval or uncensored domain and to the interval \(({-\infty ,\ c}_l] \cup [c_u,\ \infty )\) as the censored interval or censored domain. Truncation at \(t_l\) or \(t_u\) can occur within or outside the censored interval. If truncation occurs within the uncensored interval (for the uncensored samples), this will not pose a significant challenge, and the neural network will naturally handle the truncation due to its capacity to model highly nonlinear and complex data. However, if the truncation occurs within the censored interval, this is a much more complex situation. The model learns how to regress values in the censored interval by leveraging the censoring bounds \(c_l\) and \(c_u\) and the information from the uncensored samples, without learning from exact data. This is certainly much more challenging than classical regression. One problem with neural networks is that they tend to overfit and output extreme values for the censored interval, as shown in the experiments described in Sect. 5. For this reason, neural networks may output values that are beyond the truncation limit. We address these challenges by designing the loss functions to take into account not only censoring but also truncated censoring, or in other words, the case where the truncation limit \(t_l\) or \(t_u\) is included in the censored interval \(({-\infty ,\ c}_l] \cup [c_u,\ \infty )\). Truncation limits can be used even when there are no known limits, as a regularization technique to prevent the generation of extreme values. We measure the performance of our model in the uncensored and censored domains, with a particular interest in the censored domain, which is the more complicated case. In a typical censored regression problem, the actual values of the censored outcomes are unknown; however, to measure the model’s performance with respect to the censored domain, we use an uncensored dataset and censor the outcomes above and below a certain threshold. We use the censored variant of the dataset in the training and validation stages and use the full uncensored dataset for the final evaluation. We also perform ablation experiments in which we remove from the loss functions the parts relating to truncation, censoring, and both.

The current article focuses on ways to deal with censoring and truncation by adapting the loss functions. We are not concerned with optimizing networks for a given task. Instead, we try to isolate the effects of various components of our loss functions while holding everything else constant. We chose situations in which classical loss functions fail due to the effects of censoring and truncation. We showcase how changing the loss functions corrects the problems and results in better predictions.

The landscape of neural networks is vast, with many architectures and variations. At the same time, the literature on censored regression, especially on tobit models, has evolved significantly over the years. The current article has an exploratory nature and does not try to examine a wide range of network architectures. Due to the increased complexity of such an analysis and computational constraints, we limited the scope of the current study to fully connected networks which are not very deep. We found that networks with three layers are working well for the majority of our experiments. Also, the scope of this article only covers the tobit type one as defined in [2] and the Gaussian distribution.

Our work comes with the following innovations:

  • We are unaware of other studies that measure the performance of a neural network for censored regression over the censored domain.

  • To the best of our knowledge, our work is the first to use the tobit model as a loss function for training neural networks in an end-to-end manner using gradient backpropagation.

  • The losses we propose are more generic than the existing approaches because we also take data truncation into account, which is a separate notion from data censoring. We are unaware of any other work that considers data truncation in combination with the tobit likelihood function for training neural networks.

  • The standard deviation of the error is a component of the tobit likelihood loss function; we introduce further innovations by learning this component in the network training process as opposed to other approaches where it is considered a hyperparameter. Our models also allow the standard deviation of the error to be dynamic and estimated by a separate network, a technique not investigated elsewhere.

  • Finally, we experiment with a reparametrization of the tobit likelihood, which we are unaware of being used for neural network training by others.

2 Related works

In the related works section, we describe one of the most popular methods for censored regression—the tobit model. In the following subsections, we highlight other related papers. Some of those inspired our work, and we describe the differences between their approaches and ours. We end with a summary of censored regression and deep learning from the perspective of survival analysis.

2.1 Overview of the tobit likelihood

The tobit model is historically one of the most popular and effective ways of performing censored regression. It has undergone many variations but has mainly been applied to linear models. In our opinion, adapting the tobit model to deep learning practices may have a significant impact. It might open up new research possibilities in all areas that rely on censored regression. The tobit model optimizes a likelihood function that combines a probability density function (pdf) and a cumulative density function (cdf). The pdf expresses the likelihood of exact values, whereas the cdf expresses the likelihood of censored intervals. The pdf and cdf usually belong to the same distribution, but there are variations in which they do not [2].

The tobit likelihood function is described in [1, 2, 11]; in the current paper, we summarize this function in Eqs. (3)–(6) and modify some of the notations and formulae without changing their mathematical meanings. The original model developed in [1] was censored from below by zero, and the same Gaussian distribution was assumed for the pdf and cdf. The true value \(y^*\) was estimated by a function f(x) plus an error \(\varepsilon\), which was assumed to have a Gaussian distribution with mean zero and standard deviation \(\sigma\), as follows: \(y^*=f(x)+ \varepsilon\); \(\varepsilon \sim N(0, \sigma ^2)\). In the classical tobit model, f(x) is a linear function [1, 2], whereas in our work, it is a neural network. The probability of an uncensored outcome is the probability of the true value \(y^*\) taking a specific value y (the observed dependent variable), and this is expressed using the pdf of the standardized error:

$$\begin{aligned} \begin{aligned} Pr(y^*=y)&=Pr(f(x)+\varepsilon =y)\\&=Pr(\varepsilon =y-f(x))\\&=Pr\left( \frac{\varepsilon }{\sigma }=\frac{y-f(x)}{\sigma }\right) \\&=\frac{1}{\sigma }\varphi \left( \frac{y-f(x)}{\sigma }\right) \end{aligned} \end{aligned}$$
(3)

where

  • \(\sigma\) is the standard deviation of the error;

  • \(\varphi\) is the normal pdf, having zero mean and unitary variance.

In this section, we derive the tobit likelihood function based on the assumption of a lower censoring limit without truncation, in a similar way to the original model [1], which was censored by zero. In Sect. 3.2, we analyze the case with both lower and upper censoring and truncation. The probability of a censored outcome, which is the probability of the true value \(y^*\) being lower than the lower censoring threshold \(c_l\), is expressed in terms of the cdf of the standardized error, as follows:

$$\begin{aligned} \begin{aligned} Pr(y^*\le c_l)&=\Pr (f(x)+\varepsilon \le c_l)\\&=Pr(\varepsilon \le c_l-f(x))\\&=Pr\left( \frac{\varepsilon }{\sigma }\le \frac{c_l-f(x)}{\sigma }\right) \\&=\Phi \left( \frac{c_l-f(x)}{\sigma }\right) \end{aligned} \end{aligned}$$
(4)

where

  • \(\Phi\) is the normal cdf, having zero mean and unitary variance.

The likelihood function L is obtained by multiplying all the probabilities for the censored and uncensored observations, as shown in (5). The log-likelihood is obtained by applying a logarithm to the likelihood function (typically the natural logarithm), as shown in (6). The log-likelihood function ln(L) can be maximized with respect to the function f(x) and \(\sigma\); in other words, f(x) and \(\sigma\) are the parameters that we need to find. In the original paper [1], these parameters were found using the method of maximum likelihood. However, in [2], more methods for solving the optimization problem were studied, including the probit maximum likelihood, least squares, Heckman’s two-step estimator, nonlinear least squares, nonlinear weighted least squares, the tobit maximum likelihood estimator, and expectancy maximization. In our work, we also use a variation in the tobit likelihood, which corresponds to the approach in [12]. In this method ([12], as cited in [2, 11]), the likelihood function undergoes a reparametrization that enables the likelihood to be globally concave with respect to the new set of parameters. Equation (7) presents this new form of the likelihood function in which the formulae from [2, 11, 12] are adapted:

$$\begin{aligned} L& = \prod _{i}{\frac{1}{\sigma }\varphi \left( \frac{y_i-f(x_i)}{\sigma }\right) }\prod _{j}\Phi \left( \frac{c_l-f(x_j)}{\sigma }\right) \end{aligned}$$
(5)
$$\begin{aligned} ln(L)& = \sum _{i}{\left[ ln\left( \varphi \left( \frac{y_i-f(x_i)}{\sigma } \right) \right) -ln\left( \sigma \right) \right] }\nonumber \\ & +\sum _{j}{ln\left( \Phi \left( \frac{c_l-f(x_j)}{\sigma }\right) \right) }\nonumber \\\approx & -\sum _{i}{\left[ \frac{1}{2}\left( \frac{y_i-f(x_i)}{\sigma } \right) ^2+ln\left( \sigma \right) \right] }\nonumber \\ & +\sum _{j}{ln\left( \Phi \left( \frac{c_l-f(x_j)}{\sigma }\right) \right) } \end{aligned}$$
(6)
$$\begin{aligned} \gamma& = 1/\sigma \nonumber \\ \delta (x)& = \gamma \cdot f(x)\nonumber \\ ln(L)\approx & \sum _{i}{\left[ -\frac{1}{2}({\gamma y}_i -\delta (x_i))^2+ln(\gamma )\right] }\nonumber \\ & +\sum _{j}{ln(\Phi (\gamma c_l-\delta (x_j)))} \end{aligned}$$
(7)

where

  • \(\gamma\) is the inverse of the standard deviation of the error;

  • \(\delta (x)\) is a neural network that takes as input the independent variable x and estimates the true value \(y^*\) multiplied by the inverse of the standard deviation \(\gamma\);

The tobit model has been extensively applied by researchers and has numerous variations. Takeshi Amemiya [2] carried out a comprehensive literature review of these tobit models and divided them into five categories. The model presented here corresponds to type one and is the simplest form. Also, our implementation is based on the type one tobit. In type one, the value of \(y^*\) determines whether censoring occurs, and the observed values y are also an image of \(y^*\). This varies for the other types of tobit models; for example, in a type two tobit, the censoring is based on the value of a latent variable, although the observed variable subject to censoring is a different one [2]. In [13], the error distribution of the tobit model is extended to other distributions from the elliptically contoured distributions family such as Laplace, logistic, and the student’s T distributions. In [14], the log-likelihood of the tobit is optimized indirectly by leveraging a variational lower bound of the tobit log-likelihood. The authors use this technique to adapt the Gaussian process regression algorithm to censored data [14].

2.2 Subspace network

In our opinion, the work that is the closest to ours is the subspace network [7]. In this section, we elaborate on the similarities and differences between our work and the subspace network. One similarity is that in [7], as in our approach, the tobit likelihood function is used as a loss function to train neural networks through gradient backpropagation.

Another similarity is that the mean squared error (MSE) was combined with a rectified linear unit (ReLU) activation function to achieve censoring of the network output [7]. The ReLU was inserted before the loss calculation [7]. The MSE loss and the ReLU function were used similarly in [15]. In addition, the authors in [15] extend this method by combining the MSE and the cross-entropy losses for handling scenarios where the observation of the censored variable depends on a separate variable, similar to the type two tobit models [2]. We applied the same ideas from [7, 15] when creating our MSE-based loss for censored regression. The difference is that our method can handle truncated censoring as well.

Unlike in our work, the networks in [7] are more shallow, have a fixed structure, and are not trained in an end-to-end way. The network in [7] can be expressed as \(ReLU(U^TVx+\varepsilon )\), where \(\varepsilon\) is the error. The parameters U and V are learned through a technique named block coordinate descent [7], an iterative method in which U and V are updated alternately: first, U is kept constant while V is updated, and then, U is updated while V is kept constant. U and V are updated in turns through gradient descent performed on a loss function corresponding to the tobit likelihood plus regularization terms [7]. In our work, the gradient is propagated throughout all of the network layers at each gradient descent update, and we do not use the alternate layer training approach from [7]. Our training procedure is the standard gradient descent training used for neural networks; in our opinion, this has the advantages of both simplicity in terms of software development and flexibility since the networks trained in this fashion are not constrained to a particular structure. In [7], deeper structures were formed by stacking several shallow networks, and the training took place by turns, one network at a time. There was no end-to-end training of these networks, and each network was trained using the output from the previous network and the original input x, a technique similar to boosting [7]. In our case, the standard deviation \(\sigma\) is learned jointly with the rest of the network training, while in [7], it was estimated from the data.

To the best of our knowledge, our work is the first to use the tobit model as a loss function for training neural networks in an end-to-end manner using gradient backpropagation. This may be because the tobit model is traditionally viewed as a linear model and is not usually associated with deep learning; it may also be because training a network using a tobit likelihood function without employing the truncation techniques that we will discuss in the following sections is likely to output extreme values, giving poor results. Furthermore, the data need a particular type of preprocessing, which increases the complexity of this implementation. We will discuss all these aspects in Sect. 4. We introduce further innovations by experimenting with additional variations to handle the standard deviation \(\sigma\) and by using the reparametrized tobit likelihood from [12]. The truncated censoring that was briefly mentioned in the introductory section is another important innovation in terms of training neural networks with the tobit likelihood.

2.3 Gradient tree-boosted tobit model

Grabit is a decision tree-based learner that leverages the tobit likelihood as a loss function in a gradient boosting algorithm. Grabit is used for a binary classification problem [3]. To alleviate class imbalance and data scarcity problems for the rare class, the authors leveraged an auxiliary variable that was correlated with the classification task [3]. The classification offers a rigid view that does not take into consideration the fact that there may be samples near the decision boundary that are similar; on the other hand, regressing a continuous auxiliary variable is a better method of modeling these cases [3]. We note that the first type of tobit likelihood loss was used in [3]; nevertheless, for the cases in which the decision function and the auxiliary variable are different and less correlated, in theory, the tobit type two likelihood would be more appropriate, as described in [2].

2.4 Quantile regression neural networks

In statistics, a quantile is a value that splits a set of numbers so that \(p\%\) of the numbers are lower than the quantile and \((1-p)\%\) of the values are higher, where p is a probability. For example, the \(25\%\) quantile, also known as the first quantile, splits the dataset so that \(25\%\) of the samples are lower than the quantile and \(75\%\) are higher, while the median splits the dataset into equal parts. We will explain why quantiles are useful for censored regression through a simple example. Suppose we have a series of numbers: 1, 3, 5, 7, 9. We can find the median and the first quantile (corresponding to \(25\%\) of the samples) by counting the elements in the series: the median is 5, and the first quantile (\(25\%\)) is 3. Now, suppose we are censoring the series of numbers at 6; thus, the series becomes 1, 3, 5, 6, 6. By applying the same count, we can see that the median and first quantile remain unchanged by censoring, unlike the mean. A quantile can be estimated through an optimization problem [16, 17], as documented in [4]. Works that have used neural networks for quantile regression include [18, 19], as cited in [4]. The censored least absolute deviations (LAD) function [8, 20] can be minimized to find the parameters of f(x), the function that estimates the quantile. Equation (8) is an adapted version of the censored LAD function used in [8, 20].

$$\begin{aligned} \begin{aligned} \Psi (e)&= {\left\{ \begin{array}{ll} p\cdot e, & \quad \text {if}\ e\ge 0\\ (p-1)e, & \quad \text {if}\ e<0 \end{array}\right. }\\ L&=\sum _{i}{\Psi (y_i-max(c_l,f(x_i)))} \end{aligned} \end{aligned}$$
(8)

where

  • p is the probability corresponding to the quantile;

  • e is the error, equal to \(y_i-max(c_l,f(x_i))\);

In [21], censored quantile regression was extended to neural networks. Also, [21] provided the possibility of estimating the complete pdf based on the quantiles. In Eq. (8), the value of the \(\Psi\) function is always positive; if we set the probability p in Eq. (8) to \(50\%\) to estimate the median, we are left with the mean absolute error (MAE) function divided by two. In the context of training a neural network, we can drop the division by two since this only acts as a scale factor for the learning rate. Inspired by the work in [4, 8, 21], we experiment with a loss that uses the MAE for censored regression. This loss also takes into account the truncation that can occur inside the censored domain.

2.5 Tobit on top of deep learning features

In [22], a multivariate tobit regression is applied to features computed using unsupervised deep learning, with the aim of predicting the risk of traffic collisions. A system composed of three modules is used [22]. The first module is a deep learning denoising autoencoder that computes the input representations through unsupervised training [22]. The features calculated by the autoencoder (the first module) are fed into a multivariate tobit model that forms the second module [22]. The third module leverages the predictions from the second module and builds a state-space model based on an ensemble Kalman filter [23, 24], as cited in [22]. The use of the tobit model on top of the deep learning unsupervised features may provide advantages through a reduction in overfitting; however, in our opinion, using the tobit loss described in the present article, the two-step process can be turned into a one-step process by training a deep learning version of the tobit model, which extracts features and performs regression jointly.

2.6 Survival analysis

Censored data typically appear in survival analysis applications, where the occurrence of events is analyzed within a given timeframe. Examples of events include the failure of industrial machinery, cure, or death of a patient (see, e.g., the use of statistical models and machine learning approaches applied to real-world COVID-19 data [25]). Typical use cases include the prediction of event occurrence, time-dependent risk assessments, and risk-based decisions. The data are censored relative to time; for example, a patient in a study is followed only for a finite period. Survival analysis is related to censored regression, but the two are not the same, since in survival analysis we have the dimension of time, and the event of interest typically occurs only once, although there are also survival analysis methods for recurring events [26, 27]. A survival analysis problem can be viewed as a censored regression problem, but not every censored regression problem can be viewed as a survival analysis problem.

In our opinion, the most well-known ways of modeling survival analysis are Kaplan–Meier [28], Cox proportional hazards [29], and accelerated failure time [30]. Decision tree-based models include survival trees [31, 32] and survival random forests [32, 33]. Survival analysis is a vast topic, and an overview is outside the scope of the current article; hence, we only mention a few selected papers related to survival analysis and deep learning and direct the reader to other reviews for a deeper understanding of the subject. The interested reader can find an introduction to survival analysis in [34] and in the literature reviews in [26, 35, 36]. A comprehensive review is found in [37], which covers in great detail the literature on Cox proportional hazards, alternatives to the proportional hazards model, and other advanced topics such as frailty, competing risks, multiple failures, and multi-dimensional survival data. Frailty models are extensively discussed in the review in [38].

Neural networks have also been used. In [39, 40], the loss function is similar to the partial log-likelihood from the Cox proportional hazards model. In [41], the period is discretized into n intervals, and the model outputs a vector of size n, where each vector element represents the probability of survival within that time interval. In survival analysis, sequential data are often encountered, and recurrent neural networks (RNNs) are a natural choice. The losses used to train recursive networks are diverse. A likelihood-based loss is often employed. In [42], the hazard for a sample is estimated by the RNN for each time interval and is optimized using a loss function composed of two parts: one that minimizes the negative log-likelihood corresponding to event occurrence and one that is the cross-entropy of the survival. For discrete time intervals, the likelihood is the product of the probabilities of the event happening in a given interval (the hazard) multiplied by the probability that the event did not happen until that time (the survival) [42]. By combining RNNs with survival analysis, the return times of website users were predicted in [43], and fraud detection in social media was carried out in [44]. In [45,46,47], the RNN was trained using likelihood functions that assumed the Weibull probabilistic distribution. The authors of [46] leveraged the theory of renewal processes, and the interested reader can find more about these in [48]. The Weibull and log-normal distributions were combined with deep learning for the assessment of competing risks in [49]. DeepHit is another model that was designed for the problem of competing risks [50]. The loss function has two components, one of which is likelihood based and the other ranking based [50]. The work in [51] extended the scheme in [50] by changing the network from a fully connected to a gated recurrent unit (GRU) [52] with attention. In addition to the two losses described in [50], a third loss was added to regularize the GRU by predicting the covariates in the next timestep [51]. In [53], a long short-term memory network [54] was trained with two losses: a cross-entropy loss modified for censored data and a ranking loss that was the negative C-index upper bound [53, 55]. The C-index is a well-established evaluation metric for survival analysis that measures how well the predictions are ranked. The loss function presented in [56] combines the Kullback–Leibler divergence between the estimated survival and the binary outcomes at each timestep with a sum of penalties that enforces a decrease in the survival probability over time. In [47], the authors experimented with two networks: the first estimated the parameters of a Weibull distribution and the second was trained using multi-task learning [57], as cited in [47]. The multitask learning objective was a combination of cross-entropy, squared error, and censored absolute error [47]. A generative adversarial network (GAN) [58] was used to model the distribution of event times in [59].

3 Methods

The methods section describes the loss functions we are proposing for training neural networks on the task of censored regression. It contains two subsections: one related to losses based on the mean absolute error (MAE) and mean squared error (MSE) and another related to tobit-based losses.

3.1 Censored mean absolute error and censored mean squared error losses

Equations (1) and (2) formalize the concepts of truncation and censoring. In truncation, the true value \(y^*\) takes values in the interval \((t_l,\ t_u)\), where \(t_l\) and \(t_u\) are the lower and upper truncation limits. The variable y is the censored image of the actual variable \(y^*\), and \(c_l\) and \(c_u\) are the lower and upper censoring thresholds. We refer to the interval \((c_l, c_u)\) as the uncensored interval or uncensored domain, and to the interval \(({-\infty ,\ c}_l] \cup [c_u,\ \infty )\) as the censored interval or censored domain. When the true value \(y^*\) takes values in the uncensored interval \((c_l, c_u)\), the censored variable y is equal to the true value \(y^*\), and when the true value \(y^*\) takes values in the censored interval \(({-\infty ,\ c}_l] \cup [c_u,\ \infty )\), the censored variable y is equal to one of the censoring limits \(c_l\) or \(c_u\). The censored variable y is the observed ground truth, while the true value \(y^*\) is what we are aiming to estimate. The truncation of \(y^*\) is also the same as the truncation of y, since y will always take values within a smaller interval than \(y^*\). Truncation is likely to pose challenges when it occurs in the censored interval, as in this case, the exact values are not observed. The loss definitions used throughout this section will use the same notations as in Eqs. (1) and (2). The censored mean absolute error (CMAE) is used to estimate the conditional median, and the theoretical justification for using it is detailed in Sect. 2.4. The CMAE can be regarded as a typical MAE in which the network output is censored; in this approach, the gradient is propagated through the network only if the output of the network is within the censored interval \((c_l, c_u)\). This has an intuitive explanation as follows: if the network f(x) outputs a value smaller than the lower censoring limit \(c_l\) or greater than the upper censoring limit \(c_u\), this is not considered an error, and the loss is, therefore, zero in these cases. In contrast, for truncation, the loss should penalize the network for outputting values outside of the truncation bounds \(t_l\) and \(t_u\). The censored mean squared error (CMSE) is defined similarly to the CMAE, except that the absolute errors are replaced with squared errors. Equations (9), (10), (11), (12), and (13) show the CMAE and CMSE losses.

$$\begin{aligned} {CMAE}_T& = \sum _i\big [\vert f_c(x_i)-y_i \vert + \vert f_{tl}(x_i)-t_l \vert + \vert f_{tu}(x_i)-t_u \vert \big ] \end{aligned}$$
(9)
$$\begin{aligned} {CMSE}_T& = \sum _i\big [(f_c(x_i)-y_i)^2+(f_{tl}(x_i)-t_l)^2+(f_{tu}(x_i)-t_u)^2\big ] \end{aligned}$$
(10)
$$\begin{aligned} f_c(x)& = {\left\{ \begin{array}{ll} c_l, & \quad \text {if}\ f(x)\le c_l\\ f(x), & \quad \text {if}\ c_l<f(x)<c_u\\ c_u, & \quad \text {if}\ f(x)\ge c_u \end{array}\right. } \end{aligned}$$
(11)
$$\begin{aligned} f_{tl}(x)& = {\left\{ \begin{array}{ll} f(x), & \quad \text {if}\ f(x) < t_l\\ t_l, & \quad \text {otherwise} \end{array}\right. } \end{aligned}$$
(12)
$$\begin{aligned} f_{tu}(x)& = {\left\{ \begin{array}{ll} f(x), & \quad \text {if}\ f(x) > t_u\\ t_u, & \quad \text {otherwise} \end{array}\right. } \end{aligned}$$
(13)

In the loss described above in Eqs. (9), (10), (11), (12), and (13), the network output is always clamped between the censoring thresholds. Suppose our dataset contains an uncensored sample with a ground truth close to the censoring threshold. At the same time, suppose our model estimates a value that is farther away from the ground truth and lies in the censored domain. In other words, the ground truth is in the uncensored domain, and the prediction is in the censored domain. In this case, the distance between the uncensored ground truth and the censoring threshold is smaller than the distance between the uncensored ground truth and the actual model prediction. Hence, in this case, the loss is smaller than it should be, and the learning signal is weaker despite the ground truth being uncensored. One might think that this problem could be solved by clamping only the censored samples rather than clamping all samples; however, we generally advise against this, because in our opinion, this introduces a learning bias. We can understand why this is true if we look at the problem from the opposite direction. This time, suppose we have a censored ground truth and that the model predicts a value within the uncensored domain. The loss will be computed between the model prediction that falls into the uncensored domain and the censoring threshold. In this case, there is no way to compute the loss between the actual ground truth and the model prediction, because the ground truth is unknown (as it is censored). If we clamp only the censored samples, there will be a disproportion between the gradients for the samples that are uncensored but are predicted in the censored domain and the samples that are censored but predicted in the uncensored domain.

3.2 Tobit likelihood-based losses

Motivated by the theory of the tobit model detailed in Sect. 2.1, and inspired by the works in [3, 7, 22], we created four loss functions based on the tobit log-likelihood. The difference between them is the handling of the standard deviation \(\sigma\). We look at four cases:

  1. 1.

    \(\sigma\) is a fixed value that is learned end-to-end during training;

  2. 2.

    \(\sigma\) is estimated by a dedicated neural network;

  3. 3.

    we explore the reparametrization from [2, 11, 12];

  4. 4.

    we combine the reparametrization and the heteroscedastic error.

We split the batches of training samples into three parts: uncensored samples, lower-censored samples, and upper-censored samples. The likelihood for the uncensored samples is the normal pdf, as shown in Eq. (3), while the likelihood for the lower-censored samples is the normal cdf, as shown in Eq. (4). For the upper-censored samples, the likelihood is one minus the normal cdf, which is equal to the cdf of the negative value, as shown in Eq. (14). Note that for a normally distributed random variable z, \(1-cdf(z)=cdf(-z)\).

$$\begin{aligned} \begin{aligned} Pr\left( y^*\ge c_u\right)&=1-Pr\left( y^*< c_u\right) \\&=1-\Phi \left( \frac{c_u-f(x)}{\sigma }\right) \\&=\Phi \left( \frac{f(x)-c_u}{\sigma }\right) \end{aligned} \end{aligned}$$
(14)

The tobit-based loss is split into three components that deal with the uncensored, lower-censored, and upper-censored samples, as shown in (15)–(18).

$$\begin{aligned} L& = L_p+L_l+L_u \end{aligned}$$
(15)
$$\begin{aligned} L_p& = \sum _{i}\left[ \frac{1}{2}\left( \frac{y_i-f(x_i)}{\vert \sigma \vert }\right) ^2+ln(\vert \sigma \vert +\varepsilon )\right] \end{aligned}$$
(16)
$$\begin{aligned} L_l& = -\sum _{j} l n\left( \Phi \left( \frac{c_l-f(x_j)}{\vert \sigma \vert }\right) +\varepsilon \right) \end{aligned}$$
(17)
$$\begin{aligned} L_u& = -\sum _{k} ln\left( \Phi \left( \frac{f\left( x_k\right) -c_u}{\vert \sigma \vert }\right) +\varepsilon \right) \end{aligned}$$
(18)

Equations (16), (17), and (18) were derived using the theory presented in Sect. 2.1. Note that the losses have negative values for the log-likelihoods; also note the swapping of the terms \(f\left( x_k\right)\) and \(c_u\) from Eq. (18) compared to Eq. (17), which is explained by Eq. (14). The value \(\varepsilon\) is included with all logarithms for numerical stability, as the logarithm takes very large negative values for arguments close to zero. In our experiments, we set \(\varepsilon\) to \(1\textrm{e}{-40}\). We make sure the standard deviation of the error is always positive by taking the absolute value of the parameter \(\sigma\). The parameter \(\sigma\) is learned end-to-end with the rest of the network during training.

Next, we present the tobit-based loss that accounts for truncation. The probability that the true value \(y^*\) is below or above the truncation limit can be expressed using the cdf, as shown in Eqs. (19) and (20). The derivations of Eqs. (19) and (20) are similar to those of Eqs. (4) and (14).

$$\begin{aligned} Pr\left( y^*\le t_l\right)& = \Phi \left( \frac{t_l-f(x)}{\sigma }\right) \end{aligned}$$
(19)
$$\begin{aligned} P\left( y^*\ge t_u\right)& = \Phi \left( \frac{f(x)-t_u}{\sigma }\right) \end{aligned}$$
(20)

We only address the case where the truncation interval is included in the censoring interval \((-\infty ,\ t_l]\subset (-\infty ,\ cl]\) and \([t_u,\ \infty )\subset [c_u,\ \infty )\); in other words, \(t_l<c_l\) and \(t_u>c_u\). If the truncation lies in the uncensored interval \(t_l\in (c_l,c_u)\) or \(t_u\in (c_l,c_u)\), this poses no challenge since neural networks can fit nonlinear data. To account for truncation, we subtract a cdf representing the probability that the actual value \(y^*\) is within the truncation interval from the cdf corresponding to the likelihood of the censored samples. Another way to think about this is that we compute the likelihood of \(y^*\in (t_l,c_l]\), which is equal to the likelihood of \(y^*\in (-\infty ,c_l]\) minus the likelihood of \(y^*\in (-\infty ,t_l]\) for lower censoring and truncation. Similarly, for upper censoring and truncation, the likelihood of \(y^*\in [c_u,t_u)\) is equal to the likelihood of \(y^*\in [c_l,\infty )\) minus the likelihood of \(y^*\in [t_l,\infty )\), as shown in Eqs. (21) and (22).

$$\begin{aligned} Pr\big (y^*\in (t_l,c_l]\big )& = Pr(y^*\le c_l)-Pr(y^*\le t_l)\nonumber \\& = \Phi \left( \frac{c_l-f(x)}{\sigma }\right) -\Phi \left( \frac{t_l-f(x)}{\sigma }\right) ; \nonumber \\ & \quad given\ c_l>t_l \end{aligned}$$
(21)
$$\begin{aligned} Pr\big (y^*\in [c_u,t_u)\big )& = Pr(y^*\ge c_u)-Pr(y^*\ge t_u)\nonumber \\& = \Phi \left( \frac{f(x)-c_u}{\sigma }\right) -\Phi \left( \frac{f(x)-t_u}{\sigma } \right) ; \nonumber \\ & \quad given\ c_u<t_u \end{aligned}$$
(22)

The loss function that accounts for truncation is expressed in Eqs. (23), (24), (25), and (26). The loss presented in the current section and the remainder of the variations derived here accounts for both lower and upper truncation; however, the reader should be aware that in our experiments, we only used truncation from below due to the nature of our datasets. Hence, the upper truncation presented in the losses was not experimentally validated, and only the lower truncation was considered. Nevertheless, we can theoretically justify the expressions for the upper truncation based on Eqs. (14), (20), and (22). The interested reader can find the versions of the losses without upper truncation in Appendix A, where we present the losses as used in the experiments. In Appendix A, we also present versions without truncation or censoring as part of our ablation experiments.

$$\begin{aligned} L& = L_p+L_l+L_u \end{aligned}$$
(23)
$$\begin{aligned} L_p& = \sum _{i}\left[ \frac{1}{2}\left( \frac{y_i-f(x_i)}{\vert \sigma \vert }\right) ^2+ln(\vert \sigma \vert +\varepsilon )\right] \end{aligned}$$
(24)
$$\begin{aligned} L_l& = -\sum _{j} ln\Bigg (\Phi \left( \frac{c_l-f(x_j)}{\vert \sigma \vert }\right) -\Phi \left( \frac{t_l-f(x_j)}{\vert \sigma \vert }\right) +\varepsilon \Bigg ) \end{aligned}$$
(25)
$$\begin{aligned} L_u& = -\sum _{k} ln\Bigg (\Phi \left( \frac{f(x_k)-c_u}{\vert \sigma \vert } \right) -\Phi \left( \frac{f(x_k)-t_u}{\vert \sigma \vert }\right) +\varepsilon \Bigg ) \end{aligned}$$
(26)

We could also consider \(\sigma\) as a hyperparameter and find it through experimentation. However, in the present paper, we take a different approach and learn it through gradient descent. In the previous losses, \(\sigma\) was a fixed scalar; however, we can make it dynamic by using a linear function or a two-layer network to estimate it and to train it end-to-end with the regression network. This might help when dealing with highly heteroscedastic data. The results of the experiments described in Sect. 5 show that this approach yields unsatisfactory results, possibly due to overfitting and numerical instability. However, the same concept of dynamic \(\sigma\) combined with the reparametrization scheme from [12] provides promising results.

Another variation on the tobit loss is based on the reparametrization from [12], as shown in Eq. (7). The reparametrized likelihood is expressed in terms of the inverse of the standard deviation \(\gamma =1/\sigma\) and \(\delta \left( x\right) =\gamma \cdot f\left( x\right)\), where \(\delta \left( x\right)\) is the output of the neural network. The network \(\delta \left( x\right)\) estimates the true value \(y^*\) scaled by \(\gamma =1/\sigma\). Hence, at training time, the loss is optimized with respect to \(\gamma\) and \(\delta \left( x\right)\), and in order to infer the results \(y^*\) (at evaluation time or in production), the output of the network \(\delta \left( x\right)\) needs to be divided by \(\gamma\), as shown in Eq. (31). Similarly, the standard deviation \(\sigma\) is found by inverting \(\gamma\), i.e., \(\sigma =1/\gamma\). This reparametrization has the advantage that the likelihood is globally concave with respect to the new set of parameters \(\gamma\) and \(\delta \left( x\right)\) [2, 11, 12]. The reparametrization-based tobit loss is formalized in Eqs. (27), (28), (29), and (30), and Eq. (31) is used for inference. These expressions can be derived from Eqs. (23), (24), (25), and (26) by substituting \(\delta \left( x\right) /\gamma\) for f(x) and \(1/\gamma\) for \(\sigma\). The final variationin the tobit likelihood consists of a combination of the reparametrization from [12] and a dynamic standard deviation estimated with a neural network. In this case, the network estimates \(\gamma\), which is equal to \(1/\sigma\).

$$\begin{aligned} L& = L_p+L_l+L_u \end{aligned}$$
(27)
$$\begin{aligned} L_p& = \sum _{i}\left[ \frac{1}{2}\left( {\vert \gamma \vert y}_i -\delta (x_i)\right) ^2-ln(\vert \gamma \vert +\varepsilon )\right] \end{aligned}$$
(28)
$$\begin{aligned} L_l& = -\sum _{j} l n\big (\Phi \left( \vert \gamma \vert c_l -\delta (x_j)\right) -\Phi \left( \vert \gamma \vert t_l-\delta (x_j)\right) +\varepsilon \big ) \end{aligned}$$
(29)
$$\begin{aligned} L_u& = -\sum _{k} ln\big (\Phi \left( \delta (x_k)-\vert \gamma \vert c_u\right) -\Phi \left( \delta (x_k)-\vert \gamma \vert t_u\right) +\varepsilon \big ) \end{aligned}$$
(30)
$$\begin{aligned} y^*& = \delta (x)/\vert \gamma \vert \nonumber \\ \sigma& = 1/\vert \gamma \vert \end{aligned}$$
(31)

4 Experimental setup

This section describes the experimental setup, such as datasets, neural network models, optimization techniques, model evaluation, experiments, and data visualizations.

4.1 Datasets

We used four datasets. Two were synthetically generated, and two consisted of real data. The synthetic datasets were univariate and were generated by adding noise to a function and then censoring it from both above and below. The chosen function was the pdf of the beta distribution [60] parametrized by \(\alpha =2\) and \(\beta =4\). The noise distribution was Gaussian, with mean \(\mu =0\) and standard deviation \(\sigma =0.3\). The upper and lower censoring thresholds were \(c_l=0.3\) and \(c_u=1.7\). We generated 10, 000 points for the training and testing sets and 1000 points for the validation set. Note that we did not sample from the beta distribution; instead, we sampled the input x from a uniform distribution between zero and one, which is the valid domain for the beta distribution, and the ground truth y was the beta distribution pdf of x plus random noise. We chose the beta distribution as the backbone of the datasets as it is nonlinear, skewed, and truncated from below by zero, since it is a probability. The addition of noise and censoring creates a more challenging problem. To test the losses in the case where the standard deviation is dynamically estimated by a neural network, we generated a heteroscedastic dataset in which we used the same beta pdf function, but where the noise depended on the value of the ground truth y. In this case, the noise sampled from the Gaussian distribution with mean \(\mu =0\) and standard deviation \(\sigma =0.3\) was multiplied by the ground truth: \(noise=N(\mu ,\sigma )\cdot y\). We refer to this dataset as the heteroscedastic dataset and to the other as the constant noise dataset.

Fig. 1
figure 1

The left column, a, c contains the training sets, and the right column, b, d contains the testing sets. The validation sets are not presented but are similar to the training sets. Note the effect of censoring on the training sets compared to the uncensored testing sets. The images a and b display the generated data sets having a constant noise, while c and d show the generated datasets having dynamic noise, also called heteroscedastic. Note how the samples disperse more for higher values of y in the heteroscedastic datasets as opposed to the constant noise datasets. The orange curves in (a), (b), (c), (d) represent the beta pdf function to which we applied the noise, and the blue dots represent the actual data samples

Fig. 2
figure 2

The left column a, c contains the training set, and the right column, b, d contains the testing set for the real datasets. The validation sets are not presented but are similar to the training sets. Note the effect of censoring on the training sets compared to the uncensored testing sets. The images a and b correspond to the Beijing PM2.5 dataset, and the images c and d correspond to the Seoul bike sharing dataset. In all visualizations, the input features are made unidimensional through PCA

The first real dataset was the Beijing PM2.5 dataset [61, 62] which is available from the UCI Machine Learning Repository [63]. PM2.5 is particulate matter smaller than 2.5 micrometers and is a marker for air pollution [64]. This dataset is multivariate and contains meteorological data and hourly PM2.5 measurements taken between 2010 and 2014. We built the training set with data from 2010 to 2012, the validation set from 2013, and the testing set from 2014. Temporal data included the year, month, day, and hour, but to avoid overfitting, we dropped the day and year, which were too specific and could have led to data memorization. The ground truths were censored at 75 from below and at 300 from above. The ground truths were also naturally truncated from below at zero, since negative PM2.5 values do not exist.

The second real dataset was the Seoul bike sharing demand dataset [65, 66], which is available from the UCI Machine Learning Repository [63] and Mendeley Data [67]. This dataset is also multivariate, having as input the hour, the season, if it was a holiday, if it was a working day, and weather data. The predicted variable is the count of rented bikes, and we censored it from below at 200 and from above at 2000. The amount of rented bikes is naturally truncated from below by zero. The training, validation, and testing sets are picked randomly and have ratios of 3/5, 1/5, and 1/5.

One hot encoding was applied to all categorical features. The continuous input features and ground truths were standardized to give zero mean and unitary standard deviation across both synthetic and real datasets. Figures 1 and 2 show the training and testing sets for the synthetic and real datasets used throughout the experiments.

4.2 Network model

The networks used for censored regression were similar for all four datasets and consisted of three fully connected layers with batch normalization in between. The input layers had a size of one for the univariate synthetic generated datasets and 46 and 40 for the real multivariate datasets. For all datasets, the hidden layers had a size of ten. In addition, for the real ones, the input and hidden layers were followed by dropout with a 20% probability. To estimate the standard deviation of the error in the heteroscedastic losses, the networks had two layers of similar sizes and did not apply dropout. Table 1 summarizes the network structure used for regression, while Table 2 summarizes the network structure used for standard deviation estimation, trained with the heteroscedastic losses. We found through unautomated experimentation that the structure proposed performs well for most of the datasets used in this article.

Table 1 Network structure—for regression
Table 2 Network structure—for standard deviation estimation

4.3 Model optimization and hyperparameters

Network optimization was performed in PyTorch [68] with super-convergence [69]. The hyperparameters were chosen using grid search, and the values for each experiment are listed in Appendix C.

4.4 Model evaluation

Three datasets were used for training, validation, and testing. Here, validation refers to evaluating the model’s performance during training for early stopping. The training and validation phases were repeated multiple times in a grid search across multiple hyperparameters. The best model was selected and evaluated on the testing dataset; we refer to this final phase as testing, and it should not be confused with validation.

The missing ground truths in the censored domain posed challenges in determining the exact performance of the model. As a solution, we used uncensored datasets and censored them, which offered the advantage of knowing the ground truths in both the censored and uncensored domains. This enabled us to provide rigorous measurements across the censored samples, which would not have been possible otherwise. Note that this was done only for experimental purposes; in practice, the ground truths for the censored domain are never known. To ensure a realistic and conservative approach, we censored the training and validation datasets in our experiments and left only the testing dataset uncensored. In other words, the ground truths for the censored domain were unknown for the training and validation sets in order to make the experiments realistic and were known only for the test set to enable precise measurements. The use of censored datasets for training and validation is essential to obtain unbiased results. When performing measurements in a real-world scenario where the censored ground truths are unknown, two common-sense ideas can be followed: the evaluation should be precise for the uncensored data, and the model should not output values in the uncensored domain for censored data. Both of these conditions are satisfied if we truncate the model output to lie in the uncensored domain and then use the truncated output to calculate the evaluation metrics, as shown in Eqs. (32), (33), and (34). We followed this approach in the current work for training early stopping and model selection (in the training and validation phases). We did not use output truncation in testing because we had access to the censored ground truths at that stage. Although we could have opted to truncate the model output only for the censored samples rather than for all samples, this would have introduced a bias, as explained toward the end of Sect. 3.1.

$$\begin{aligned} y^*& = f(x) \end{aligned}$$
(32)
$$\begin{aligned} y\prime& = {\left\{ \begin{array}{ll} c_l, & \quad \text {if}\ y^*\le c_l\\ y^*, & \quad \text {if}\ c_l<y^*<c_u\\ c_u, & \quad \text {if}\ y^*\ge c_{u} \end{array}\right. } \end{aligned}$$
(33)
$$\begin{aligned} m& = metric(y\prime ,y);\ c_l<y<c_u \end{aligned}$$
(34)

where

  • \(y^*\) is the estimated true value; it is allowed to have values also in the censoring interval \(({-\infty ,\ c}_l] \cup [c_u,\ \infty )\);

  • \(y'\) is the censored value of the estimated \(y^*\); it can only contain values inside the uncensored domain \([c_l,c_u]\);

  • y is the ground truth, which is also censored and lies in the interval \([c_l,c_u]\);

  • m is the calculated metric; in our experiments, we used the mean absolute error and the adjusted \(R^2\) score.

Neural networks are known to be susceptible to overfitting and may output extreme values in the censored domain; therefore, the practitioners should perform some checks. This can be achieved by sorting the output and inspecting the most significant n values. Another option is visualization: a plot can be created in which the vertical axis shows the model output and the horizontal axis the model input. The censoring thresholds should be marked as horizontal lines on the plot. If truncation is present, it can be displayed in a similar way. If the input is multidimensional, as is usually the case, principal component analysis (PCA) can be used to transform the data to a single dimension. We are not concerned with the limitations of PCA, since it is used only for visualization. Figure 3 shows an example, in which the upper two red lines are the censoring thresholds, and the lower red line is the lower truncation limit. Whether a value is considered extreme needs to be assessed by applying domain-specific knowledge. If the model outputs extreme values in the censored domain, it might be a sign of overfitting and can be corrected through regularization techniques. Another option to offset extreme values is to set truncation limits as described in Sects. 3.1 and 3.2. We used two metrics: the mean absolute error (MAE), shown in Eq. (35), and the adjusted coefficient of determination \(R_{adj}^2\) shown in (37).

$$\begin{aligned} MAE& = \sum _{i}{\vert y_i-{{\hat{y}}}_i \vert } \end{aligned}$$
(35)
$$\begin{aligned} R^2& = 1-\frac{\sum _{i}{\left( y_i-{{\hat{y}}}_i\right) }^2}{\sum _{i}{\left( y_i-{\bar{y}}\right) }^2} \end{aligned}$$
(36)
$$\begin{aligned} R_{adj}^2& = 1-\frac{\left( 1-R^2\right) (n-1)}{n-k-1} \end{aligned}$$
(37)

where

  • y is the ground truth;

  • \({\hat{y}}\) is the model prediction;

  • \({\bar{y}}\) is the ground truth mean;

  • n is the number of samples in the dataset;

  • k is the number of regressors (the number of dimensions of the input).

Fig. 3
figure 3

Example of plot used to inspect if the model outputs extreme values. The blue dots represent the model predictions, the upper two red lines represent the censoring thresholds, and the lowest red line represents the lower truncation

4.5 Ablation experiments

We perform ablation experiments on the losses proposed in Sect. 3. The full version of each loss has three parts: one that deals with uncensored data, one that deals with censoring, and one that deals with truncation. In each ablation experiment, we test three cases: the complete loss, the loss without truncation, and the loss without both truncation and censoring. One limitation of our experiments is that we do not use upper truncation, because our datasets have only lower truncation. In practice, it is hard to find datasets with truncation from both below and above at the same time, and truncation from below by zero is probably the most common. We also use the linear tobit model, which removes the contribution from the deep network. For clarity, we rewrite the losses described in Sect. 3, display them alongside the ablation versions, and drop the logic related to upper truncation to reflect the actual experimental conditions. These equations are found in Appendix A.

4.6 Visualizations

In addition to numerical metrics, visualization is an expressive way of investigating model outcomes. Although we visualized the predictions of all loss versions used in our experiments on both synthetic and real datasets, we present only images for the tobit-based loss with a fixed standard deviation due to space limitations. The rest of the images are included in Appendix B. We created two types of visualization, for the univariate synthetic, and multivariate real data. The univariate constant noise synthetic dataset, shown in Fig. 1, was created by adding noise to a nonlinear function, as explained in Sect. 4.1. The ground truths were censored for the training and validation sets but not for the testing set. A performant model should output values close to the actual nonlinear function in both the uncensored and censored domains while ignoring the noise. To visualize this effect, we plotted the true uncensored nonlinear function without the added noise versus the model predictions. We also plotted the actual validation samples to highlight the contrast between the censored noisy data and the true function. Figure 4 shows the five ablations for the tobit log-likelihood loss with fixed standard deviation on the constant noise synthetic dataset. As can be observed, the contribution of the censoring component is essential for regression in the censored domain. The truncation also had a positive impact by preventing values that were too low, toward the right tail of the distribution.

The visualizations for the multivariate Beijing PM2.5 dataset [61, 62] are shown in Fig. 5 and are similar to Fig. 3, which was used to assess whether the model output extreme values; however, the ground truths are also plotted along with the model predictions. The ground truths correspond to the uncensored testing set, as described in Sect. 4.4. The horizontal red lines are used to mark the censoring and truncation thresholds. The tobit-based loss with truncation was the best model. Removing the truncation component from the loss resulted in invalid predictions that were much lower than the truncation limit. If the censoring component was removed, the model was unable to regress values in the censored domain. The last two ablations replaced the deep network with a linear model, which performed less well. Truncation also proved to be useful in the linear model. Similar patterns were seen for the other losses. The associated visualizations are given in Appendix B.

Fig. 4
figure 4

Visualizations of ablation experiments for the tobit log-likelihood loss for the constant noise synthetic dataset. The error standard deviation of the tobit log-likelihood is a fixed scalar learned during training via gradient descent. The visualizations for the other losses are presented in Appendix B. The image in a shows the full version of the loss, and it can be seen that the model prediction, shown in green, closely approximates the actual generating function, shown in blue. The orange dots represent the validation set, which was censored in the same way as the training set. The image in b shows the loss without truncation, which makes the model predict values that are too low toward the right tail of the function. The image in c shows the case when both truncation and censoring are removed; as expected, in this case, the model is unable to predict values outside the uncensored domain. In the last images in d and e, the deep network is replaced with a linear model that is not suitable for this nonlinear problem

Fig. 5
figure 5

Visualizations of ablation experiments for the tobit log-likelihood loss for the multivariate Beijing PM2.5 dataset [61, 62]. The error standard deviation of the tobit log-likelihood is a fixed scalar learned during training through gradient descent. The ground truths are shown as blue dots, and the model predictions are in orange. The uppermost red horizontal lines show the higher and lower censoring bounds, while the lowermost red line is the lower truncation limit. The best model is shown in (a), where the complete loss (including truncation) is used; removing the truncation component resulted in the worst model, as shown in (b). Truncation is very important for the PM2.5 dataset, as many samples are between the lower censoring bound and the lower truncation limit. The lower truncation limit, close to − 2, is the standardized zero value. The image in c shows the case when both truncation and censoring are removed; as expected, in this case, the model was unable to predict values outside the uncensored domain. The deep network was replaced with a linear model in the images (d) and (e), and a degradation in performance is seen. Truncation also helped in the case of the linear model

5 Results

In this section, we validate the losses developed in Sects. 3.1 and 3.2 through ablation experiments and perform a comparison.

5.1 Ablation experiments results

The numerical results of the ablation experiments are presented in Table 3. In table, AE stands for absolute error and \(R_{adj}^2\) for the adjusted coefficient of determination.

Each loss has three components: one that deals with the exact values, one with censoring, and one with truncation. In the ablation experiments, the components referring to censoring and truncation are disabled to verify their contribution to the final results. Therefore, the proposed losses (marked in Table 3 as “Complete”) are valid if they provide better results compared to their ablated versions.

In Table 3, the complete censored mean absolute error corresponds to the loss proposed in our paper in Eq. 9, which handles both censoring and truncation. If the truncation component is removed, the loss is similar to the ones used in [4, 8, 21] and presented in Eq. 8. If both truncation and censoring are eliminated, the loss corresponds to the mean absolute error (MAE).

Similarly, the complete version of the censored mean squared error corresponds to the loss proposed in our work in Eq. 10, which handles both censoring and truncation. If the truncation is removed, the loss corresponds to the one from [7, 15], which is described in section 2.2. If both truncation and censoring are removed, the loss will be the mean squared error (MSE).

The last two losses from Table 3 are based on the tobit likelihood. Their complete version corresponds to the losses we proposed in Eqs. 23 and 27. The ablation experiment where both censoring and truncation are removed corresponds to the log-likelihood loss. The losses labeled as “Complete Linear” and “No Trunc. Linear” refer to the classical tobit model with and without truncation. Those are called linear because the classical tobit model relies on linear regression.

The loss formulas used in the ablation experiments are summarized in Appendix A.

The full versions of the losses containing both censoring and truncation outperformed the ablations in almost all cases. The results also show that truncation is essential, and sometimes, using losses with censoring but without truncation results in invalid models that output extreme values. Sometimes those models are worse than those trained classically, without any censoring handling. This phenomenon can be better observed in the images presented in Appendix B.

Table 3 Ablation experiments

5.2 Loss comparison

Table 4 compares the best losses, which had both censoring and truncation. On the constant noise synthetic and PM2.5 datasets, the best losses were the tobit log-likelihood with a fixed error standard deviation and the reparametrized version with a fixed error standard deviation. These two versions also gave some of the best results on the Seoul bike sharing dataset and performed decently on the synthetic heteroscedastic set. Therefore, the tobit log-likelihood with a fixed error standard deviation and the reparametrized version are good starting choices. Losses trained with heteroscedastic errors did not show better outcomes on the PM2.5 dataset. The reparametrization version with dynamic deviation showed promise, as it approached the other losses’ performance and surpassed the fixed deviation versions on the heteroscedastic synthetic dataset. The simple tobit log-likelihood with dynamic deviation was very numerically unstable and required many trials to find a working set of parameters. If a practitioner is interested in the dynamic deviation versions of the loss, we recommend the reparametrization, which is more numerically stable. Overall, the losses trained with a fixed \(\sigma\) performed better; the only case in which the dynamic \(\sigma\) showed an improvement was on the synthetic heteroscedastic dataset and in combination with the reparametrization. The reparametrization seemed to work better with the dynamic \(\sigma\) in most cases.

Table 4 Loss comparison

6 Discussion

The loss functions examined here can be split into two categories: those based on distances, such as the mean absolute error and mean squared error, and those based on likelihood, including all tobit variations. We showed how to handle truncation together with censored regression. Truncation can prevent extreme values. Hence, it can also serve as regularization. One question that may arise is which loss function is the best suited for censored regression. We believe each type of loss may be appropriate in a particular scenario. The mean absolute error and the mean squared error showed a slight decrease in performance compared to the tobit-based losses, but these have one great advantage in that they are very simple. Tobit-based losses complicate the data preprocessing stage because it is necessary to determine for each sample whether it is censored from above, below, or neither. This is not required for the mean absolute error or mean squared error. The standard deviation in the tobit models introduces additional complexity, as it is a separate parameter that should be optimized jointly with the network. In our experiments, we used the same learning rate to update the network weights and the deviation, but we could have chosen to use different learning rates. In the reparametrized versions, the output must be rescaled by dividing with \(\vert \gamma \vert\), the inverse of the standard deviation of the error. Although the standard deviation of the error can be informative, the mean absolute error and mean squared error are simpler, because they do not refer to it. Given the need for practicality and speed of development, the mean absolute error and mean squared error should be tried first, and only then should the tobit losses be applied.

Of the tobit-based losses, those with a fixed learnable standard deviation for the error seem to perform better and are likely more appropriate in most cases. Choosing to treat the deviation as a hyperparameter and finding it through experimentation might be an option; however, we did not explore this route. A comparison of the reparametrized versus simple tobit losses showed that these performed almost the same. The simple tobit loss is more straightforward to understand and does not need output rescaling, and might therefore be more practical, but reparametrization also has certain advantages; when used with a dynamic deviation, the reparametrization performed decently on the PM2.5 dataset and surpassed the fixed deviation version on the heteroscedastic synthetic dataset. In contrast, the plain tobit log-likelihood with dynamic deviation was very numerically unstable.

The losses proposed in this article must be coupled with the appropriate evaluation methods, as discussed in Sect. 4.4 and given in Eqs. (32), (33), and (34). We practically censored the network output when evaluating the metrics to avoid penalizing the values in the censored domain. This could be made more complex by including penalties for values that fall within an invalid domain, such as extreme results.

The ideas presented here open up several new avenues for research. In the current article, we looked at multi-layer, fully connected neural networks; however, we acknowledge that the models used here were small compared to other deep models. We did not investigate whether these approaches were suitable for convolutional and for recursive networks. We also looked at only the simplest type of tobit model, the type one model [2]. Furthermore, the tobit model relies on the Gaussian distribution, although generalizations to other distributions might be possible. For example, in survival analysis, several works have used Weibull-based losses, as described in Sect. 2.6. Another approach that was popular among the works discussed in Sect. 2.6 and that was not explored in this paper was multi-task learning, in which cross-entropy is used to predict whether a value is above or below a given threshold and a separate loss is applied to handle uncensored values. Cross-entropy could also be combined with one of the losses described in this article for multi-task learning.

7 Conclusions

Censored data are essential to many real-world applications. At the same time, deep learning has achieved outstanding results in most areas of machine learning; by combining these two approaches, new possibilities can be uncovered. Theoretically, most applications with censored data that have already leveraged the tobit model could benefit from the power of deep learning without sacrificing the advantages offered by the tobit. Moreover, as noted in this article and in the other works reviewed in Sect. 2, other classic deep learning losses, such as the mean absolute error and mean squared error, can be adapted to censored data with minimal effort.