1 Introduction

Mixture models are a common choice to model the joint distribution of several subpopulations or subclasses on the basis of data from the pooled population where the subclass memberships are not observed (for an introduction see McLachlan and Peel 2004). Each subclass is assumed to follow a parametric probability distribution such that the pooled observations are from a mixture of these distributions. Many applications of mixture models aim at estimating the distributions of the latent subclasses and identifying subclass memberships of the observations (McLachlan et al. 2019). Mixture models have also been used in machine learning, e.g., for clustering (Viroli and McLachlan 2019), to build generative models for images (Van den Oord and Schrauwen 2014) or as a hybrid approach for unsupervised outlier detection (Zong et al. 2018).

Different kinds of mixture models are used depending on the available data structure and the parametric model assumed for each subpopulation or subclass. In the following, we assume a supervised learning task in which regression models are used to describe the subpopulations. This leads to mixture regression models, which define a mixture of (conditional) models for the outcome of interest, where the mixture components are unknown and thus treated as latent variables. This model class is also referred to as mixtures of experts (Gormley and Frühwirth-Schnatter 2019). Mixtures of experts allow covariates to enter both the models for the mixture components and the model for the mixture weights. In the following, we consider a mixture of experts model in which the regression models relate not only the mean parameter but also other distributional parameters to the covariates (or features). In addition, we assume that the subpopulation sizes vary with the covariates (or features). This leads to the class of mixture of experts distributional regression models.

1.1 Mixture models and their estimation

As for classical statistical regression models, the goal of mixtures of regressions or mixture regression models is to describe the distribution of a response (or outcome) conditional on a set of covariates (or features). Mixtures of regressions were first introduced by Quandt (1958) under the term switching regimes, where only two-component mixtures of linear regression models were considered, i.e., the number of subclasses was fixed to two. This was extended to general mixtures of linear regression models by DeSarbo and Cron (1988), who referred to this approach as a model-based version of clusterwise regression (Späth 1979). The extension to mixtures of generalized linear regression models was proposed in Wedel and DeSarbo (1995). Aiming for a setting beyond mean regression (Kneib 2013), a distributional regression framework can be used, such as generalized additive models for location, scale and shape (GAMLSS; Rigby and Stasinopoulos 2005), which also allows for nonlinear smooth relationships between covariates and the distributional parameters of interest (Stasinopoulos et al. 2018).

Mixture models can be estimated using various techniques, with the expectation–maximization (EM) algorithm based on the maximum-likelihood principle being the most prominent one. Other approaches include Bayesian methods such as MCMC algorithms with data augmentation (Diebolt and Robert 1994). An alternative way of model specification and estimation was proposed by Bishop (1994), who introduced mixture density networks (MDNs). MDNs use a mixture of experts distributional regression specification, but the relationship between the covariates (or features) and the mixture weights or the distributional regression models is learned by a neural network. MDNs are trained using highly optimized stochastic gradient descent (SGD) routines with adaptive learning rates and momentum. These procedures have proven very effective for optimizing large neural networks with millions of parameters and complex model structures. While MDNs are often advantageous in terms of prediction performance, they lack interpretability as the relationship between inputs (features) and distribution parameters is modeled by a deep neural network.

1.2 Our contribution

1.2.1 Novel modeling approach

In this work, we combine the ideas of interpretable mixtures of regression models and MDNs to allow for mixture of experts distributional regression models in a very general setup. In particular, this approach enables modeling mixtures of subpopulations where the distribution of every subpopulation is modeled using a distributional regression. The predictor of every distribution parameter in every subpopulation can be defined by linear effects or (tensor product) splines, thereby allowing for complex relationships between the features and not only the distributions’ means but also other distribution characteristics such as scale or skewness. Furthermore, the subpopulation sizes may vary flexibly with the features, again using a combination of linear effects and (tensor product) splines.

1.2.2 Robust estimation

Estimating mixture regression models within a maximum-likelihood framework using the EM algorithm works well for smaller problems. However, higher-dimensional settings, where the number of parameters is similar to or exceeds the number of observations, are often infeasible to estimate. In fact, the convergence and stability of classical algorithms deteriorate as model complexity increases. Maximum-likelihood estimation of distributional regression models by itself induces a non-convex optimization problem in all but special cases. Thus, extending mixtures of mean regression models to mixtures of distributional regression models further complicates the optimization.

In this work, inspired by MDNs, we suggest and analyze the usage of stochastic first-order optimization techniques using adaptive learning rate schedulers for (regularized) maximum-likelihood estimation. Our results show that this approach is as reliable as estimation via classical approaches in many different settings and even provides estimation results in complex cases where classical approaches consistently fail.

1.2.3 Flexible and scalable implementation

Common implementations of EM optimization routines are limited in their flexibility to specify a mixture of (many) potentially different distributions and in general focus on the case where all components have a parametric distribution from the same distributional family. Another (computational) limitation of existing approaches arises for large amounts of data (observations), as methods usually scale at least quadratically in the number of observations. SGD optimizers from the field of deep learning, on the other hand, work on mini-batches of data, which allows for large dataset applications and generic use across model classes. We take advantage of this flexibility and scalability and implement the fitting of the proposed model class in mixdistreg, an R (R Core Team 2022) software package based on the R package deepregression (Rügamer et al. 2023) that allows for the definition of mixtures with components from many different distributional families, estimation in high-dimensional and large sample size settings and robust optimization based on the deep learning platform TensorFlow (Abadi et al. 2015). In order to use TensorFlow within R, the package reticulate (Ushey et al. 2022) is used to connect R to Python (Van Rossum and Drake Jr 1995).

1.2.4 Summary of our approach and overview on the paper structure

Our framework unites neural density networks (Magdon-Ismail and Atiya 1998) with (distributional) regression approaches, extends MDNs by incorporating penalized smooth effects and comprises various frameworks proposed in the statistical community, such as Leisch (2004), Grün and Leisch (2007) and Stasinopoulos and Rigby (2007), to estimate mixtures of linear, generalized linear, generalized additive or distributional regression models. In contrast to many existing approaches, our framework further allows estimating the mixture weights themselves on the basis of an additive structured predictor. We refer to this combination of the model class of mixtures of experts distributional regression with structured additive predictors and the estimation using neural network software as the neural mixture of experts distributional regression (NMDR) approach.

The remainder of this paper is structured as follows. In Sect. 2, we present our model definition. In Sect. 3, we introduce the architecture for SGD-based optimization in neural networks and discuss penalized estimation approaches including mixture sparsification. We then demonstrate the framework’s properties using extensive numerical experiments in Sect. 4 and its application to real-world data in Sect. 5. We conclude with a discussion in Sect. 6.

2 Methodology

Our goal is to model the conditional distribution of \({Y} \mid {\varvec{x}}\), where Y is the univariate outcome (or response) of interest. We assume that \(Y \mid {\varvec{x}} \sim \mathcal {F}\), where \(\mathcal {F}\) is a mixture of parametric distributions \(\mathcal {F}_m, m \in \{ 1,\ldots ,M \} =: \mathcal {M}\). These distributions, in turn, depend on unknown parameters \({\varvec{\theta}}_m({\varvec{x}}) \in \mathbb {R}^{k_m}\) that are influenced by a set of features (or covariates) \({\varvec{x}} \in \mathbb {R}^p\). Furthermore, the nonnegative mixture weights \(\pi _m({\varvec{x}})\), \(m=1,\ldots ,M\), are also allowed to depend on the features \({\varvec{x}} \in \mathbb {R}^p\) and satisfy \(\sum _{m=1}^M \pi _m({\varvec{x}}) = 1\).

2.1 Model definition

For an observation y in (a suitable subset of) \(\mathbb {R}\) constituting the support of the conditional distribution of \(Y\mid {\varvec{x}}\), we define the conditional density by

$$\begin{aligned} f_{Y\mid {\varvec{x}}}(y \mid {\varvec{\vartheta}}({\varvec{x}})) = \sum _{m=1}^M \pi _m ({\varvec{x}}) f_m(y\mid {\varvec{\theta}}_m({\varvec{x}})), \end{aligned}$$
(1)

i.e., by a mixture of density functions \(f_m\) of the distributions \(\mathcal {F}_m\). Each component density \(f_m\) has its own \(k_m\) distribution parameters \({\varvec{\theta}}_m = (\theta _{m,1}, \ldots , \theta _{m,k_m})^\top\). \(\pi _m \in [0,1]\) are the mixture weights with \(\sum _{m=1}^M \pi _m = 1\). The vector \({\varvec{\vartheta}}= ({\varvec{\theta}}^\top , {\varvec{\pi}}^\top )^\top \in \mathbb {R}^{K+M}\) with \(K=\sum _{m=1}^M k_m\) comprises all distribution parameters \({\varvec{\theta}} = ({\varvec{\theta}}_1, \ldots , {\varvec{\theta}}_M)^\top\) and all mixture weights \({\varvec{\pi}} = (\pi _1,\ldots ,\pi _M)^\top\).
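To make the notation concrete, the following minimal R sketch evaluates the mixture density in (1) for parameters and weights that have already been evaluated at a given feature vector; the function name and arguments are purely illustrative and not part of any package.

```r
# Minimal sketch of the mixture density in (1); all parameters are assumed to
# have been evaluated at a given feature vector x already.
dmixture <- function(y, pi, theta, dens) {
  # pi:    vector of M mixture weights (summing to 1)
  # theta: list of M parameter lists, one per component
  # dens:  list of M density functions f_m(y | theta_m)
  stopifnot(length(pi) == length(theta), length(pi) == length(dens))
  out <- 0
  for (m in seq_along(pi)) {
    out <- out + pi[m] * do.call(dens[[m]], c(list(y), theta[[m]]))
  }
  out
}

# Example: two Gaussian components with weights 0.3 and 0.7
dmixture(
  y     = c(-1, 0, 2),
  pi    = c(0.3, 0.7),
  theta = list(list(mean = 0, sd = 1), list(mean = 2, sd = 0.5)),
  dens  = list(dnorm, dnorm)
)
```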

Each of the parameters \(\vartheta _j\) in \({\varvec{\vartheta}}\) is assumed to depend on the features \({\varvec{x}}\) through an additive predictor \(\eta _j, j=1,\ldots ,K+M\). For the distribution parameters \({\varvec{\theta}}\), a monotonic and differentiable function \(h_j, j=1,\ldots ,K,\) is assumed to map the additive structured predictor to the distribution parameter, i.e., \(\vartheta _j = h_j(\eta _j({\varvec{x}}))\). The parameter-free transformation function \(h_j\) ensures the correct domain of each \(\vartheta _j\). For example, if \(\vartheta _j\) represents a scale parameter, the function \(h_j\) ensures that \(\vartheta _j\) is positive while the additive predictor \(\eta _j\) may take arbitrary values in \(\mathbb {R}\).

For the mixture weights \({\varvec{\pi}}\), a single monotonic and differentiable function \(h_{K+1}\) is assumed to map the M additive predictors to the \((M-1)\)-dimensional simplex, i.e., \(h_{K+1}\) maps \(\mathbb {R}^M \rightarrow [0,1]^M\) under the condition that the weights sum to 1. This links the last M additive predictors \({\varvec{\eta}}_{\pi }:= (\eta _{K+1},\ldots ,\eta _{K+M})^{\top }\) to the set of mixture weights \({\varvec{\pi}}\). The most common choice for \(h_{K+1}\) is the softmax function

$$\begin{aligned} h_{K+1}({\varvec{\eta}}_{\pi }) = (\text {softmax}_1({\varvec{\eta}}_{\pi }), \ldots , \text {softmax}_M({\varvec{\eta}}_{\pi })) \end{aligned}$$

with

$$\begin{aligned} \text {softmax}_j({\varvec{\eta}}) = \frac{\exp (\eta _j)}{\sum _{l=1}^M \exp (\eta _l)}. \end{aligned}$$

This implies

$$\begin{aligned} \pi _m({\varvec{x}}_i) = \frac{\exp (\eta _{K+m}({\varvec{x}}_i))}{\sum _{l=1}^M \exp (\eta _{K+l}({\varvec{x}}_i))}, \quad \text {for}\quad m \in \mathcal {M}. \end{aligned}$$
(2)
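A minimal R sketch of the softmax mapping in (2) follows; subtracting the maximum before exponentiating is a standard numerical safeguard and an implementation detail not discussed above.

```r
# Softmax mapping from the M additive predictors for the weights to the
# (M-1)-dimensional simplex, cf. (2).
softmax <- function(eta) {
  e <- exp(eta - max(eta))  # shift by max(eta) to avoid numerical overflow
  e / sum(e)
}

softmax(c(0.5, -1.2, 2.0))  # three mixture weights summing to 1
```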

Effects in the additive predictors \(\eta _{K+l}, l=1,\ldots ,M\), in (2) are not identifiable without further constraints. We do not enforce any constraints during model training. Model interpretation is still possible in relative terms (e.g., using a log-odds interpretation). However, some constraints would need to be imposed if identifiable regression coefficients for the additive predictors are to be obtained.

For all parameters in \({\varvec{\vartheta}}\), the additive structured predictors \(\eta _j({\varvec{x}})\) ensure interpretability of the relationship between the parameter \(\vartheta _j\) and the covariates. For example, if \(\eta _{K+M}\) is a linear model, i.e., \(\eta _{K+M}({\varvec{x}}) = {\varvec{x}}^\top {\varvec{\beta}}\), the regression coefficients \({\varvec{\beta}}\) can be interpreted as linear contributions of each of the features to the logits of the mixture weight for the last mixture component M.

2.2 Additive predictor structure

The model (1) relates all model parameters \({\varvec{\vartheta}}\) to features \({\varvec{x}}\) through additive predictors \(\eta _j, j=1,\ldots ,K+M\). As different densities \(f_m\) and also different parameters \({\varvec{\theta}}_m\) have potentially different influences on the conditional distribution of \(Y \mid {\varvec{x}}\), every parameter in \({\varvec{\theta}}\) is defined by its own additive structured predictor \(\eta _j\). Here we assume the additive predictors to have the following structure:

$$\begin{aligned} \eta _j = \beta _{0,j} + {\varvec{x}}_{\mathcal {L}(j)}^\top {\varvec{\beta}}_j + \sum _{l\in \mathcal {S}(j)} \phi _{l,j}({\varvec{x}}), \end{aligned}$$
(3)

where \(\beta _{0,j}\) corresponds to the model intercept, \({\varvec{\beta}}_j\) are the linear effects of pre-defined covariates \({\varvec{x}}_{\mathcal {L}(j)}\), with \(\mathcal {L}(j) \subseteq \{1,\ldots ,p\}\) being a (possibly empty) subset of all possible predictors, and \(\phi _{l,j}\) are nonlinear smooth functions of one or more covariates in \({\varvec{x}}\), with \(\mathcal {S}(j)\) being a (potentially empty) set of indices indicating the covariates with nonlinear predictor effects. We assume that every \(\phi _{l,j}({\varvec{x}})\) can be represented by (tensor product) basis functions \(B_{l,j,o}, o=1,\ldots ,O\), taking one or several columns of \({\varvec{x}}\) as input and mapping these onto the space spanned by the basis functions. Denoting \({\varvec{z}}_{l,j}:= \varvec{B}_{l,j}({\varvec{x}}) = (B_{l,j,1}({\varvec{x}}),\ldots ,B_{l,j,O}({\varvec{x}}))^\top \in \mathbb {R}^O\), the nonlinear smooth terms can be written as \(\phi _{l,j}({\varvec{x}}) = {\varvec{z}}_{l,j}^\top {\varvec{\gamma}}_{l,j}\), where \({\varvec{\gamma}}_{l,j} \in \mathbb {R}^O\) are the corresponding basis coefficients for \({\varvec{z}}_{l,j}\).
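As an illustration of the predictor structure (3), the following R sketch builds one additive predictor from an intercept, a linear effect and a basis-expanded smooth term; the B-spline basis from the splines package and the arbitrary coefficient values are stand-ins for whatever basis and estimates are actually used.

```r
set.seed(1)
n  <- 200
x1 <- runif(n); x2 <- runif(n)

# Linear part: intercept and linear effect of x1
beta0 <- 0.5; beta1 <- -1

# Nonlinear smooth part for x2, represented by a B-spline basis Z and
# basis coefficients gamma (chosen arbitrarily here for illustration)
Z     <- splines::bs(x2, df = 9)   # n x 9 basis matrix z_{l,j}
gamma <- runif(ncol(Z), -1, 1)

# Additive predictor eta_j as in (3)
eta <- beta0 + x1 * beta1 + as.vector(Z %*% gamma)
```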

In principle, such a flexible specification may also be used for \(\eta _{K+1}, \ldots , \eta _{K+M}\), i.e., for the additive structured predictors related to \({\varvec{\pi}}\). While this is technically possible (using a different subnetwork for every \(\pi _m\)), these predictors are usually assumed to share one and the same additive structure, i.e., \({\varvec{x}}_{\mathcal {L}(j)}\) and \({\varvec{z}}_{l,j}\) with \(l \in \mathcal {S}(j)\) are the same for all j related to the M mixture weights. This further enables a straightforward interpretation of the predictor–mixture weight relationship, as sharing one additive predictor across all \(\pi _m\)s resembles a multinomial logistic regression model. We will thus assume the same specification for all these predictors. We summarize all model coefficients of the linear terms by \({\varvec{\beta}} = (\beta _{0,1}, {\varvec{\beta}}_1^\top , \beta _{0,2}, {\varvec{\beta}}_2^\top , \ldots , \beta _{0,K+M}, {\varvec{\beta}}_{K+M}^\top )^\top\) and all model coefficients for representing the nonlinear smooth functions by \({\varvec{\gamma}} = (\gamma _{0,1}, {\varvec{\gamma}}_1^\top , \gamma _{0,2}, {\varvec{\gamma}}_2^\top , \ldots , \gamma _{0,K+M}, {\varvec{\gamma}}_{K+M}^\top )^\top\). In the following, we summarize these two parameter vectors as \({\varvec{\psi}} = ({\varvec{\beta}}^\top , {\varvec{\gamma}}^\top )^\top\).

2.3 Model log likelihood

Based on model (1) and the structures imposed on the predictors in (3), the negative log likelihood of the model parameters \({\varvec{\psi}}\) for n independent observations \({\varvec{y}} = (y_1, \ldots , y_n)^\top \in \mathbb {R}^n\) and their corresponding n observed feature vectors \({\varvec{x}}:= ({\varvec{x}}^\top _1, \ldots , {\varvec{x}}^\top _n)^\top \in \mathbb {R}^{n \times p}\) is given by the sum of the negative log-likelihood contributions \(\ell _i({\varvec{\psi}})\) of each observation \(i=1,\ldots ,n\):

$$\begin{aligned} \ell ({\varvec{\psi}}) := \sum _{i=1}^n \ell _i({\varvec{\psi}}) = -\sum _{i=1}^n \log \left\{ \sum _{m=1}^M \pi _m({\varvec{x}}_i) f_m(y_i\mid {\varvec{\theta}}_m({\varvec{x}}_i)) \right\} . \end{aligned}$$
(4)

Using the log-sum-exp (LSE) function, the negative log likelihood can be rewritten in terms of the logarithms of the mixture weights \(\pi _m\) and of the component densities \(f_m\):

$$\begin{aligned} -\sum _{i=1}^n \log \left\{ \sum _{m=1}^M \exp \left[ \log \pi _m({\varvec{x}}_i) + \log f_m(y_i\mid {\varvec{\theta}}_m({\varvec{x}}_i)) \right] \right\} . \end{aligned}$$
(5)

In practice, formulation (5) is often preferred over (4) as it is less affected by under- or overflow problems.
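A minimal R sketch of (5) for Gaussian components, using a hand-rolled log-sum-exp; the matrices of weights and parameters are assumed to already hold the (possibly feature-dependent) values \(\pi _m({\varvec{x}}_i)\) and \({\varvec{\theta}}_m({\varvec{x}}_i)\).

```r
logsumexp <- function(v) {   # numerically stable log(sum(exp(v)))
  m <- max(v)
  m + log(sum(exp(v - m)))
}

# Negative log likelihood (5) for a Gaussian mixture; pi, mu and sd are
# n x M matrices of (possibly feature-dependent) weights and parameters.
neg_loglik_lse <- function(y, pi, mu, sd) {
  -sum(vapply(seq_along(y), function(i) {
    logsumexp(log(pi[i, ]) + dnorm(y[i], mu[i, ], sd[i, ], log = TRUE))
  }, numeric(1)))
}

# Example with n = 3 observations and M = 2 components, constant weights
y      <- c(-0.2, 1.5, 3.0)
pi_mat <- matrix(c(0.4, 0.6), nrow = 3, ncol = 2, byrow = TRUE)
mu_mat <- matrix(c(0, 3),     nrow = 3, ncol = 2, byrow = TRUE)
sd_mat <- matrix(c(1, 0.5),   nrow = 3, ncol = 2, byrow = TRUE)
neg_loglik_lse(y, pi_mat, mu_mat, sd_mat)
```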

2.4 Identifiability

Identifiability is a concern for the mixture of experts distributional regression model: identifiability problems can arise from the mixture specification, from the additive structured predictors and from the softmax mapping to the mixture weights.

2.4.1 Mixture models

For any mixture model, trivial identifiability problems may arise due to label switching and overfitting with either empty or duplicated components. In addition, generic identifiability problems may occur both for mixtures of distributions and for mixtures of regressions. A detailed overview of identifiability problems in the finite mixture case is given in Frühwirth-Schnatter (2006).

2.4.2 Additive structured predictors

The components of the additive model with predictor structures of the form (3) are in general only identifiable up to a constant and require further restrictions if both linear and nonlinear smooth effects are defined for one and the same covariate. For example, \(\eta _j = \beta _{0,j} + \phi _{1,j}(x_1)\) can be equally represented by \({\tilde{\beta }}_{0,j} + {\tilde{\phi }}_{1,j}(x_1)\) with \({\tilde{\beta }}_{0,j} = \beta _{0,j} + c\) and \(\tilde{\phi }_{1,j}(x_1) = {\phi }_{1,j}(x_1) - c\) for any \(c\in \mathbb {R}\), i.e., by adding and subtracting a constant c in both terms. We ensure the identifiability of nonlinear smooth terms \(\phi _{l,j}\) by using sum-to-zero constraints for all nonlinear additive functions such as splines or tensor product splines. The identifiability of a linear effect \(x\beta\) in the presence of a univariate nonlinear effect \(\phi (x)\) must also be ensured. In this case, several different options exist (see, e.g., Rügamer et al. 2023b). The most straightforward way is to use a nonlinear effect \(\phi (x)\) whose basis representation includes the linear effect in its null space (and hence, with sufficient penalization as discussed in Sect. 3.3, reduces to a linear effect). In this case, \(\mathcal {L}(j)\) consists only of variables that are modeled exclusively through a linear component (e.g., categorical effects) and \(\mathcal {L}(j) \cap \mathcal {S}(j) = \emptyset\).
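The sum-to-zero constraint can be absorbed into the basis by a QR reparameterization, as is standard for structured additive models; the following R sketch illustrates the idea on a generic B-spline basis (the actual implementation may differ).

```r
# Absorb the sum-to-zero constraint 1' Z gamma = 0 into the basis:
# reparameterize Z so that every function in its span sums to zero
# over the observed covariate values.
constrain_sum_to_zero <- function(Z) {
  C <- matrix(colSums(Z), ncol = 1)   # constraint vector Z' 1
  Q <- qr.Q(qr(C), complete = TRUE)   # orthonormal basis, first column prop. to C
  Z %*% Q[, -1, drop = FALSE]         # drop the direction violating the constraint
}

x  <- runif(100)
Zc <- constrain_sum_to_zero(splines::bs(x, df = 9))
max(abs(colSums(Zc)))   # ~ 0: every basis column now sums to zero
```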

3 Model representation and robust estimation

The observed data log likelihood is, in general, not concave and thus difficult to optimize. We suggest using optimizers from the field of deep learning by framing our model as a neural network. This enables fitting the full model class of mixture of experts distributional regression models with additive structured predictors in a straightforward way. Numerical experiments confirm that this choice, when properly trained, is not only more flexible and robust than EM-based optimization but also makes large dataset applications feasible due to mini-batch training.

3.1 Neural network representation

Models described in (1) and (3) can be represented as neural networks in the following way. An exemplary architecture is depicted in Fig. 1. The network architecture implementing model (1) defines at most K subnetworks, where each subnetwork models one or more additive structured predictors \(\eta _{j}\) of a distribution parameter \(\theta _{j}\). Using an appropriate parameter-free transformation function, these additive structured predictors are mapped to the distribution parameters and passed to a distribution layer (Dillon et al. 2017). Each distribution layer corresponds to a mixture component \(f_m\) that is further passed to a multinomial or categorical distribution layer, modeling the mixture of all defined distributions. The mixture weights can either be directly estimated or also learned on the basis of input features using additive structured predictors in another subnetwork. A classical linear mixture regression combining M linear regressions would, e.g., be given by M subnetworks, each learning the expectation of a normal distribution and a mixture subnetwork that only takes a constant input (a bias) and learns the M mixture weights. The individual additive predictors for each subnetwork are, in most cases, fully connected layers with only one neuron. Such a layer describes a weighted linear combination of inputs \({\varvec{x}}\) with weights (here the regression coefficients \({\varvec{\beta}}\)). The same representation can also be used for basis function evaluations \({\varvec{z}}\) for nonlinear functions \(\phi\). For more details, we refer to Rügamer et al. (2023). Due to the unifying network structure of our approach, the use of other layers or deep neural networks as subnetworks is also possible, but not further discussed in this article (see Rügamer et al. 2023b, for details).

Fig. 1

Example of an architecture. Smaller subnetworks (subnet) learn one or more parameters of a distribution which is defined in the respective distribution layer. For the first distribution in this example, each distribution parameter in \({\varvec{\theta}}_1\) is learned through a separate network, while the second distribution is learned by a network that outputs all parameters \({\varvec{\theta}}_2\) together (e.g., used when \({\varvec{\theta}}_2\) are constrained parameters such as probabilities). Each distribution layer thus corresponds to a distributional regression model. The mixture model is then defined by an additional subnetwork that learns the mixture weights \({\varvec{\pi}}\) as well as by the M learned distributions \(f_1, \ldots , f_M\)

3.2 Optimization routines

Representing model (1) with predictors (3) as a neural network makes a plethora of optimization routines from deep learning readily available for model estimation. Various first-order stochastic optimization routines exist for neural networks. Most of these optimizers work with mini-batches \(J_1,\ldots ,J_B \subset \{1,\ldots ,n\}\) of the data and hence perform stochastic gradient descent (SGD). The update of the parameters \({\varvec{\psi}}^{[t]}\) in iteration \(t\in \mathbb {N}\) for mini-batch \(J_b\) is

$$\begin{aligned} {\varvec{\psi}}^{[t]} = {\varvec{\psi}}^{[t-1]} - \nu ^{[t]} \frac{1}{\mid J_b\mid } \sum _{i\in J_b} \nabla _{{\varvec{\psi}}} \ell _i({\varvec{\psi}}^{[t-1]}), \end{aligned}$$
(6)

where \({\varvec{\psi}}^{[0]}\) is some starting value, \(\nabla _{{\varvec{\psi}}} \ell _i({\varvec{\psi}}^{[t-1]})\) are the individual gradients of the negative log-likelihood contributions of the observations in the batch evaluated at the parameters of the previous iteration and \(\nu ^{[t]}\) is a learning rate updated using a learning rate scheduler. We determine the starting value \({\varvec{\psi}}^{[0]}\) with the Xavier uniform initialization scheme (Glorot and Bengio 2010). For the update of \(\nu ^{[t]}\), we found the following learning rate schedulers useful for the optimization of mixture of experts distributional regression models: RMSprop (Hinton et al. 2012), Adadelta (Zeiler 2012), Adam (Kingma and Ba 2014) and Ranger (Wright 2019). These optimizers in turn often come with hyperparameters such as their initial learning rate which need to be adjusted. We will investigate their influence in our numerical experiments section.
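The following self-contained R sketch illustrates update (6) combined with RMSprop-style adaptive scaling on a toy two-component Gaussian mixture with constant weights; it uses finite-difference gradients for brevity and is only an illustration, not the TensorFlow-based routine used in practice.

```r
set.seed(1)
# Toy data: two-component Gaussian mixture with constant weights
y <- c(rnorm(150, -2, 1), rnorm(350, 3, 0.8))

# Parameterization psi = (eta_pi, mu1, log sd1, mu2, log sd2); the logistic /
# exp transformations play the role of the functions h_j.
nll_i <- function(psi, yi) {
  pi1 <- plogis(psi[1])   # two-component softmax reduces to the logistic function
  -log(pi1 * dnorm(yi, psi[2], exp(psi[3])) +
       (1 - pi1) * dnorm(yi, psi[4], exp(psi[5])))
}

num_grad <- function(f, psi, yi, h = 1e-5) {  # central finite differences
  vapply(seq_along(psi), function(j) {
    e <- replace(numeric(length(psi)), j, h)
    (sum(f(psi + e, yi)) - sum(f(psi - e, yi))) / (2 * h)
  }, numeric(1))
}

psi <- c(0, -1, 0, 1, 0)       # starting values psi^[0]
v   <- numeric(length(psi))    # RMSprop moving average of squared gradients
rho <- 0.9; nu <- 0.01; eps <- 1e-8; batch_size <- 32

for (t in 1:2000) {
  Jb  <- sample(length(y), batch_size)        # mini-batch J_b
  g   <- num_grad(nll_i, psi, y[Jb]) / batch_size
  v   <- rho * v + (1 - rho) * g^2            # adaptive per-parameter scaling
  psi <- psi - nu * g / sqrt(v + eps)         # update (6) with RMSprop step
}
c(pi1 = plogis(psi[1]), mu = psi[c(2, 4)], sd = exp(psi[c(3, 5)]))
```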

3.3 Penalized estimation

Estimating the model class of mixtures of experts distributional regressions with additive structured predictors can benefit from a penalized log-likelihood specification. Such a penalization allows controlling the smoothness (or wiggliness) of the nonlinear smooth functions in the additive structured predictors and inducing sparsity both in the estimation of the additive structured predictors and in the mixture weights.

3.3.1 Additive structured predictors

In order to estimate suitable nonlinear smooth additive structured predictors based on splines within the neural network, the respective coefficients in each layer may require regularization using appropriate penalties or penalty matrices. The smoothness (or wiggliness) of the nonlinear additive structured predictor can be calibrated either by selecting a suitable dimension of the basis representation or by imposing a penalization on the respective coefficients in combination with a generous basis representation. Using the latter approach, the penalized version of (4) is given by

$$\begin{aligned} \ell _{pen}({\varvec{\psi}}) = \ell ({\varvec{\psi}}) + \sum _{l \in \Lambda } \lambda _l {\varvec{\gamma}}_l^\top \varvec{P}_l {\varvec{\gamma}}_l, \end{aligned}$$
(7)

where \(\Lambda\) collects the indices of the terms that are penalized using a quadratic penalty with individual smoothing parameters \(\lambda _l\) and individual quadratic penalty matrices \(\varvec{P}_l\) (see, e.g., Wood 2017). While estimating the tuning parameters is possible by including a more elaborate optimization routine in the neural network such as the Fellner–Schall method (Wood and Fasiolo 2017), we use the approach suggested by Rügamer et al. (2023b) and tune the different smooth effects by relating \(\lambda _l\) to their respective degrees of freedom \(\text {df}_l\). This has the advantage of training the network in a simple backpropagation procedure with little to no tuning, e.g., by setting \(\text {df}_l\) equal to a pre-specified value for all \(l \in \Lambda\). More specifically, while using the same \(\lambda _l\) value for all smooth terms would result in (very) different penalization strengths for every term, choosing the \(\lambda _l\) values through one shared df value, i.e., \(\text {df}_l \equiv \text {df}\), yields equally penalized smooth terms for all \(l \in \Lambda\). Setting this shared df value to a moderate number, e.g., \(\text {df} = 9\), proved to work well as a prior assumption for each smooth term and circumvents expensive tuning schemes by fixing all \(\lambda _l\) values a priori.
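One way to translate a prescribed df value into a smoothing parameter is to invert the usual effective-degrees-of-freedom formula of a quadratic-penalty smoother, \(\text {df}(\lambda ) = \text {tr}\{(\varvec{Z}^\top \varvec{Z} + \lambda \varvec{P})^{-1}\varvec{Z}^\top \varvec{Z}\}\), by a root search. The R sketch below illustrates this idea for a B-spline basis with a second-difference penalty; it is meant to convey the principle only and is not the exact procedure of Rügamer et al. (2023b).

```r
# Effective degrees of freedom of a quadratic-penalty smoother:
# df(lambda) = trace{ (Z'Z + lambda * P)^(-1) Z'Z }
df_of_lambda <- function(lambda, Z, P) {
  ZtZ <- crossprod(Z)
  sum(diag(solve(ZtZ + lambda * P, ZtZ)))
}

# Find lambda such that the smooth term has a pre-specified df (e.g., 9)
lambda_for_df <- function(Z, P, df_target = 9, upper = 1e8) {
  uniroot(function(l) df_of_lambda(l, Z, P) - df_target,
          interval = c(1e-10, upper))$root
}

set.seed(1)
x <- runif(200)
Z <- splines::bs(x, df = 20)                            # generous B-spline basis
P <- crossprod(diff(diag(ncol(Z)), differences = 2))    # second-difference penalty
lambda_for_df(Z, P, df_target = 9)
```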

3.3.2 Entropy-based penalization of mixtures

In order to penalize an excessive number of nonzero mixture weights, we further introduce an entropy-based penalty for the mixture weights that can simply be added to the objective function in (7):

$$\begin{aligned} \ell _{ent}({\varvec{\psi}}) = \ell _{pen}({\varvec{\psi}}) - \xi \sum _m \pi _m \log \pi _m. \end{aligned}$$
(8)

The second part of (8) is controlled by a tuning parameter \(\xi \in \mathbb {R}^+_0\) and corresponds to the entropy induced by the (estimated) marginal mixture weights \({\varvec{\pi}}\), i.e., in case the mixture weights depend on covariates \({\varvec{x}}\), the mixture weights obtained when averaging over the observed \({\varvec{x}}\). A large value of \(\xi\) enforces the weight distribution to be sparse in the number of nonzero elements of \({\varvec{\pi}}\), while smaller values will result in an (almost) uniform distribution of \({\varvec{\pi}}\). As the entropy is invariant to permutations of the components in \({\varvec{\pi}}\), this penalty is particularly suitable for mixtures that are only identified up to a permutation of the component labels. We investigate the effects of this tuning parameter \(\xi\) in Sect. 4.5. While it is technically possible to use the feature-dependent conditional mixture weights \(\pi _m({\varvec{x}})\) in (8), further research is required to investigate the effect of the entropy-based penalty in this case.
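A minimal R sketch of the entropy term added to the objective in (8); the small constant guarding against \(\log 0\) is an implementation detail added here.

```r
# Entropy-based penalty term added to the objective in (8);
# xi scales its contribution, eps guards against log(0).
entropy_penalty <- function(pi, xi, eps = 1e-12) {
  -xi * sum(pi * log(pi + eps))
}

entropy_penalty(c(0.5, 0.5), xi = 1)       # maximal entropy for M = 2
entropy_penalty(c(0.999, 0.001), xi = 1)   # nearly sparse weights: close to 0
```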

4 Numerical experiments

We now investigate our framework in terms of predictive and estimation performance. To this end, we first compare our neural mixture of experts distributional regression (NMDR) approach with EM-based optimization routines to demonstrate its competitiveness with state-of-the-art procedures. For this comparison, the model class considered is restricted to constant mixture weights and linear predictors for the distributional parameters. We then evaluate our approach on a mixture of generalized additive regression models with additional noise variables to demonstrate the framework’s efficacy when including additive predictor structures. Finally, we investigate an overfitting mixture setting, in which we simulate situations where EM-based optimization fails and NMDR with the entropy-based penalty leads to superior performance. In the Appendix, we also perform an empirical investigation of different optimization routines in the neural network context to provide a practical guideline for users of our framework. Results show that adaptive methods are more robust than vanilla stochastic gradient descent optimization. A second simulation study in the Appendix further investigates the model performance of different approaches when using concomitant variables.

4.1 Evaluation metrics

We measure the estimation performance of both the regression coefficients and the mixture weights \({\varvec{\pi}}\) using the root mean squared error (RMSE). The prediction performance is measured using the predictive log scores (PLS; Gelfand and Dey 1994) on an independent test dataset of the same size as the training dataset. The agreement between the true and estimated class memberships is measured using the adjusted Rand index (ARI; Hubert and Arabie 1985) as well as the accuracy (ACC), where the estimated class memberships are derived from the maximum a posteriori probabilities of the mixture components. In order to deal with the problem of label switching for the non-label-invariant performance measures, we first determine the labeling which induces the accuracy-optimal assignment and then calculate these performance metrics using this labeling. The accuracy-optimal labeling is computed using the true class memberships (known here, as the data are simulated) and the estimated class memberships as induced by the mixture posterior probabilities estimated by each model.
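The accuracy-optimal relabeling can be found by enumerating all label permutations, which is feasible for the small M considered here. The following R sketch (with illustrative function names) shows this for the ACC; the ARI could, e.g., be computed with mclust::adjustedRandIndex.

```r
# Align estimated component labels with the true ones by choosing the
# accuracy-optimal permutation (feasible for small M), then compute ACC.
best_label_accuracy <- function(true, est) {
  labs  <- sort(unique(c(true, est)))
  perms <- function(v) {                      # all permutations of a vector
    if (length(v) <= 1) return(list(v))
    do.call(c, lapply(seq_along(v), function(i)
      lapply(perms(v[-i]), function(p) c(v[i], p))))
  }
  accs <- vapply(perms(labs), function(p) {
    mean(p[match(est, labs)] == true)         # accuracy under this relabeling
  }, numeric(1))
  max(accs)
}

true <- c(1, 1, 2, 2, 3, 3)
est  <- c(2, 2, 3, 3, 1, 1)     # same partition, permuted labels
best_label_accuracy(true, est)  # 1
```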

4.2 Initialization and optimization

The weights in all our experiments are initialized using the Xavier uniform initialization (Glorot and Bengio 2010). Unless specified otherwise, we use RMSprop as optimizer with a learning rate of 0.001, a maximum of 10,000 iterations, early stopping with a patience of 250 epochs, an additional reduction of the learning rate when reaching a plateau for 150 epochs, a validation split of 0.1 and a batch size of 32.

4.3 Comparison with EM-based optimization

We first compare the NMDR framework to an EM-based algorithm implemented in the R package gamlss.mx (Stasinopoulos and Rigby 2016), which allows for mixtures of various distributional regressions. We use \(n \in \{300, 2500\}\) observations and \(M \in \{2,3,5,10\}\) mixture components from the same distributional family, either Gaussian, Laplace or logistic, each defined by a location and a scale parameter; the mixture weights are randomly drawn such that the smallest weight is not less than 3%. We use \(p_m \in \{2, 10\}\) features for each distribution and distribution parameter in the mixture, and regression coefficients sampled from a \(\mathcal {U}(-2,2)\)-distribution. We run gamlss.mx with a fixed budget of 20 restarts and compare these results to NMDR using 1 or 3 random initializations to assess the effect of multiple restarts. Each experimental configuration is replicated 10 times. All fitted models in this simulation are correctly specified, i.e., they correspond to the data generating process.
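For concreteness, the following R sketch generates data from one Gaussian configuration of this design; the standard normal feature distribution and the damping of the log-scale predictor are assumptions made here for a stable illustration and are not fully specified above.

```r
set.seed(1)
n <- 300; M <- 3; p_m <- 2

# Mixture weights: redraw until the smallest weight is at least 3%
repeat {
  w <- runif(M)
  pi_true <- w / sum(w)
  if (min(pi_true) >= 0.03) break
}

# Component-specific coefficients for location and (log-)scale, from U(-2, 2)
beta_mu  <- matrix(runif(M * p_m, -2, 2), nrow = M)
beta_sig <- matrix(runif(M * p_m, -2, 2), nrow = M)

# Features (assumed standard normal here) and latent component memberships
X <- matrix(rnorm(n * p_m), nrow = n)
z <- sample(M, n, replace = TRUE, prob = pi_true)

# Gaussian responses with feature-dependent location and scale
mu    <- rowSums(X * beta_mu[z, , drop = FALSE])
sigma <- exp(rowSums(X * beta_sig[z, , drop = FALSE]) / 4)  # damped for stability
y     <- rnorm(n, mu, sigma)
```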

Fig. 2

Comparison of EM-based optimization (EM), NMDR with one (NMDR) or three (NMDR_3) optimization runs for different distributions (columns), measures (rows) and combinations of n and \(p = p_m\). Boxplots contain all 10 runs and the four different settings for M (i.e., in total 40 data points per boxplot). Missing boxplots for EM are due to missing solutions caused by missing values when comparing current results to a convergence threshold. RMSEs for coefficients \(> 3\) and PLS values \(< -50\) are omitted to improve readability

4.3.1 Results

Results are visualized in Fig. 2 and yield four important findings. First, the EM-based approach only provides results for \(p_m = 2\); it fails to converge to any meaningful solution in all settings with \(p_m = 10\), whereas NMDR’s performance is affected by the increased number of predictors but still yields reasonable results for \(n=2500\), even without restarts. Second, in the case \(p_m = 2\), the EM-based approach in general provides a better classification performance than NMDR (as indicated by ACC Prob. and ARI Prob.). Third, for \(p_m = 2\), the difference in estimation performance between the EM-based and the NMDR approach is often negligible in terms of the RMSE of the estimated parameters. Fourth, while the regularization of the SGD-based routine induces a bias in the estimation and hence typically larger estimation errors, the predictive performance of NMDR is always better than that of the EM-based approach. Note that the best model for NMDR with multiple restarts is chosen based on the in-sample log score, which does not necessarily imply better out-of-sample performance compared to a single optimization run.

4.4 Mixture of additive regression models

Next, we investigate mixtures of mean regression models with nonlinear smooth effects in the additive predictors of the mean distribution parameter. We generate \(n=2500\) data points from \(M = 3\) mixtures of Poisson or normal distributions with the additive structured predictors of the means defined by \(\eta _1({\varvec{x}}) = \beta _0 + \phi _1(x_1) + \phi _2(x_2)\), \(\eta _2({\varvec{x}}) = \beta _0 + \phi _2(x_1) + x_2\) and \(\eta _3({\varvec{x}}) = \beta _0 + x_1 + \phi _3(x_2)\) with \(\beta _0 = 0.5\), \(\phi _1(x) = 2 \sin (3x)\), \(\phi _2(x) = \exp (2x)\) and \(\phi _3(x) = 0.2 x^{11} (10 (1 - x))^6 + 10 (10 x)^3 (1 - x)^{10}\). All covariates are independently drawn from a uniform distribution \(\mathcal {U}(0,1)\). We model all nonlinear effects using thin-plate regression splines (Wood 2017) for all methods (see the sketch after this paragraph for the data-generating functions). Note that for large datasets, other basis functions are preferable, as the preprocessing step for setting up thin-plate regression spline bases scales quadratically with the number of observations (Wood 2003). Since the preprocessing is done prior to setting up the neural network, our proposed approach can be flexibly combined with any of the existing spline basis function approaches. For Poisson data, we use \(h(\cdot ) = \exp (\cdot )\) and the identity for the Gaussian case. We vary \({\varvec{\pi}}\) to be either (1/3, 1/3, 1/3) or (1/10, 3/10, 6/10) and add 3 or 10 noise variables that are also included in the model as nonlinear smooth predictors for the expectation parameter. We use two different scale values, 2 or 4, which either define the Gaussian variance in each mixture component or a multiplicative offset effect in the Poisson case. Our method is compared with a state-of-the-art implementation of mixtures of additive models in the R package flexmix (Leisch 2004) and, as an oracle, a generalized additive model with varying coefficients for all smooth effects, where the class label (unknown to the other two approaches) is used as the varying parameter. These two methods determine the smoothing parameters via an outer optimization loop as implemented in mgcv (see Wood 2017, for details). For NMDR, the smoothing parameters are determined as described in Sect. 3.3 via the respective degrees of freedom, which are set for all smooths to 10 for the normal and 6 for the Poisson distribution. This results in “equally smooth” nonlinear effects a priori. The combination of gradient descent updates and early stopping introduces additional regularization, which favors parameters that result in a low prediction error, akin to other out-of-sample or model selection information criteria commonly employed to determine smoothing parameters (Wood and Fasiolo 2017). The adaptive learning rates designated for individual parameters can further lead to variable penalization across different additive components.
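For reference, the nonlinear functions and additive predictors of this data-generating process translate directly into R as follows (a plain transcription of the setup above).

```r
# Nonlinear functions used in the additive predictors of Sect. 4.4
phi1 <- function(x) 2 * sin(3 * x)
phi2 <- function(x) exp(2 * x)
phi3 <- function(x) 0.2 * x^11 * (10 * (1 - x))^6 + 10 * (10 * x)^3 * (1 - x)^10

n  <- 2500
x1 <- runif(n); x2 <- runif(n)
beta0 <- 0.5

# Additive predictors of the means for the three mixture components
eta1 <- beta0 + phi1(x1) + phi2(x2)
eta2 <- beta0 + phi2(x1) + x2
eta3 <- beta0 + x1 + phi3(x2)
```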

Fig. 3

Comparison of average PLS and LS as well as ACC and ARI of a state-of-the-art implementation (EM), an oracle varying coefficient model with known class memberships and our approach (NMDR) in different colors for the two distributions and two scales (x-axis). The boxplots contain the results for 10 replications over two settings for the mixture weights and two settings for the number of noise variables

4.4.1 Results

The comparison of all settings is depicted in Fig. 3, showing the average log score (LS; calculated on the training dataset using the estimated model parameters) and PLS, as well as the ARI and ACC. The boxplots summarize all 10 simulation replications over the two settings for the mixture weights and the two settings for the number of noise variables. Results suggest that our approach leads to better predictions as measured by the PLS, but is inferior in terms of LS. The smaller LS values are possibly due to fewer data points being available for fitting the model because of the required validation set, and due to the shrinkage induced by early stopping. The median clustering performance based on the estimated posterior probabilities is in general on par with the EM-based approach (flexmix) in terms of ARI and ACC, and even slightly better in the Gaussian case.

4.5 Misspecified mixtures and sparsity

In this simulation, we use a normal mixture with \(p_m = 10\) fixed linear predictors for each distribution and distribution parameter (mean and variance), where all features are again drawn from a standard normal distribution and the regression coefficients from a uniform \(\mathcal {U}(-2,2)\)-distribution. The data are then generated using \(M=2\) actual mixture components with \(\pi _1\) drawn (independently of the features) from a uniform distribution on the interval (0.06, 0.94) and \(\pi _2 = 1 - \pi _1\), ensuring that the minimum of both probabilities is at least 6%. We then evaluate the estimation of the mixture probabilities by NMDR for \(n\in \{300, 2500\}\) when increasing the number of specified distributions \(M^\dagger \in \{3, 5, 10\}\). To allow for sparsity in \({\varvec{\pi}}\), we use the objective function \(\ell _{ent}\) introduced in Sect. 3.3. For each scenario, 10 replications are performed.

Fig. 4

Model quality for misspecified models with specified mixtures \(M^\dagger \in \{3, 5, 10\}\) (columns, counting the additional components) instead of actual \(M=2\) mixtures and different goodness of fit measures (rows) for 10 replications (boxes). Colors correspond to different settings of \(\xi\) or represent estimation results of the correct model (black)

4.5.1 Results

Results for various settings of the entropy penalty parameter \(\xi\) are depicted in Fig. 4. While setting \(\xi\) to small values larger than zero can improve the predictive performance, even outperforming the correctly specified model without additional distribution components, the bias induced by the penalty generally decreases the estimation performance. In practice, an appropriate amount of penalization can be found by running cross-validation along a grid of different \(\xi\) values, as is done, e.g., for the Lasso (Tibshirani 1996).

We additionally investigate the coefficient path obtained from varying values of the penalty parameter \(\xi\) for one simulated example. Results for the \(n=2500, M^\dagger = 5\) setting are depicted in Fig. 5. \(\xi\) is varied between 0 and 1 on a logarithmic scale. The true model has two nonzero probabilities 0.6077 and 0.3923 while the other 3 entries in \({\varvec{\pi}}\) are 0.

Fig. 5

Coefficient path (estimated mixture probabilities) for different entropy penalties. A value of around 1e-02 yields the best trade-off between sparsity and estimation performance

5 Cell cycle-regulated genes of yeast

In order to demonstrate the flexibility of our approach, we investigate its application to the yeast cell cycle dataset from Spellman et al. (1998). In this study, genome-wide mRNA levels were measured for 6178 yeast open reading frames (ORFs) for 119 min at 7-min intervals. We here analyze the subset of data where all 18 time points for the alpha factor arrest are available. The resulting longitudinal dataset consists of 80,802 observations of the standardized expression levels. A subset of this dataset was also analyzed using mixture models in Grün et al. (2011).

5.1 Distributional mixture regression

As both the mean and standard deviation of the standardized expression levels of genes change over time, we apply a mixture of distributional regressions model where the mean \(\mu\) and the standard deviation \(\sigma\) of the normally distributed mixture components depend on time, i.e., \(Y_t \sim \sum _{m=1}^M \pi _m \mathcal {N}(\mu _{m,t}, \sigma ^2_{m,t})\) for \(t\in [0,119]\). The additive structured predictors for these distribution parameters are defined as

$$\begin{aligned} h^{-1}(\mu _{m,t}) = \beta _{0,m,1} + \phi _{m,1}(t) \quad \text{ and } \quad h^{-1}(\sigma _{m,t}) = \beta _{0,m,2} + \phi _{m,2}(t), \end{aligned}$$

where the nonlinear smooth functions \(\phi\) are modeled by thin-plate regression splines (Wood 2003). Previous approaches for modeling this dataset investigated the use of a mixture of mixed models, i.e., the inclusion of gene-specific random effects (Luan and Li 2003; Grün et al. 2011). We investigate here an alternative option for modeling this additional heterogeneity by allowing for time-varying standard deviations. We use \(M=6\) which corresponds to the number of mixture components identified by Spellman et al. (1998).

5.2 Results

In order to plot the estimated smooth effects together with the observed trajectories, we first derive the component membership of every gene. As in the E-step of mixture model approaches (see, e.g., Grün et al. 2011), we calculate the a posteriori probability of every gene belonging to component m and assign each gene to the component with the maximum posterior probability among the components \(1,\ldots ,M=6\). For this application, observations were only assigned to 5 of the 6 assumed components. Note that due to the nature of our optimization routine, not all components necessarily contain at least one observation.
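A minimal R sketch of this assignment step; mu_hat, sigma_hat and pi_hat are placeholders for the fitted time-varying component means and standard deviations and the estimated weights.

```r
# Posterior probability of a gene belonging to component m, given its
# trajectory y_g over the 18 time points and the fitted component curves.
# mu_hat, sigma_hat: 18 x M matrices of fitted means / sds; pi_hat: M weights.
posterior_membership <- function(y_g, pi_hat, mu_hat, sigma_hat) {
  log_post <- vapply(seq_along(pi_hat), function(m) {
    log(pi_hat[m]) + sum(dnorm(y_g, mu_hat[, m], sigma_hat[, m], log = TRUE))
  }, numeric(1))
  post <- exp(log_post - max(log_post))   # stabilized normalization
  post / sum(post)
}

# A gene is then assigned to the component with maximal posterior probability:
# which.max(posterior_membership(y_g, pi_hat, mu_hat, sigma_hat))
```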

Fig. 6

Trajectories of ORFs (y-axis) for all 4489 genes (black lines) over the course of the 18 time points (x-axis) with the estimated mean trend (red solid line) per component (facets) and uncertainty visualized by two times the estimated (time-varying) standard deviation (shaded red area)

Comparing the number of observations assigned to each cluster, one sees that the most common component in our results is cluster 6 with 39,078 observations. Cluster 4 contains 25,722 observations, cluster 1 contains 9306 and cluster 5 contains 6552 observations. The smallest number of observations is assigned to cluster 2, which contains only 144 observations.

Figure 6 visualizes the results in a panel plot where each panel shows the trajectories of the ORFs for all genes assigned to the respective cluster together with the component-specific estimates of the time-varying means and standard deviations. The identified clusters clearly differ in whether their means initially decrease or increase. In addition, the standard deviations of the clusters vary over time, with some clusters exhibiting a particularly large amount of heterogeneity at later time points.

6 Outlook

We have introduced the class of mixtures of experts distributional regression with additive structured predictors and investigated its embedding into neural networks for robust model estimation. Overall, this leads to the neural mixture of experts distributional regression (NMDR) approach. We show that popular first-order adaptive update routines are well suited for learning these mixture of experts (distributional) regression models. We also highlight that the embedding into a neural network estimation framework allows for straightforward extensions of the general mixture model class and for (regularized) maximum-likelihood estimation using optimization routines that are suitable for big data applications due to mini-batch training. Using the proposed architecture for mixture of experts distributional regression, a possible extension of our approach is therefore the combination with other (deep) neural networks. This allows learning both the distribution components and the mixture weights either by (a) a structured model, such as a linear or additive model, (b) a custom (deep) neural network or (c) a combination thereof. A similar approach has been investigated by Fritz et al. (2022) using a zero-inflated Poisson model (i.e., a mixture including a point mass distribution), which includes both additive effects and a graph neural network in the additive predictor.