1 Introduction

Models of consumer heterogeneity play a pivotal role in marketing and economics. Typical applications are random coefficients or mixed logit models for aggregate or panel data (e.g., Revelt and Train 1998 and Train 2009), and hierarchical Bayesian models. Influential applications of these models involve inference from household scanner panel data or from discrete choice experiments (e.g., Allenby and Lenk 1994, Rossi et al. 1996, Allenby et al. 1998, Dubé et al. 2010, and Sawtooth 2013). In most applications, the inferential target pertains to a population beyond the sample of consumers providing the data for model calibration. For example, pricing, product design, or product line decisions informed by the sample data through the model are expected to be optimal in the population and not just in the observed, finite sample. The population model, i.e., the heterogeneity or random coefficients distribution, is the natural and correct basis for generalizations from the observed sample of consumers or respondents to the market. The fact that inferences about parameters of this distribution are consistent in the sample size (N), even if the number of observations contributed by each consumer (T) is very small, makes this approach attractive from a statistical perspective.

Unfortunately, standard population distributions often lack economic rationality. For example, Reiss and Wolak (2007) remark that the estimated distribution of marginal utility of fuel economy in Berry et al. (1995) suggests that about half of consumers in the car market dislike fuel economy. As another example, Dubé et al. (2008, 2010) find support for positive price coefficients in the inferred heterogeneity distribution. Such economically unreasonable characterizations of consumer heterogeneity prevent meaningful counterfactual predictions from the model. As an obvious example, models that support positive price coefficients in the inferred heterogeneity distribution preclude model-based price optimization.

While a completely theory-driven specification of heterogeneity distributions appears to be beyond reach, some authors argue in favor of theory-driven constraints in the population distribution (e.g., Boatwright et al. 1999 and Allenby et al. 2014). The goal is a heterogeneity model that is maximally flexible regarding some aspects of the population distribution, but deterministically constrained by economic theory regarding other aspects of this distribution. This paper builds on this idea and develops it further.

In applications, a prior understanding of preferences in the population often suggests a large number of sign and order restrictions, for example: that the price parameter in an indirect utility function is negative or that consumers prefer a more fuel efficient to a less fuel efficient car, everything else equal. So-called constrained parameter problems are relevant across academic fields, and a sizable body of literature has dealt with this topic. Gelfand et al. (1992) provide an overview of how to impose sign and order constraints based on truncated distributions using Gibbs sampling. Allenby et al. (1995) introduce this approach into marketing in the context of individual level conjoint analysis. Boatwright et al. (1999) develop a sampler in the spirit of Gelfand et al. (1992), but for a hierarchical sales response regression model.

However, sign and order restrictions in models of heterogeneity still present unresolved challenges. In principle, one could adopt truncated normal distributions that implement prior constraints as outlined in Gelfand et al. (1992) for heterogeneity distributions. However, as we show below, any truncated distribution of heterogeneity leads to a so-called “doubly intractable” inference problem. The log-normal prior avoids this difficulty. The basic idea of using log-normal distributions to implement sign and order constraints is not new. For example, Allenby et al. (2014) use the exponential transformation, \(\beta _{p} = -\exp (\beta _{p}^{\ast })\) with \(\beta _{p}^{\ast } \in \mathbb {R}\) distributed according to a hierarchical normal mixture prior, to enforce that the model has zero support for positive price coefficients. In this specification, the problem is that \(\beta _{p}^{\ast }\) is measured on the log scale, and standard diffuse subjective prior settings imply absurdly large and small values of the transformed coefficients βp (e.g., Allenby et al. 2014). In the common situation where the heterogeneity distribution comprises both constrained and unconstrained coefficients, the choice of subjective prior parameters is thus an unresolved challenge.

As a solution to this problem we propose a marginal-conditional decomposition that avoids the conflict between wanting to be more subjectively informative about constrained parameters and only weakly informative about unconstrained parameters. We show that this decomposition is important whenever the heterogeneity distribution comprises a mix of constrained and unconstrained coefficients, e.g., brand and price coefficients. Our decomposition applies both to the fully parametric multivariate normal setting as well as to its semi-parametric generalizations. In addition, we show how to efficiently sample from the implied posterior building on the likelihood based pre-tuning of proposal densities in Rossi et al. (2005).

Finally, we contrast the profit implications of relying on the inferred population distribution with those of an ad-hoc approach that approximates heterogeneity using means of individual level coefficients. This latter approach is still common in applied academic and industry research. It is ad-hoc because it fails to measure heterogeneity consistently, distorting inference towards the population mean. As a consequence, markets will misleadingly appear too homogeneous, translating into too little product differentiation and too much price competition in counterfactual calculations. A side-effect of this distortion is a reduction of sign and order violations in the approximated heterogeneity distribution, which likely contributed to the popularity of this ad-hoc approach.

In a nutshell, the goal of this paper is to facilitate the formulation of more economically faithful hierarchical prior distributions of heterogeneity for better market simulators and improved counterfactual calculations. We thereby hope to broaden the applicability of models of heterogeneity, and to convince applied academic and industry researchers to abandon market simulators built on means of individual level preferences. The remainder of the paper proceeds as follows: Section 2 formally introduces different ways of generalizing from hierarchical Bayesian models and discusses implications for market simulation. In Section 3 we develop the hierarchical prior formulation, and in Section 4 we discuss efficient sampling of individual level coefficients. Section 5 then investigates the relative performance of the proposed approach using simulated data. Sections 6 and 7 report the results from two empirical illustrations based on household scanner panel data on purchases of fresh hen’s eggs (Kotschedoff and Pachali 2020) and data from a discrete-choice experiment on tablet PCs. Finally, we summarize and discuss results in Section 8.

2 Different ways of generalization and market simulation

Different ways of generalizing from hierarchical models to consumer preferences, choices, and market shares in the target population are best illustrated in a decision theoretic framework. For this purpose, and without loss of generality, we abstract away from competition and fixed costs, and assume constant marginal prices and costs in the following. If the decision-maker knew the distribution of preferences in the population, denoted as p(β|τ), he would choose the action \(a \in A\) that maximizes profits \( {\int \limits } \pi (a,\beta )\ p(\beta |\tau )\ d\beta = \mathbb {E}_{\beta \vert \tau }\left [\pi (a,\beta )\right ] = {\pi }(a)\) by solving the following maximization problem:

$$ \max_{a\in A} \left\{{\pi}(a) \propto \left( P(a)-C(a)\right) \int \text{MS}(a,\beta) p(\beta|\tau)\ d\beta \right\} $$
(1)

Here MS(a,β) is the market share from action a and preference β, as implied by a choice model, C(a) denotes marginal costs associated with action a, and P(a) the marginal price, which may itself constitute an action; thus (P(a) − C(a)) is the contribution margin. Finally, the proportionality results from ignoring the market size.

Because the preference distribution in the population is generally unknown, the decision-maker forms an expectation about profits based on data \(Y=\left (y_{1}, \dots , y_{i}, \dots , y_{N} \right )\), where yi is the Ti-vector of observations from individual i in the sample, and based on prior assumptions about the choice model underlying MS(a,β), the distribution of preferences in the population p(β|τ), and the parameters τ in this distribution. He then maximizes the posterior expected profit:

$$ \hat{\pi}(a) = \mathbb{E}_{\beta|Y}\left[\pi(a,\beta)\right] \propto \left( P(a)-C(a)\right) \int \text{MS}(a,\beta) p(\beta|\tau) p(\tau|Y)\ d\left( \beta,\tau \right) $$
(2)

This estimator of expected profits entirely relies on posterior knowledge of the hierarchical prior distribution. We thus refer to this approach as “generalizing based on the hierarchical prior”. It is easily computed to an arbitrary degree of precision based on MCMC draws from the posterior distribution p(τ|Y ) coupled with draws from the hierarchical prior distribution p(β|τ). However, because it entirely relies on the posterior of the hierarchical prior, all prior parametric assumptions will come to bear. If, for example, the hierarchical prior supports positive and negative price coefficients as in a normal distribution, the posterior of the hierarchical prior will necessarily—and may substantially—support positive price coefficients. The problem may persist even if the data reliably locate all individual specific posterior price coefficient distributions in the negative domain. The reason is that the best normal approximation matches the first and second moment of the distribution to be fitted, which may result in substantial support for positive coefficients even if all coefficients to be fitted are negative.
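To make the computation of Eq. 2 concrete, the sketch below nests draws from the hierarchical prior p(β|τ) inside a loop over posterior draws of τ. The binary-logit market share, the stand-in posterior draws of τ, and all numerical values are hypothetical illustrations, not the models estimated in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def market_share(price, beta):
    """Binary-logit share of the inside good at a given price (illustrative)."""
    v = beta[:, 0] + beta[:, 1] * price        # intercept + price coefficient * price
    return np.mean(1.0 / (1.0 + np.exp(-v)))   # average over draws from p(beta|tau)

def expected_profit(price, cost, tau_draws, m=1000):
    """Eq. 2: margin times the market share averaged over p(beta|tau)p(tau|Y)."""
    shares = []
    for mean, cov in tau_draws:                # MCMC draws of tau = (mean, cov)
        beta = rng.multivariate_normal(mean, cov, size=m)
        shares.append(market_share(price, beta))
    return (price - cost) * np.mean(shares)

# stand-in posterior draws of tau; in practice these come from the MCMC sampler
tau_draws = [(np.array([1.0, -2.0]), 0.5 * np.eye(2)) for _ in range(50)]
profit = expected_profit(price=1.5, cost=1.0, tau_draws=tau_draws)
```

Because the market share lies strictly between zero and one, the estimate is bounded by the contribution margin, and its precision is governed only by the number of posterior and prior draws.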

To mitigate the extrapolation of parametric assumptions in directions that violate economic theory, market simulators often rely on the collection of individual level posterior mean estimates \(\{\hat {\beta }_{i}\}_{i=1}^{N}\), where \(\hat {\beta }_{i}={\int \limits } \beta _{i} p(\beta _{i}|Y,y_{i})d\beta _{i}\). The shrinkage of individual level posterior means towards the population mean in general reduces the number of sign and order violations, albeit at the expense of severely inconsistent inferences about heterogeneity. Expected profits from action a are then estimated as:

$$ \hat{\pi}(a) \propto \left( P(a)-C(a)\right) \frac{1}{N}\sum\limits_{i=1}^{N} \text{MS}(a,\hat{\beta}_{i}) $$
(3)

However, as we illustrate in Appendix A.1, this estimator, which aggregates individual level estimates that are optimal in the sense of a bias-variance trade-off, itself fails optimality criteria and is inconsistent no matter how large the sample of consumers N, as long as individual level likelihoods are not perfectly informative about individual level preferences. In practice, individual level likelihoods tend to be diffuse, which motivates hierarchical models in the first place.
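The source of the inconsistency can be seen in a stylized normal-normal sketch (a deliberate simplification of the MNL setting of the paper, with all values illustrative): posterior means shrink noisy individual level signals toward the population mean, so their sample variance understates true heterogeneity by the shrinkage factor no matter how large N is.

```python
import numpy as np

rng = np.random.default_rng(1)

# true heterogeneity: beta_i ~ N(0, 1); each person contributes T noisy observations
N, T, sigma2 = 5000, 5, 4.0
beta = rng.normal(0.0, 1.0, size=N)
y_bar = beta + rng.normal(0.0, np.sqrt(sigma2 / T), size=N)   # individual signal

# normal-normal posterior means shrink the signal toward the population mean (0 here)
shrink = 1.0 / (1.0 + sigma2 / T)     # prior variance / (prior + noise variance)
beta_hat = shrink * y_bar

var_true = beta.var()                 # close to 1 by construction
var_hat = beta_hat.var()              # close to the shrinkage factor, i.e. too small
```

Making T large drives the shrinkage factor to one and removes the distortion; in typical panels T is small and the understatement of heterogeneity is severe.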

A third estimator of expected profits from action a builds on the collection of individual level posterior distributions. We refer to this form of generalization as “lower level model non-smoothed (n.s.)” because it relies on the lower, individual level models, but does not summarize individual level posteriors into point estimates.

$$ \hat{\pi}(a) \propto \left( P(a)-C(a)\right) \frac{1}{N}\sum\limits_{i=1}^{N} \int \text{MS}(a,\beta_{h}) p(\beta_{h}|y_{i},\tau) p(\tau|Y)\ d\left( \beta_{h},\tau \right) $$
(4)

The difference between this estimator and that defined in Eq. 2 is that yi is used both to inform the posterior p(τ|Y) and to predict consumers’ preferences through p(βh|yi,τ). When individual level posterior distributions essentially degenerate to a point because of highly informative individual level likelihoods, the estimator in Eq. 4 converges to that defined in Eq. 3. When individual level posterior distributions come from diffuse individual level likelihoods, as is typically the case, the estimator in Eq. 4 will be very similar to that in Eq. 2. Thus, parametric assumptions in the hierarchical prior distribution will be similarly influential. Consistent with these assessments, we find only negligible differences between generalizations based on the posterior of the hierarchical prior and the lower level model n.s. in the empirical applications discussed below.

What way of generalization should we use for market simulation in practice? Every trained Bayesian analyst will point out the inconsistency associated with relying on the collection of individual level posterior means. Such an analyst knows that the posterior predictive preference distributions as defined in Eqs. 2 and 4 allow for consistent inference (in N), albeit conditional on functional form assumptions.

However, because standard parametric and semi-parametric assumptions such as multivariate normal or its finite mixture generalization violate basic economic intuition in many applications, consistency conditional on these assumptions is not too helpful. Thus, many applied researchers and practitioners opt for generalizations, i.e., market simulation based on the collection of individual level posterior means (Eq. 3) that often substantially reduce the share of sign and order violations. We aim to overcome the choice between relying on the posterior of a mis-specified hierarchical prior and the collection of individual level posterior means that fail to measure heterogeneity, by showing how to specify more economically faithful hierarchical prior distributions based on prior constraints. The goal is a hierarchical prior that both is maximally flexible regarding some aspects of the population distribution of preferences, and deterministically constrained by theory regarding other aspects of this distribution.

3 Sign and order constraints

Sign and order constraints dogmatically express prior knowledge about the support of a distribution, e.g., that the price parameter in an indirect utility function is negative or that a consumer prefers a more fuel efficient to a less fuel efficient car for sure, everything else equal. So-called constrained parameter problems are relevant across academic fields, and a sizable body of literature has dealt with this topic. Gelfand et al. (1992) provide an overview of how to impose sign and order constraints based on truncated distributions using Gibbs sampling. Allenby et al. (1995) introduce this approach into marketing in the context of individual level conjoint analysis. Boatwright et al. (1999) develop a sampler in the spirit of Gelfand et al. (1992), but for a hierarchical sales response regression model.

However, the implementation of sign and order restrictions in hierarchical Bayesian models is still without a generally accepted solution. In principle, one could adjust the sampler outlined by Gelfand et al. (1992) to hierarchical settings. However, as we show next, any truncation applied to the prior (and hence to the posterior) of individual level coefficients in a hierarchical setting leads to a so-called “doubly intractable” inference problem in the hierarchical prior. Doubly intractable problems are characterized by a normalization constant that depends on the target parameters (e.g., Møller et al. 2006 and Murray et al. 2006). Consider the following truncated normal hierarchical prior for consumers’ demand parameters:

$$ p(\beta | \bar{\beta},V_{\beta}) = \frac{\varphi(\beta |\bar{\beta},V_{\beta})}{\mathbb{Z}(\bar{\beta},V_{\beta})} \mathbf{1}(\beta \in \mathbb{R}^{k}_{c}), $$
(5)

where \(\mathbb {R}^{k}_{c}\) denotes the truncation region of a k-dimensional demand parameter vector β, φ denotes the multivariate normal density and \(\mathbb {Z}(\bar {\beta },V_{\beta })\) the corresponding normalizing constant:

$$ \mathbb{Z}(\bar{\beta},V_{\beta}) = {\int}_{\mathbb{R}^{k}_{c}} \varphi(\beta |\bar{\beta},V_{\beta}) d \beta $$
(6)

The conditional posterior distribution of parameters indexing the hierarchical prior then becomes:

$$ p(\bar{\beta},V_{\beta} | \{\beta_{i}\} ) \propto \prod\limits_{i=1}^{N}\frac{\varphi(\beta_{i} |\bar{\beta},V_{\beta})}{\mathbb{Z}(\bar{\beta},V_{\beta})} \mathbf{1}(\beta_{i} \in \mathbb{R}^{k}_{c}) p(\bar{\beta},V_{\beta}), $$
(7)

where \(p(\bar {\beta },V_{\beta })\) denotes the subjective prior for hierarchical prior parameters. Equation 7 is an example of a doubly intractable inference problem: even after dropping the normalization constant \(\int \left ({\prod }_{i=1}^{N}\frac {\varphi (\beta _{i} |\bar {\beta },V_{\beta })}{\mathbb {Z}(\bar {\beta },V_{\beta })} \mathbf {1}(\beta _{i} \in \mathbb {R}^{k}_{c}) p(\bar {\beta },V_{\beta }) \right )d(\bar {\beta },V_{\beta })\) of the posterior, which gives rise to the proportionality, we are left with the intractable expression \({\mathbb {Z}(\bar {\beta },V_{\beta })}\). This expression normalizes the multivariate normal density to the region of support defined by \(\mathbb {R}^{k}_{c}\) and cannot be dropped because it depends on the target parameters \(\bar {\beta }\) and Vβ.

As a consequence of truncation, we lose the convenience of conditionally conjugate updates of the hierarchical prior parameters \(\bar {\beta }\) and Vβ, regardless of what subjective prior distributions we employ. More generally, all estimation and sampling techniques that require the evaluation of the conditional “likelihood” \(p(\{\beta _{i}\} | \bar {\beta },V_{\beta }) ={\prod }_{i=1}^{N}\frac {\varphi (\beta _{i} |\bar {\beta },V_{\beta })}{\mathbb {Z}(\bar {\beta },V_{\beta })}\), including standard Metropolis-Hastings sampling, are hamstrung by the intractability of \({\mathbb {Z}(\bar {\beta },V_{\beta })}\). Boatwright et al. (1999) propose to numerically approximate \({\mathbb {Z}(\bar {\beta },V_{\beta })}\) at each MCMC iteration using the GHK algorithm (Hajivassiliou et al. 1996). While this seems reasonable in their application, which involves sign constraints on at most four parameters in a model with five parameters in total, numerical approximations become problematic in the high-dimensional parameter spaces, potentially involving a multiplicity of constraints, that have become common in applications more recently.
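To see the dependence of \(\mathbb {Z}(\bar {\beta },V_{\beta })\) on the target parameters directly, one can approximate Eq. 6 by simulation for a simple negative-orthant constraint region; the two-dimensional example and all values below are purely illustrative.

```python
import numpy as np

def Z(beta_bar, V, n=200_000, seed=0):
    """Monte Carlo approximation of Eq. 6 for the constraint region beta < 0
    (negative orthant): the normal mass falling inside the truncation region."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(beta_bar, V, size=n)
    return np.mean(np.all(draws < 0.0, axis=1))

V = np.eye(2)
z_neg = Z(np.array([-2.0, -2.0]), V)   # most normal mass inside the region
z_pos = Z(np.array([2.0, 2.0]), V)     # almost no mass inside the region
```

Because Z moves with \(\bar{\beta}\) and Vβ, it cannot be cancelled in Eq. 7; every evaluation of the conditional likelihood would require a fresh, and in high dimensions expensive, approximation of exactly this quantity.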

The log-normal hierarchical prior avoids this difficulty. The basic idea of using log-normal distributions to implement sign and order constraints is not new. For example, Allenby et al. (2014) use the exponential transformation, \(\beta _{p} = -\exp (\beta _{p}^{\ast })\) with \(\beta _{p}^{\ast } \in \mathbb {R}\) and distributed according to a hierarchical normal mixture prior, to enforce that the model has zero support for positive price coefficients. In this specification, the problem is that \(\beta _{p}^{\ast }\) is measured on the log scale and standard diffuse subjective prior settings imply absurdly large and small values of transformed coefficients βp (e.g., Allenby et al. 2014).

Thus, the problem is how to specify differentially informative subjective priors for constrained and unconstrained coefficients. The standard Normal-Inverse-Wishart (NIW) subjective prior for the means and covariance matrix in the hierarchical prior distribution is limited in this regard, mostly because the prior concentration of the IW prior is controlled by a single parameter (the prior degrees of freedom, also known as the prior shape).

Next, we present a solution to this problem that re-parameterizes the hierarchical prior. Our contributions in this context are, first, a marginal-conditional decomposition of the hierarchical prior distribution that enables the analyst to be differentially informative, a priori, about the distribution of constrained and unconstrained parameters in the population, and second, the generalization of the pre-tuning of proposal densities in Rossi et al. (2005) to this hierarchical prior.

The proposed marginal-conditional decomposition becomes essential whenever the hierarchical prior comprises both constrained and unconstrained parameters such as e.g., in simple hierarchical choice models that feature brand coefficients and a price coefficient. The proposed generalization of pre-tuned proposal densities (Rossi et al. 2005) is particularly important in high dimensional models that feature a multiplicity of constraints.

3.1 Marginal-conditional decomposition

Our hierarchical prior starts from a standard normal specification. Unconstrained coefficients have a normal hierarchical prior, while sign and order constraints are imposed through exponential transformations of normal variates, resulting in log-normally distributed coefficients. Vice versa, we can log-transform from the sign and order constrained parameters that enter the likelihood to unconstrained, a priori conditionally normally distributed variates. We formulate subjective priors over this unconstrained space, but use a marginal-conditional decomposition to implement vastly different subjective priors for parameters that are exponentiated and those that are not.

We denote \(g:\mathbb {R}^{k} \rightarrow \mathbb {R}_{c}^{k}\) as the function that maps normally distributed variates \(\beta _{i}^{\ast }\) to sign and order constrained coefficients βi that enter multinomial likelihoods explaining individual choice data yi. We distinguish kc “constrained” coefficients \(\beta _{i}^{\ast c}\), i.e., coefficients to be transformed to obey sign and order constraints, and kuc unconstrained coefficients \(\beta _{i}^{\ast uc}\) in the hierarchical prior.

$$ \begin{array}{@{}rcl@{}} y_{i} \mid g(\beta_{i}^{\ast}) &\sim& MNL\left( y_{i} \mid g(\beta_{i}^{\ast})\right)\\ \beta_{i}^{\ast} &\sim& N\left( \bar{\beta}^{\ast},V_{\beta^{\ast}}\right),\ \text{or}\\ \left( \begin{array}{c} \beta_{i}^{\ast c} \\ \beta_{i}^{\ast uc} \end{array}\right) &\sim& N\left( \left( \begin{array}{c} \mu_{c}^{\ast} \\ \mu_{uc}^{\ast} \end{array}\right), \left( \begin{array}{cc} V_{\beta_{11}^{\ast}} & V_{\beta_{12}^{\ast}} \\ V_{\beta_{21}^{\ast}} & V_{\beta_{22}^{\ast}} \end{array}\right) \right) \end{array} $$
(8)

With the goal of formulating rather different subjective priors for the parameters governing the distribution of \(\beta _{i}^{\ast c}\) and \(\beta _{i}^{\ast uc}\), we re-express the multivariate normal distribution in Eq. 8 in the form of a multivariate regression model that regresses unconstrained coefficients \(\beta _{i}^{\ast uc}\) on “constrained” coefficients \(\beta _{i}^{\ast c}\):

$$ B^{\ast uc} = \left( \begin{array}{cc} \iota & B^{\ast c} \end{array}\right) \left( \begin{array}{c} z^{\prime} \\ {\Gamma} \end{array}\right) + U, \qquad vec(U^{\prime}) \sim N(0,I_{N} \otimes {\Sigma}) $$
(9)

Here, Buc and Bc are matrices with kuc and kc columns, respectively, and N rows each, collecting unconstrained and “constrained” coefficients from individuals in the sample, and ι is a (N × 1)-vector of 1’s; Γ is a (kc × kuc) matrix of regression coefficients, z a column vector of intercept coefficients of length kuc, and Σ is the (kuc × kuc) conditional variance-covariance of unconstrained coefficients in the population.

The first two moments of the distribution of “constrained” coefficients are obtained from yet another multivariate regression model that regresses “constrained” coefficients on a vector of constants:

$$ B^{\ast c} = \iota (\mu_{c}^{\ast})^{\prime} + U_{V^{\ast}} \qquad vec(U_{V^{\ast}}^{\prime}) \sim N(0,I_{N} \otimes V^{\ast}) $$
(10)

Here, ι is again a (N × 1)-vector of 1’s and \(V^{\ast }\) is the marginal variance-covariance matrix of “constrained” coefficients. The multivariate regression models in Eqs. 9 and 10 imply the following re-parameterization of the joint distribution of \(\beta ^{\ast }_{i}\) from Eq. 8:

$$ \beta^{\ast}_{i} \sim N\left( \left( \begin{array}{c} \mu_{c}^{\ast} \\ {\Gamma}^{\prime} \mu_{c}^{\ast} + z \end{array}\right), \left( \begin{array}{cc} V^{\ast} & V^{\ast} {\Gamma} \\ {\Gamma}^{\prime} (V^{\ast})^{\prime} & {\Gamma}^{\prime} V^{\ast} {\Gamma} + {\Sigma} \end{array}\right) \right) $$
(11)

The advantage of the re-parameterization in Eq. 11 relative to the more standard parameterization in Eq. 8 is that we can now specify arbitrarily informative subjective priors for the hierarchical prior distribution of “constrained” coefficients, i.e., for the parameters \(\mu _{c}^{\ast }\) and \(V^{\ast }\), without restricting the prior of unconstrained coefficients. That is, even if we a priori set \(V^{\ast }\) to a “small” covariance matrix, we can elect to be minimally informative about the distribution of unconstrained parameters through Σ. Coupled with weakly informative priors for Γ and z, neither the correlation between “constrained” and unconstrained coefficients nor the marginal mean of unconstrained coefficients is directly affected by informative prior specifications for \(\mu ^{*}_{c}\) and \(V^{\ast }\).
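The mapping from the marginal-conditional parameters back to the joint moments of Eq. 11 is a direct matrix computation. The sketch below assembles it for a hypothetical example with one constrained and one unconstrained coefficient; the function name and all values are ours, for illustration only.

```python
import numpy as np

def joint_moments(mu_c, V_star, z, Gamma, Sigma):
    """Assemble the joint mean and covariance of Eq. 11 from the
    marginal-conditional parameters (mu_c*, V*) and (z, Gamma, Sigma)."""
    mean = np.concatenate([mu_c, Gamma.T @ mu_c + z])
    top = np.hstack([V_star, V_star @ Gamma])
    bottom = np.hstack([Gamma.T @ V_star.T, Gamma.T @ V_star @ Gamma + Sigma])
    return mean, np.vstack([top, bottom])

# tiny illustrative example: one constrained, one unconstrained coefficient
mu_c = np.array([-1.0])        # mean of the "constrained" coefficient
V_star = np.array([[0.5]])     # informative, i.e. "small", marginal variance
z = np.array([2.0])            # intercept of the conditional regression
Gamma = np.array([[0.3]])      # dependence between the two blocks
Sigma = np.array([[4.0]])      # diffuse conditional variance, left untouched
mean, cov = joint_moments(mu_c, V_star, z, Gamma, Sigma)
```

Note that the informative choice of V* (0.5 here) leaves the conditional variance Σ of the unconstrained coefficient (4.0 here) untouched in the lower right block, which is exactly the point of the decomposition.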

However, the role of the prior on Γ in the implied prior for the covariance of unconstrained coefficients (see the lower right block of the covariance matrix in Eq. 11) requires additional discussion. A priori, an increasing number of constrained coefficients coupled with a diffuse prior on Γ implies a marginal prior for the variance of unconstrained coefficients that may appear to favor larger variances. In this context, it is important to keep in mind that the variance contribution through Γ operates through the covariance between “constrained” and unconstrained coefficients (see the upper right and lower left blocks of the covariance matrix in Eq. 11). Thus, the prior implication of large marginal variances of unconstrained coefficients stems from “mixing” over strong and qualitatively different (positive or negative) dependencies between constrained and unconstrained coefficients. However, strong dependence between “constrained” and unconstrained coefficients constitutes an extremely informative hierarchical prior, and such “mixing” is therefore not a possibility a posteriori, not even in small data sets. For example, even smallish data sets will enforce a choice between the two highly informative opposites of strong positive and strong negative dependence between a constrained and an unconstrained coefficient. In sum, large variances of unconstrained coefficients through Γ result a posteriori only from strong dependence between “constrained” and unconstrained coefficients as per the likelihood.

Before going into more detail about suggested subjective choices, we illustrate the problem of formulating sensible priors for constrained coefficients in the smallest possible example, where \(\beta _{i} = -\exp (\beta _{i}^{\ast })\) and \(\beta _{i}^{\ast } \sim N({\bar \beta }^{\ast },V_{\beta ^{\ast }})\). Here, the subjective prior is on the parameters \({\bar \beta }^{\ast }\) and \(V_{\beta ^{\ast }}\) in the normal distribution that generates \(\beta _{i}^{\ast }\). Under what is widely considered a weakly informative subjective prior setting for \({\bar \beta }^{\ast }\) and \(V_{\beta ^{\ast }}\), we obtain that a priori 25% of the constrained coefficients {βi} are larger than − .001, i.e., very close to zero, and another 25% are smaller than \(-10^{54}\) (see the right column in Table 1).

Table 1 Quantiles of marginal prior densities for a constrained coefficient with informative and standard weakly informative subjective priors

This concentration of mass in the tails of the prior is undesirable and counter to what one would expect from a weakly informative prior for βi. The prior for βi in the left column of Table 1 has lower (upper) quartiles of − 8.977 (− .113) and appears to be much more reasonable for, say, the population distribution of price coefficients in a heterogeneous multinomial logit model. However, this marginal prior distribution requires subjective priors for \({\bar \beta }^{\ast }\) and \(V_{\beta ^{\ast }}\), discussed next, that in most applications would be considered unduly informative as a prior for unconstrained coefficients where \(\beta _{i} = \beta _{i}^{\ast }\).
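Quantiles of this kind can be approximated by forward simulation from the marginal prior of βi. The one-dimensional sketch below uses our own illustrative stand-ins for an informative and a “barely proper” diffuse setting (a scaled inverse chi-square draw plays the role of the one-dimensional IW prior); it reproduces the qualitative pattern, not the exact numbers of Table 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

def quartiles_of_constrained_beta(mean_var, nu, scale):
    """Prior quartiles of beta = -exp(beta*) when beta*_bar ~ N(0, mean_var)
    and, in one dimension, V_beta* ~ IW(nu, scale) = scale / chi-square(nu)."""
    V = scale / rng.chisquare(nu, size=n)                  # 1-d inverse-Wishart draws
    beta_star = rng.standard_normal(n) * np.sqrt(mean_var + V)
    return np.quantile(-np.exp(beta_star), [0.25, 0.75])

# informative setting (values illustrative, in the spirit of the text)
q25_i, q75_i = quartiles_of_constrained_beta(mean_var=10.0, nu=16, scale=8.0)
# "barely proper" diffuse setting: heavy-tailed V pushes mass into both extremes
q25_d, q75_d = quartiles_of_constrained_beta(mean_var=100.0, nu=3, scale=3.0)
```

The diffuse setting piles prior mass both immediately below zero and at absurdly negative values, whereas the informative setting keeps both quartiles in an economically plausible range.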

We use the fully conjugate prior for (Γz,Σ), where \({\Gamma }_{z} := \left (z , {\Gamma }^{\prime } \right )^{\prime }\), and the conditionally conjugate prior for (\(\mu _{c}^{\ast },V^{\ast }\)):

$$ \begin{array}{@{}rcl@{}} p\left( {\Gamma}_{z},{\Sigma} \right) &=& p\left( {\Gamma}_{z}|{\Sigma} \right) p({\Sigma})\\ \gamma_{z}|{\Sigma} &\sim& N(\bar{\gamma}_{z},{\Sigma} \otimes A_{{\Gamma}_{z}}^{-1}),\ \gamma_{z} := vec\left( {\Gamma}_{z}\right)\\ {\Sigma} &\sim& IW(\nu_{\Sigma},\bar{\Sigma})\ \text{and}\\ p(\mu_{c}^{\ast},V^{\ast}) &=& p(\mu_{c}^{\ast}) p(V^{\ast}),\\ \mu_{c}^{\ast} &\sim& N(\bar{\mu}_{c}^{\ast}, A_{\mu_{c}^{\ast}}^{-1})\\ V^{\ast} &\sim& IW(\nu_{V^{\ast}},\bar{V}^{\ast}) \end{array} $$
(12)

The conditionally conjugate prior for (\(\mu _{c}^{\ast },V^{\ast }\)) enables the researcher to directly express prior beliefs about the distribution of “constrained” coefficients in the population. We set \(\bar {\mu }_{c}^{\ast } = \left (0, {\dots }, 0 \right )^{\prime }\), \(A_{\mu _{c}^{\ast }} = 0.1 I_{k_{c}}\), \(\nu _{V^{\ast }} = k_{c} + 15\) as well as \(\bar {V}^{\ast } = 0.5 \nu _{V^{\ast }} I_{k_{c}}\), where \(I_{k_{c}}\) is the identity matrix of dimension \(k_{c} \times k_{c}\) (cf. Allenby et al. 2014). In particular, the choice of prior degrees of freedom \(\nu _{V^{\ast }}\), i.e., the shape parameter in the IW prior for \(V^{\ast }\), would be considered unduly informative as a default value in a context with only unconstrained parameters. However, our marginal-conditional decomposition of the hierarchical prior enables the analyst to be arbitrarily informative about the hierarchical prior for “constrained” coefficients, essentially without affecting the marginal hierarchical prior for unconstrained coefficients.

The fully conjugate prior for (Γz,Σ) adjusts the influence of the subjective prior on Γz as a function of the conditional variance-covariance Σ, which is desirable in situations without much prior knowledge. We use standard weakly informative, “barely proper” priors for parameters in the conditional hierarchical prior of unconstrained coefficients, \(\bar {\gamma }_{z},\ A_{{\Gamma }_{z}},\ \nu _{\Sigma },\ \bar {\Sigma }\).

Our marginal-conditional decomposition corresponds to the directed acyclic graph in Fig. 1 which shows that the hierarchical prior for “constrained coefficients”, (\(\mu _{c}^{\ast },V^{\ast }\)), and that of unconstrained coefficients, (Γz,Σ), are independent conditional on draws of “constrained” coefficients, Bc. This conditional independence relationship gives rise to a Gibbs-sampler for the two-stage update of parameters in the hierarchical prior:

  1. \(\beta _{i}^{\ast }|(\mu _{c}^{\ast },V^{\ast }),({\Gamma }_{z},{\Sigma }),y_{i},\ i=1,\dots ,N\)

  2. \(\left \{{\Gamma }_{z},{\Sigma }\right \}| B^{\ast uc},B^{\ast c} \)

  3. \(\left \{\mu _{c}^{\ast },V^{\ast }\right \}| B^{\ast c}\)

Fig. 1: Marginal-conditional decomposition DAG

In step 1, we use a random walk Metropolis-Hastings (RW-MH) step to draw individual level parameters \(\left \{\beta _{i}^{\ast }\right \}\) based on multinomial logit likelihoods, similar to Rossi et al. (2005). However, as described in detail in Section 4, we need to account for the change of variables in \(g:\mathbb {R}^{k} \rightarrow \mathbb {R}_{c}^{k}\) when tuning the MH-proposal using information from the likelihood. In step 2, we use a Gibbs-sampler to update Γz and Σ, i.e., the parameters in a fully conjugate multivariate regression model, conditional on both “constrained” and unconstrained coefficients and subjective prior parameters (omitted for brevity). Step 3 employs another Gibbs-step to update (\(\mu _{c}^{\ast },V^{\ast }\)), i.e., the parameters in a conditionally conjugate multivariate regression model, conditional on “constrained” coefficients and subjective prior parameters. Appendix A.2 details the posterior distributions associated with steps two and three.
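As an illustration of step 3, the sketch below draws (\(\mu _{c}^{\ast },V^{\ast }\)) from standard normal and inverse-Wishart conditionals given a matrix of “constrained” coefficients. The helper names, the synthetic data, and the prior values are ours for illustration; the exact posterior expressions are those detailed in Appendix A.2.

```python
import numpy as np

rng = np.random.default_rng(3)

def riwish(nu, S):
    """One draw from IW(nu, S): invert a Wishart(nu, S^-1) draw."""
    L = np.linalg.cholesky(np.linalg.inv(S))
    Z = rng.standard_normal((S.shape[0], nu))
    return np.linalg.inv(L @ Z @ Z.T @ L.T)

def draw_mu_V(B_c, mu_bar, A_mu, nu_V, V_bar, V_current):
    """One Gibbs scan for (mu_c*, V*) given the (N x k_c) matrix B_c of
    "constrained" coefficients, under the priors of Eq. 12."""
    N, k = B_c.shape
    # mu_c* | V*, B_c: normal with precision A_mu + N (V*)^-1
    V_inv = np.linalg.inv(V_current)
    cov = np.linalg.inv(A_mu + N * V_inv)
    mean = cov @ (A_mu @ mu_bar + V_inv @ B_c.sum(axis=0))
    mu = rng.multivariate_normal(mean, cov)
    # V* | mu_c*, B_c: inverse Wishart with updated shape and scale
    R = B_c - mu
    V = riwish(nu_V + N, V_bar + R.T @ R)
    return mu, V

# smoke test on synthetic "constrained" coefficients (values illustrative)
true_mu = np.array([-1.0, 0.5])
B_c = rng.multivariate_normal(true_mu, np.eye(2), size=2000)
mu, V = np.zeros(2), np.eye(2)
for _ in range(20):
    mu, V = draw_mu_V(B_c, np.zeros(2), 0.1 * np.eye(2),
                      nu_V=17, V_bar=8.5 * np.eye(2), V_current=V)
```

Step 2 has the same structure but updates (Γz, Σ) from the fully conjugate multivariate regression of Buc on Bc.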

4 Efficient MH-sampling

Next we discuss efficient sampling of individual level part worth coefficients \(\{\beta _{i}^{\ast }\}\) based on pre-tuned proposal densities in a MH-sampler conditional on draws of hierarchical prior parameters (Rossi et al. 2005). Our algorithmic implementation is for a MNL model at the individual level, but the approach obviously generalizes to other likelihoods. The pre-tuning in Rossi et al. (2005) employs a normal approximation to the likelihood. The MNL-likelihood information about \(\{\beta _{i}\}\) can be computed in closed form. However, our hierarchical prior is on the distribution of \(\{\beta _{i}^{\ast }\}\); therefore, we need to account for the change-of-variables in \(g:\mathbb {R}^{k} \rightarrow \mathbb {R}_{c}^{k}\).

Following Rossi et al. (2005), we specify the proposal density of the RW-MH sampler as follows

$$ \beta_{i}^{\ast cand} \sim N\left( \beta_{i}^{\ast r },c^{2} \left( H^{\ast}_{i}+(V_{\beta^{\ast}}^{r})^{-1}\right)^{-1}\right), $$
(13)

where \(r\in \left \{1,\dots ,R\right \}\) is the r-th iteration of the MCMC chain, c denotes a fixed scaling factor and \(H^{\ast }_{i}\) is the Hessian information about \(\beta _{i}^{\ast }\) in individual i’s data, evaluated at the maximum of the following fractional likelihood:

$$ l_{i}^{\text{fract}}\left( \left\{y_{i}\right\}_{i=1}^{N}|g(\beta_{i}^{\ast})\right) = MNL\left( y_{i}|g(\beta_{i}^{\ast})\right)^{1-w}\ MNL\left( \left\{y_{i}\right\}_{i=1}^{N}|g(\beta_{i}^{\ast})\right)^{w (T_{i}/\bar{T})} $$
(14)

This fractional likelihood is defined as a w-weighted combination of the individual specific likelihood and the likelihood of a model that pools all observations, where Ti is the number of choice observations from individual i and \(\bar {T}\) is the total number of choices made by all individuals in the calibration sample.
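On the log scale, the fractional likelihood in Eq. 14 is simply a weighted sum of the individual and the pooled MNL log-likelihoods. The following sketch assumes a design array of shape (T, p, k) and is illustrative only; the helper names are not from the paper.

```python
import numpy as np

def mnl_loglik(beta, X, y):
    """Sum of MNL log choice probabilities; X: (T, p, k) designs, y: (T,) chosen indices."""
    u = X @ beta                                   # (T, p) utilities
    u = u - u.max(axis=1, keepdims=True)           # numerical stability
    logp = u - np.log(np.exp(u).sum(axis=1, keepdims=True))
    return logp[np.arange(len(y)), y].sum()

def fractional_loglik(beta, X_i, y_i, X_all, y_all, w, T_bar):
    """Log of Eq. 14: (1-w)-weighted individual log-likelihood plus
    w*(T_i/T_bar)-weighted pooled log-likelihood."""
    T_i = len(y_i)
    return (1 - w) * mnl_loglik(beta, X_i, y_i) \
        + w * (T_i / T_bar) * mnl_loglik(beta, X_all, y_all)
```

Setting w = 0 recovers the purely individual likelihood, which is a quick sanity check on the implementation.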

At the maximizing value \(\check {\beta _{i}}\) we can straightforwardly transform to \(\check {\beta _{i}}^{\ast }\) by standard maximum likelihood theory. We obtain the corresponding \(H_{i}^{\ast }\) in Eq. 15, taking advantage of the closed form expression for the information about βi, denoted Hi, from individual i’s choices in the MNL model, and accounting for the change of variables to a first order approximation (Footnote 7).

$$ H_{i}^{\ast} \approx \left( J_{g}\right)^{\prime} H_{i} J_{g} $$
(15)

Here Jg is the k × k Jacobian of the function \(g(\beta _{i}^{\ast })\) that maps normally distributed variates \(\beta _{i}^{\ast }\) to their sign- and order-constrained counterparts βi. Hi and Jg are evaluated at \(\check {\beta _{i}}\) and \(g^{-1}(\check {\beta _{i}})=\check {\beta _{i}}^{\ast }\), respectively, i.e., at the parameter value that maximizes the fractional likelihood in Eq. 14.
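For a transformation of the kind used later in Eq. 17 (log for sign constraints, cumulative log-differences for order constraints), the Jacobian is sparse and available in closed form, so Eq. 15 is cheap to evaluate. The sketch below assumes the five-coefficient layout of the simulation study; it is an illustration, not the paper's code.

```python
import numpy as np

def g(bs):
    """Map beta^* to constrained beta: [b+, b++, b_p, uc1, uc2]
    with b+ >= 0, b++ >= b+, b_p <= 0 (layout assumed from Eq. 17)."""
    return np.array([np.exp(bs[0]),
                     np.exp(bs[0]) + np.exp(bs[1]),
                     -np.exp(bs[2]),
                     bs[3], bs[4]])

def jacobian_g(bs):
    """Closed-form k x k Jacobian J_g of g at bs."""
    J = np.zeros((5, 5))
    J[0, 0] = np.exp(bs[0])
    J[1, 0], J[1, 1] = np.exp(bs[0]), np.exp(bs[1])
    J[2, 2] = -np.exp(bs[2])
    J[3, 3] = J[4, 4] = 1.0
    return J

def hessian_info_transform(H_beta, bs):
    """Eq. 15: pull the information about beta back to the beta^* scale,
    H_i^* ~= J_g' H_i J_g (first-order approximation)."""
    J = jacobian_g(bs)
    return J.T @ H_beta @ J
```

A finite-difference check of the analytic Jacobian against g is an easy way to guard against sign or ordering mistakes in the sparse pattern.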

Appendix A.4 illustrates the value of the proposed tuning in the MH-update of \(\beta _{i}^{\ast }\) in a small simulation that only involves the choices of one individual. We find that the proposed tuning results in a sampler that is on average about 3.7 times more efficient than one using a simpler and more standard tuning (see Table 15). We note that these differences can magnify substantially in a hierarchical setting.
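Putting Eqs. 13 and 15 together, forming one tuned candidate draw can be sketched as follows. The numerical values of \(H^{\ast}_{i}\), the prior covariance, and the scaling factor c below are illustrative assumptions (c is set to a common RW-MH scaling heuristic, not the paper's tuned choice).

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3
beta_star_r = np.zeros(k)                   # current draw beta_i^{*r}
H_star = np.diag([4.0, 2.0, 1.0])           # assumed Hessian information at the fractional-lik. max
V_inv = np.linalg.inv(0.5 * np.eye(k))      # (V_{beta^*}^r)^{-1}, assumed current draw
c = 2.38 / np.sqrt(k)                       # common RW-MH scaling heuristic

# Proposal covariance of Eq. 13 and one candidate draw.
cov = c**2 * np.linalg.inv(H_star + V_inv)
L = np.linalg.cholesky(cov)
beta_star_cand = beta_star_r + L @ rng.normal(size=k)
```

Because both \(H^{\ast}_{i}\) and the prior precision are positive definite, the proposal covariance always admits a Cholesky factor, so the draw is well defined for every individual.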

5 Simulation study

Next we illustrate the benefits of our proposed marginal-conditional decomposition in the presence of sign and order constraints using simulations. First, we compare prior distributions in the prototypical setting that combines constrained and unconstrained coefficients. Second, we analyze the posterior from simulated data under different priors and elaborate on the numerical properties of the proposed methodology.

5.1 Drawing from prior distributions

Consider a hypothetical setting with two attributes A1 and A2 at two levels L1 and L2 each, yielding four possible product configurations. Both levels of the first attribute provide positive utility to every consumer, and its second level is weakly preferred to the first, again by all consumers. To reflect these sign and order restrictions, we denote the respective coefficients as {β+,i} and {β++,i}, where i = 1,…,N indexes simulated consumers. Preferences for the levels of the second attribute are heterogeneous but without a uniform prior direction or ordering, such as, for example, preferences for colors or flavors in applications. We denote the respective coefficients as \(\{\beta _{uc_{1},i}\}\) and \(\{\beta _{uc_{2},i}\}\). The price coefficient is negative. We thus have the following set of constraints for every consumer i = 1,…,N:

$$ \begin{array}{@{}rcl@{}} \beta_{+,i},\beta_{++,i} &\geq& 0\\ \beta_{++,i} &\geq& \beta_{+,i}\\ \beta_{p,i} &\leq& 0 \end{array} $$
(16)

First, we compare (implied) marginal priors for coefficients \(\beta = g(\beta^{\ast})\) based on the marginal-conditional decomposition in Eq. 11 and a more standard parameterization (Eq. 8) coupled with the more informative subjective prior settings suggested in Allenby et al. (2014), who propose to adjust the standard weakly informative prior settings to k + 15 (from k + 5) prior degrees of freedom for the IW-prior (where k denotes the dimension of individual demand parameters) in the standard one-component model, and to set the diagonal elements in the prior scale matrix to 0.5 for constrained coefficients and to 1 for unconstrained coefficients. In addition, the subjective prior information for \(\bar {\beta }^{*}\) is increased to \(A_{\mu ^{\ast }} = .1\) (from .01).

However, as described before, the problem with the standard parameterization is that these more informative subjective settings now apply to both constrained, i.e., to be transformed, and to unconstrained coefficients. While these settings yield much more sensible priors for constrained coefficients, they may be unduly informative for unconstrained coefficients.

Figure 2 compares prior distributions based on R = 1,000,000 draws from the positively constrained marginal prior for β+ (left panel) and the unconstrained marginal prior for \(\beta _{uc_{1}}\) (right panel) (Footnote 8). In each panel of Fig. 2 the dashed density in orange is from our proposed marginal-conditional specification. The green dash-dotted density is the corresponding marginal prior from Allenby et al. (2014). The figure illustrates the benefit of our proposed parameterization: While the marginal priors for the constrained coefficient in the left panel are essentially identical, the standard parameterization coupled with the more informative settings discussed above implies a much more informative marginal prior for unconstrained coefficients than usual. At first sight, the comparison in the right panel of Fig. 2 seems to suggest that the standard parameterization coupled with the more informative settings from above simply implies less heterogeneity in \(\beta _{uc_{1}}\) a priori. However, it is important to realize that the increase in prior degrees of freedom in the IW prior will similarly fail to accommodate much more homogeneous markets than what is implied by the prior settings. In fact, it is the joint possibility of extremely homogeneous and extremely heterogeneous markets under our suggested prior that causes the pronounced peak at zero together with the fat, sub-exponential tails in the right panel of Fig. 2.

Fig. 2 Marginal prior distributions of β+ (left panel) and \(\beta _{uc_{1}}\) (right panel) using the marginal-conditional decomposition and the standard formulation

Next, we illustrate how the difference in subjective priors translates into different posteriors in the typical large N, small T setting.

5.2 Population distribution and data generation

We generate heterogeneous consumer preferences obeying sign and order constraints in Eq. 16 using the following transformation and distribution:

$$ \begin{array}{@{}rcl@{}} \beta^{\ast} &=& \left( \begin{array}{llllll} \beta^{\ast}_{+} \\ \beta^{\ast}_{++} \\ \beta^{\ast}_{p} \\ \beta^{\ast}_{uc_{1}} \\ \beta^{\ast}_{uc_{2}} \end{array}\right) = g^{-1}\left( \beta\right) = \left( \begin{array}{c} \ln\left( \beta_{+}\right) \\ \ln\left( \beta_{++}-\beta_{+}\right) \\ \ln\left( -\beta_{p}\right) \\ \beta_{uc_{1}} \\ \beta_{uc_{2}} \end{array}\right) \sim N\left( \bar{\beta}^{\ast}, V_{\beta^{\ast}}\right),\ \text{with}: \\ \bar{\beta}^{\ast} &=& \left( \begin{array}{ccccc} 0.5 & -0.5 & 0.8 & 2.5 & 2.5 \end{array}\right)^{\prime} \ \text{and} \\ V_{\beta^{\ast}} &=& \left( \begin{array}{llllll} 0.4 & 0.1 & 0 & 0 & 0 \\ 0.1 & 0.2 & -0.15 & 0 & 0 \\ 0 & -0.15 & 0.4 & -0.05 & 0.05 \\ 0 & 0 & -0.05 & 2 & 0 \\ 0 & 0 & 0.05 & 0 & 4 \end{array}\right) \end{array} $$
(17)
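Drawing the population preferences of Eq. 17 amounts to sampling β∗ from the stated multivariate normal and pushing the draws through g. A minimal sketch, using the data generating values from Eq. 17:

```python
import numpy as np

rng = np.random.default_rng(3)
beta_bar = np.array([0.5, -0.5, 0.8, 2.5, 2.5])
V = np.array([[0.4, 0.1, 0.0, 0.0, 0.0],
              [0.1, 0.2, -0.15, 0.0, 0.0],
              [0.0, -0.15, 0.4, -0.05, 0.05],
              [0.0, 0.0, -0.05, 2.0, 0.0],
              [0.0, 0.0, 0.05, 0.0, 4.0]])

B_star = rng.multivariate_normal(beta_bar, V, size=1000)

# Push draws through g: exponentials enforce the signs, the cumulative
# exponential enforces the ordering beta_++ >= beta_+.
beta_plus = np.exp(B_star[:, 0])
beta_pp = beta_plus + np.exp(B_star[:, 1])
beta_p = -np.exp(B_star[:, 2])
beta_uc = B_star[:, 3:]
```

By construction, every simulated consumer satisfies the constraints in Eq. 16, with no draws rejected.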

Table 2 summarizes the marginal distributions of data generating preferences in the population. Consumers have a decent preference for the two levels of A1 and are relatively price sensitive on average. Preferences for the two levels of A2 have the same expected value, but are more heterogeneous for the second level. Preferences for the first and second level of A1 correlate positively. Furthermore, consumers who prefer the second level of A1 are less price sensitive on average, \(Cov(\beta ^{\ast }_{++},\beta ^{\ast }_{p}) = -0.15\). Similarly, consumers who prefer the first level of A2 are less price sensitive while preferences for the second level correlate positively with the absolute value of the price coefficient.

Table 2 Summary of marginal distributions of data generating coefficients

We generate a sample of N = 1000 consumers with preferences \(\left \{\beta _{i}\right \}\) from this population distribution as input to generating discrete choice data Y. Each choice is from the full set of product alternatives at prices drawn independently from a uniform distribution with support \(\left [0.5,3\right ]\), plus an outside good. Consequently, there are p = 5 alternatives in each choice set. We fix the amount of individual level information at T = 4. Recall that many discrete choice studies in marketing barely reach one choice task per parameter to be estimated at the individual level. The sparse individual level data scenario assumed in this simulation is therefore representative of applications in practice.
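The data generation step can be sketched as follows. Only the price range, the outside good, T = 4, and p = 5 follow the text; the attribute dummies and toy sizes are hypothetical stand-ins for the actual design.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, p, k = 50, 4, 5, 4                # toy sizes; the study uses N = 1000, T = 4, p = 5

beta = rng.normal(size=(N, k))          # stand-in for draws from the population of Eq. 17
Y = np.zeros((N, T), dtype=int)
for i in range(N):
    for t in range(T):
        X = np.zeros((p, k))            # last row: outside good (all zeros)
        X[:p - 1, :k - 1] = rng.integers(0, 2, size=(p - 1, k - 1))  # hypothetical dummies
        X[:p - 1, -1] = rng.uniform(0.5, 3.0, size=p - 1)            # prices in [0.5, 3]
        u = X @ beta[i]
        prob = np.exp(u - u.max())
        prob /= prob.sum()              # MNL choice probabilities
        Y[i, t] = rng.choice(p, p=prob)
```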

We remove the column pertaining to the first level of A1 from the design matrix for identification (Footnote 9). Table 3 shows the mapping between data generating and identified parameters derived from the design matrix. Since we delete the first level of A1 from the design, it follows that \(\beta _{++}^{id}=\beta _{++}-\beta _{+}\), \(\beta _{p}^{id}=\beta _{p}\), \(\beta _{uc_{1}}^{id}= \beta _{uc_{1}} + \beta _{+}\) as well as \(\beta _{uc_{2}}^{id}= \beta _{uc_{2}} + \beta _{+}\).

Table 3 Mapping between data generating and estimated (identified) parameters illustrated in one choice set

5.3 Estimates of heterogeneity

Figure 3 illustrates the benefits of our proposed marginal-conditional decomposition of the hierarchical prior distribution (see Eqs. 9, 10 and 11) compared to the standard formulation (see Eq. 8) coupled with informative subjective prior settings (Allenby et al. 2014) using the example of the unconstrained coefficients \(\beta _{uc_{1}}^{id}\) and \(\beta _{uc_{2}}^{id}\) (Footnote 10).

Fig. 3 Posterior predictive population distributions of \(\beta _{uc_{1}}^{id}\) and \(\beta _{uc_{2}}^{id}\) using the marginal-conditional decomposition and the standard formulation (T = 4)

It is visually apparent that the standard parameterization (Eq. 8), which cannot but impose informative priors on both constrained and unconstrained parameters whenever constrained parameters require more informative priors, underestimates the amount of preference heterogeneity in the unconstrained coefficients (see the green dash-dotted densities in Fig. 3). Note that the bias from unduly informative priors on unconstrained coefficients is further amplified in the context of a mixture of normals prior, where fewer observational units contribute likelihood information about the amount of heterogeneity in each mixture component (see Section 6). Finally, Appendix A.5 reports MH-acceptance rates and MCMC trace plots for a qualitative gauge of the numerical performance of the proposed MCMC algorithm that relies on the marginal-conditional decomposition of parameters in the hierarchical prior.

6 Preferences for fresh hen’s eggs

Our first empirical application analyzes Nielsen data on purchases of fresh hen’s eggs by German households (see Kotschedoff and Pachali 2020). It illustrates the empirical relevance of the proposed marginal-conditional decomposition of the hierarchical prior. In Germany, eggs are differentiated in terms of animal welfare as summarized in Table 4.

Table 4 Main differences between egg breeding categories

Since 2004, EU regulations require labeling the breeding category on egg packages and printing a code on each single egg indicating origin and breeding category. Consumers associate the four breeding categories with different quality levels: battery eggs \(\precsim \) barn eggs \(\precsim \) free-range eggs \(\precsim \) organic eggs. In 1999 the EU decided that all member states must ban the production of battery eggs by 2012. Germany implemented the ban already in 2010. Kotschedoff and Pachali (2020) (KP) use this policy change to evaluate the effect of this increase in the minimum quality standard on consumer welfare. They use a sample of 6,961 households who purchased eggs at least four times in the period of 2008 to 2012 (Footnote 11).

The demand model in KP assumes that households have full information about the egg products offered by the ten retail chains included in the sample. Accordingly, household i’s indirect utility from egg product g in chain l at period t is

$$ U_{iglt} = \gamma_{i,g} + \alpha_{i} p_{glt} + \beta_{i} \mathbf{1}\{units_{g} = 6\} + \psi_{i,l} + \varepsilon_{iglt}, $$
(18)

where g ∈{Battery,Barn,Free-range,Organic} and \(l\in \{1,\dots ,10\}\). The indicator variable 1{unitsg = 6} denotes whether egg product g comes in a package of six instead of ten eggs. The price is given by pglt, and the mean utility of the outside option is normalized to zero, \(u_{i0lt} = 0\). The error terms εiglt are assumed to follow a type I extreme value distribution, as is standard in the literature.

KP state that flexible estimation of the retail chain preference coefficients \(\left \{\psi _{i,l}\right \}\) is particularly important in their demand specification, alleviating a potential bias from the full information assumption implicit in Eq. 18: It is crucial that retail chain preference coefficients become very negative—potentially approaching negative infinity—for those chains a household never or very infrequently purchased eggs from. If a retail chain is estimated to be extremely unattractive to a consumer, the egg prices charged at this chain will not affect this consumer’s egg purchasing decisions, independent of the consumer’s actual price knowledge set. In addition, KP rely on the inferred information about \(\left \{\psi _{i,l}\right \}\) when modeling competition among retail chains in a supply side model.

Here, we rely on the simplified demand framework in Eq. 18 to illustrate the benefits of our marginal-conditional decomposition model as developed in Section 3 (Footnote 12). The model is an example of the typical application featuring a mix of constrained and unconstrained coefficients in the context of a hierarchical model. While we cannot a priori constrain preferences for the retail chains and the battery egg taste coefficient, which measures preferences for battery eggs over the outside good, it seems meaningful and actually important to constrain the remaining parameters. This is because the amount of price variation across quality tiers in this data vastly exceeds the amount of temporal price variation within quality tiers. As a consequence, a household who is only observed to purchase the highest price alternative (organic eggs) could be rationalized as exhibiting positive preferences for high prices in a model without economically motivated constraints. Similarly, an unconstrained model could misleadingly rationalize the choice pattern of a household who only purchased the lowest price alternative (battery eggs) based on higher (direct utility) preferences for battery eggs than for qualitatively superior alternatives.

We thus employ the constraints summarized in Table 5. Preferences for the four different egg labels should satisfy the quality ordering implied by Table 4 to identify the price coefficient. Everything else equal, for example, a household should not be worse off consuming an organic egg instead of a battery egg. Furthermore, the coefficient for the smaller package size and the price coefficient are constrained to be negative.

Table 5 Restricted attributes and constraints imposed on levels

Table 6 provides an overview of the number of egg purchase incidents across households in the estimation sample. For most households, we observe a decent number of purchases, resulting in “positive degrees of freedom” at the individual level. The lack of individual level information motivating the use of a hierarchical model is due to the small amount of within quality tier price variation as compared to price variation across quality tiers.

Table 6 Distribution of the number of egg purchase incidents across N = 498 households used in the estimation sample

We compare our model (see Eqs. 9 to 11) to the standard formulation (see Eq. 8) coupled with the informative subjective prior advanced in Allenby et al. (2014). These authors propose a somewhat tighter IW-prior for the variance-covariance matrix in a three-component mixture of multivariate normals with prior degrees of freedom equal to k + 25 (where k is the dimensionality of the individual level model). In addition, they set the diagonal elements in the prior scale matrix to 0.5 for unconstrained coefficients and to 0.05 for constrained coefficients in each normal component. Note that we adjust the prior degrees of freedom to k + 40, accounting for the fact that, similar to KP, we rely on a five-component (instead of a three-component) mixture of normals model in the estimation below.

Compared to the informative subjective prior advanced in Allenby et al. (2014), our marginal-conditional decomposition of the hierarchical prior distribution enables the analyst to be differentially informative about the distribution of constrained and unconstrained parameters in the population a priori. We use the following subjective settings affecting the prior of constrained coefficients in every mixture component: \(\bar {\mu }_{c}^{\ast } = \left (\begin {array}{llllll} 0 & {\dots } & 0 \end {array}\right )^{\prime }\), \(A_{\mu _{c}^{\ast }} = 0.1 I_{k_{c}}\), \(\bar {V}^{\ast } = 0.05 \nu _{V^{\ast }} I_{k_{c}}\) as well as \(\nu _{V^{\ast }} = k_{c} + 40\), where \(I_{k_{c}}\) is the identity matrix of dimension \(k_{c} \times k_{c}\). However, in contrast to Allenby et al. (2014), we can elect to use standard weakly informative, “barely proper” priors for parameters in the conditional prior of unconstrained coefficients: \(\bar {\gamma }_{z} = \left (\begin {array}{llllll} 0 & {\dots } & 0 \end {array}\right )^{\prime }\), \(A_{{\Gamma }_{z}} = 0.01 I_{(k_{c}+1)}\), \(\bar {\Sigma } = \nu _{\Sigma } I_{k_{uc}}\) as well as \(\nu _{\Sigma } = k_{uc} + 5\).

To reduce computation time, we draw a random subsample of N = 498 households and estimate a model with a five-component mixture of normals prior under these two different subjective prior settings (Footnote 13).

Figure 4 shows posterior predictive population distributions for the (unconstrained) battery egg coefficient as well as the coefficient measuring preferences for retail chain 5 (Footnote 14). Both graphs in Fig. 4 confirm the finding from the simulation study in Section 5: By imposing an informative prior on all coefficients (which is really needed only for the constrained coefficients), the standard formulation results in the dash-dotted densities in green, which underestimate heterogeneity in these unconstrained coefficients. This is particularly apparent in the right panel of Fig. 4, where the marginal posterior from the standard parameterization of the hierarchical prior (see Eq. 8)—when coupled with informative subjective priors needed to “discipline” the distribution of constrained coefficients—fails to accommodate extremely negative preferences for retail chain 5 in the left tail.

Fig. 4 Posterior predictive population distributions using the marginal-conditional decomposition model and a standard model with informative priors for the battery egg coefficient (left panel) as well as the preference coefficient of the fifth retail chain (right panel)

Table 7 summarizes the variances of marginal posterior predictive densities of unconstrained coefficients and verifies that the differences across the two subjective prior specifications are substantial. Finally, Table 8 compares model fit based on the Newton-Raftery estimator of the log marginal likelihood. As one may expect, the indiscriminately informative specification in the standard prior parameterization (see Eq. 8) translates into inferior fit compared to the informative specification that selectively targets constrained coefficients, facilitated by the marginal-conditional decomposition in Eqs. 9 to 11.

Table 7 Variances of marginal posterior densities of unconstrained preference coefficients implied by the marginal-conditional decomposition model and a standard model with informative priors
Table 8 Comparison of log marginal likelihood values across model specifications

7 Tablet PC preferences

Our second empirical application uses data from a commercial discrete-choice conjoint study investigating demand for tablet PCs (“tablets”). Here, we focus on the drawbacks of relying on individual level posterior means \(\hat {\beta }_{i}={\int \limits } \beta _{i} p(\beta _{i}|Y,y_{i})d\beta _{i}\) for market simulation (as defined in Section 2), and estimate the implied losses in profits when relying on this method for decision-making. For estimation, we rely on the marginal-conditional decomposition of the hierarchical prior (see Section 3). We show how using posterior means translates into systematic overestimation of preferences for sign- and order-constrained attribute levels. Finally, we show empirically how relying on individual level posterior means reduces sign and order violations in the absence of a theoretically constrained hierarchical prior—arguably a major reason for the popularity of this approach in practice.

Table 9 lists the tablet attributes and attribute levels included in this study. Overall, there are fourteen attributes including a seven level brand attribute. Because of the commercial origin of the data, brand names are disguised. A total of N = 1046 respondents participated in this study.

Table 9 Attributes and levels in the tablet experiment

Each respondent evaluated thirteen choice sets (T = 13), indicating which if any of the tablets offered in a choice set the respondent would purchase. Each choice set featured three tablets, and an unspecified outside option. Respondents selected the outside or no-buy option in about a quarter (26.6%) of the observed 1,046 × 13 = 13,598 choices. Thus, this is a representative example of the type of high-dimensional “large N, small T” studies that have become the standard in industry applications.

The original goal of this study was to help optimize brand A’s product design given a fixed set of competitor offerings. As is typical of industry-grade discrete-choice conjoint studies, the number of parameters at the individual level (36 coefficients after imposing identification constraints) by far exceeds the number of individual level observations. As a consequence, a hierarchical model is required, the hierarchical prior’s specification becomes critically important, and—in the likely scenario of heterogeneous preferences—individual level posterior distributions will reflect large amounts of posterior uncertainty about a specific respondent’s preferences.

In combination with the ordinal nature of many of the attributes in this study, a standard hierarchical prior specification leads to questionable results. For instance, Fig. 11 and Table 17 in Appendix A.6 showcase that posterior predictive distributions from an unconstrained hierarchical prior specification coupled with weakly informative subjective priors (e.g., Rossi et al. 2005) clearly violate basic economic intuition. The cash back attribute refers to the amount of money a customer receives after purchase upon submitting the sales receipt to the manufacturer. According to Table 17 (Appendix A.6), more than 25% of draws from the posterior of the hierarchical prior imply that consumers dislike tablets with larger amounts of cash back. Perhaps even more problematic, the posterior of the hierarchical prior suggests that consumers in the market prefer a tablet with 100€ cash back over the same tablet with 150€ cash back (as indicated by the stochastic dominance of 100€ cash back across all quantiles of the marginal posterior predictive distribution). In a market simulation, this could give rise to the odd outcome that tablets with smaller levels of cash back will be offered at higher prices, everything else equal. Finally, Table 18 (Appendix A.6) shows that the collection of individual level posterior means cuts the support for negative preferences for, e.g., 50€ cash back by about 50%. Recall that what may appear as a benefit here is the consequence of measuring heterogeneity inconsistently. These observations call for a diligently constrained hierarchical prior distribution of heterogeneity in the population.

The majority of attributes and levels in Table 9 are such that one can expect every respondent to strictly prefer one level over another, everything else equal. Table 10 collects all ordinal and sign constraints we thus impose in the hierarchical prior distribution, based on (direct) utility considerations. We constrain preferences for eleven of the fourteen attributes. We do not impose constraints on brand, operating system, and display size. Although some brands may be preferred on average, it would be wrong to impose the average preference ordering on every respondent; the same holds for operating systems. Display size may appear as an ordinal attribute at first, but is not once the inconvenience of larger displays in some usage situations, or when transporting the tablet, is taken into account. As a consequence, we face a mix of constrained and unconstrained coefficients that we argue is characteristic of most applications of hierarchical models, at least in marketing and economics. We leverage the marginal-conditional decomposition of the hierarchical prior distribution developed in Section 3 to specify suitable subjective priors.

Table 10 Restricted attributes and constraints imposed on levels

We run the MCMC sampler using the tuned random walk proposal from Section 4 for R = 500,000 iterations and keep every 50th draw. We then discard the first 8,000 of the retained draws as burn-in and perform our analysis based on the remaining 2,000 draws from the converged posterior distribution. We assess convergence by inspecting time-series plots of draws, both at the level of individual respondents and in the hierarchical prior. Here, we only report results for a model with a fully parametric, one-component hierarchical prior (Footnote 15).

Figure 5 visually compares the marginal posterior predictive population densities of coefficients measuring preferences for levels of the cash back attribute (Footnote 16). The utility of the level ’no cash back’ is normalized to zero for identification, and individual preferences for 50€, 100€, and 150€ cash back are obtained as \(\beta _{\text {CB}_{50},i}=exp(\beta _{\text {CB}_{50},i}^{\ast })\), \(\beta _{\text {CB}_{100},i}=\beta _{\text {CB}_{50},i}+exp(\beta _{\text {CB}_{100},i}^{\ast })\), and \(\beta _{\text {CB}_{150},i}=\beta _{\text {CB}_{100},i}+exp(\beta _{\text {CB}_{150},i}^{\ast })\), respectively. This way, the coefficient measuring the preference for 50€ relative to no cash back is constrained to be positive, and coefficients associated with more cash back are constrained to be weakly larger than those associated with less cash back.
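This cumulative-exponential construction guarantees the cash back ordering by design, as a quick numerical check illustrates (the draws below are arbitrary stand-ins for posterior draws of \(\beta^{\ast}\)):

```python
import numpy as np

rng = np.random.default_rng(5)
b_star = rng.normal(size=(10000, 3))   # stand-ins for (beta*_CB50, beta*_CB100, beta*_CB150)

# Cumulative exponentials: positive preference for 50 EUR, and weakly
# increasing preferences for 100 EUR and 150 EUR cash back.
cb50 = np.exp(b_star[:, 0])
cb100 = cb50 + np.exp(b_star[:, 1])
cb150 = cb100 + np.exp(b_star[:, 2])
```

No draw can violate the sign or order constraints, regardless of the values of \(\beta^{\ast}\).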

Fig. 5 Posterior predictive population densities for the levels of the cash back attribute using posterior means and the posterior of the hierarchical prior

The upper left panel of Fig. 5 shows inferred population preference distributions for 50€ cash back relative to no cash back (the dash-dotted density in red). If one imposes the constraints we use here and characterizes population preferences using individual level posterior means, the dashed blue density results. Because the constraints induce skewed population preference distributions, individual level posterior means measure both mean preferences and heterogeneity in the population inconsistently. The mode is biased in the direction of the distribution’s skewness, i.e., in the direction of stronger preferences for 50€ cash back relative to the baseline. Compared to the population distribution implied by the posterior of the hierarchical prior, relying on the collection of individual level posterior means clearly underestimates the percentage of consumers with only weak preferences for 50€ cash back. The remaining two panels show how this bias persists, if not intensifies, for 100€ and 150€ cash back.

Figure 6 illustrates inferred population preference distributions for display sizes 8 and 10. We see—in line with the illustration in Appendix A.1—that the collection of individual level posterior means underestimates the degree of taste heterogeneity for these two display sizes.

Fig. 6 Posterior predictive population densities of display size 8 (left panel) and 10 (right panel) coefficients using posterior means and the posterior of the hierarchical prior

7.1 Predictive Performance and losses in profits

Next we illustrate the implications of these biases for predictive performance. We use the holdout log-likelihood (HLL) as a measure of how well the two forms of generalization predict the choices of holdout respondents, i.e., individuals who were not part of the estimation sample. While it is common to report hit probabilities and hit rates, holdout log-likelihoods are the adequate measure if the eventual target is the prediction of market shares. The holdout likelihood (HL) of individual \(h \in \left \{1,\dots ,H\right \}\) is defined as the probability of observing the choices \(y_{h} \in Y_{\text{hold}}\) implied by the model after fitting it to the training data Ytrain. When relying on the posterior of the hierarchical prior and the collection of individual level posterior means, the HL of individual h’s choices is defined as in Eqs. 19 and 20, respectively. In each case \(HLL(Y_{\text {hold}})={\sum }_{h=1}^{H} ln(HL(y_{h}))\).

$$ HL(y_{h}) = \int{}MNL\left( y_{h} | g(\beta_{h}^{\ast})\right) p\left( \beta_{h}^{\ast}|\bar{\beta}^{\ast},V_{\beta^{\ast}}\right) p\left( \bar{\beta}^{\ast},V_{\beta^{\ast}}|Y_{\text{train}}\right)\ d\left( \beta_{h}^{\ast},(\bar{\beta}^{\ast},V_{\beta^{\ast}})\right) $$
(19)
$$ HL(y_{h})=\frac{1}{N}\sum\limits_{i=1}^{N} MNL\left( y_{h}|\hat{\beta}_{i}\right). $$
(20)
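Both definitions reduce to averaging MNL choice probabilities over a set of coefficient draws: Eq. 20 averages over the N posterior means, while a Monte Carlo evaluation of Eq. 19 averages over draws of \(g(\beta_{h}^{\ast})\) simulated from the posterior of the hierarchical prior. A sketch, with illustrative helper names and toy shapes:

```python
import numpy as np

def mnl_prob(beta, X, y):
    """Product of MNL choice probabilities for one holdout individual;
    X: (T, p, k) designs, y: (T,) chosen alternative indices."""
    u = X @ beta
    u = u - u.max(axis=1, keepdims=True)
    P = np.exp(u)
    P = P / P.sum(axis=1, keepdims=True)
    return P[np.arange(len(y)), y].prod()

def hl_posterior_means(y_h, X_h, beta_hat):
    """Eq. 20: average the holdout likelihood over the N posterior means."""
    return np.mean([mnl_prob(b, X_h, y_h) for b in beta_hat])

def hl_hierarchical(y_h, X_h, beta_draws):
    """Monte Carlo version of Eq. 19: beta_draws are g(beta_h^*) draws
    simulated from the posterior of the hierarchical prior."""
    return np.mean([mnl_prob(b, X_h, y_h) for b in beta_draws])
```

The two estimators differ only in which set of coefficient vectors is averaged over; the mechanics of scoring a holdout individual are identical.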

We evaluate the predictive performance of the population preference distributions inferred from the collection of individual level posterior means and the posterior of the hierarchical prior using five-fold cross-validation. K-fold cross-validation is a common approach for comparing the predictive performance of different models (see e.g., Bishop 2006). We split the complete set of N = 1046 choice vectors randomly into five disjoint subsets of approximately the same size. \(Y_{\text {train}}^{k}\) and \(Y_{\text {hold}}^{k}\) denote the k-th training and holdout sample, containing the data from about 800 (4 folds) and 200 (1 fold) respondents, respectively. The cross-validation estimator for the holdout log-likelihood is defined as the average of the holdout log-likelihoods across the five disjoint holdout data sets (Bengio and Grandvalet 2004):

$$ \begin{array}{@{}rcl@{}} \text{CV}_{\text{HLL}}(Y) &=& \frac{1}{K} \sum\limits_{k=1}^{K} \sum\limits_{y_{h} \in Y_{\text{hold}}^{k}} \text{HLL}\left( A(Y_{\text{train}}^{k}),y_{h}\right)\\ &=& \frac{1}{K} \sum\limits_{k=1}^{K} \text{HLL}\left( A(Y_{\text{train}}^{k}),Y_{\text{hold}}^{k}\right), \end{array} $$
(21)
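The estimator in Eq. 21 can be sketched in a few lines. Here `fit` and `hll` are hypothetical placeholders standing in for the model re-estimation \(A(\cdot)\) and the holdout log-likelihood of Eqs. 19 and 20; the list-of-respondents data layout is likewise an assumption for the sketch.

```python
import numpy as np

def cv_hll(choice_data, fit, hll, K=5, seed=0):
    """Eq. 21: K-fold cross-validation estimate of the holdout log-likelihood.

    choice_data : list of per-respondent choice records
    fit(train)  : re-estimates the hierarchical model on a training fold, A(.)
    hll(m, hold): holdout log-likelihood sum_h ln HL(y_h) under model summary m
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(choice_data)), K)  # disjoint folds
    scores = []
    for k in range(K):
        hold = [choice_data[i] for i in folds[k]]
        train = [choice_data[i] for j in range(K) if j != k for i in folds[j]]
        scores.append(hll(fit(train), hold))      # HLL(A(Y_train^k), Y_hold^k)
    return float(np.mean(scores))
```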

\(\text {HLL}(A(Y_{\text {train}}^{k}),y_{h})\) denotes the predictive log-likelihood for holdout individual h in the k-th fold, computed conditional on the training data \(Y_{\text {train}}^{k}\) (see Eqs. 19 and 20). The computations always use the same hierarchical Bayes model re-estimated on the respective training data, summarized either by the collection of individual level posterior means or by the posterior of the hierarchical prior.

Table 11 summarizes the cross-validation results. A random guess for the choices of holdout respondents results in an average log-likelihood of − 3770 across our five folds of data. Thus, the hierarchical model yields a clear improvement over random predictions, regardless of how it is summarized for predictions of choices by new respondents. Comparing the two forms of generalization, the posterior of the hierarchical prior outperforms the collection of individual level posterior means not only on average but in every single fold.

Table 11 Predictive performance (holdout log-likelihoods, five-fold cross-validation) of different forms of generalization

Next we investigate the optimal product configuration for brand A. There are 460,800 possible product configurations for brand A in this study. To make this problem manageable in the context of varying cost scenarios, we assume that brand A fixes the levels of some attributes a priori: brand A only offers tablets with operating system A, an 8-inch display, no SD slot, a 32GB memory card, no smartphone synchronization, and 50€ cash back. These assumptions reduce the action space to 360 unique product configurations. For the market scenario, we assume that brands C, D, and G are already in the market (Table 12).

Table 12 Specification of products offered by brand A’s competitors

To capture differences between the optimal actions implied by the different approaches to generalizing to the market more broadly, we specify a grid of possible costs. The grid comprises 20 different cost settings and is constructed as follows. First, within each scenario, costs are assumed to be the same for the weakest level of each attribute. Within attributes, we assume that the cost difference between the baseline and (weakly) preferred levels is determined by a constant factor, i.e., \(c_{L2} = f \cdot c_{L1},\ c_{L3} = f \cdot 2 \cdot c_{L1},\ c_{L4} = f \cdot 3 \cdot c_{L1},\ \dots \), for the levels of a priori ordered attributes, where L1 denotes the least preferred level. We set f = 3 in this example and obtain 20 different scenarios by varying the cost \(c_{L1}\) of producing the least preferred levels of the ordinal attributes to be optimized.
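The cost grid just described can be reproduced as a short sketch. The specific grid of baseline costs \(c_{L1}\) below is illustrative (the text reports only five of the twenty scenarios in detail), while the level-cost rule and f = 3 follow the text.

```python
import numpy as np

def level_costs(c_L1, n_levels, f=3):
    """Costs for one a priori ordered attribute: the least preferred level
    costs c_L1, and the (m+1)-th level costs m * f * c_L1, with f = 3."""
    return np.array([c_L1] + [m * f * c_L1 for m in range(1, n_levels)])

# 20 cost scenarios generated by varying the baseline cost c_L1;
# the grid values below are illustrative, not the paper's.
scenarios = [level_costs(c_L1, n_levels=4) for c_L1 in np.linspace(0.5, 10.0, 20)]
```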

Table 13 summarizes the distribution of product-specific costs across the 360 product configurations for the first, fifth, tenth, fifteenth, and twentieth cost scenarios. As can be seen, the grid includes both small and large absolute cost differences. In the first cost scenario, it is straightforward for brand A to offer a tablet combining the most attractive levels of the attributes to be optimized, i.e., high resolution, 128GB, 2.2 GHz, 8–12 hours battery, WLAN + LTE (4G), and a value pack. As cost differences between attribute levels increase, it becomes less and less profitable to offer this high quality combination of attributes. For each cost scenario, we compute the expected loss caused by relying on a suboptimal form of generalization.

Table 13 Minimum, mean and maximum of product-specific costs illustrated for five cost scenarios

Table 14 summarizes the distribution of brand A’s expected percentage losses across cost scenarios incurred by relying on the collection of individual level posterior means, relative to the optimal actions \(a_{hp}\) based on the posterior of the hierarchical prior. We find that optimization results that rely on the collection of individual level posterior means to represent market preferences are clearly inferior; the average percentage loss of 6.68% from relying on them is substantial.

Table 14 Percentage losses from using posterior means across cost scenarios relative to optimal actions from the posterior of hierarchical prior
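For concreteness, the percentage loss reported in Table 14 can be written as a one-line function. Here `expected_profit` is a hypothetical stand-in for the profit simulator evaluated under the posterior of the hierarchical prior; the function name and interface are our own.

```python
def percentage_loss(expected_profit, a_hp, a_pm):
    """Expected percentage loss from implementing the action a_pm (chosen via
    the collection of individual level posterior means) instead of the optimal
    action a_hp, both evaluated under the expected profit implied by the
    posterior of the hierarchical prior."""
    return 100.0 * (expected_profit(a_hp) - expected_profit(a_pm)) / expected_profit(a_hp)
```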

8 Discussion

Models of consumer heterogeneity play a pivotal role in marketing and economics. Typical applications are random coefficients or mixed logit models for aggregate or panel data and hierarchical Bayesian models. Historically, statistical efficiency or computational arguments have motivated the choice of heterogeneity model (e.g., Allenby and Ginter 1995 and Lenk et al. 1996). However, what can be learned about and subsequently extrapolated from the inferred heterogeneity distribution is limited by functional form assumptions, such as the assumption of multivariate normally distributed preferences. For example, consistent estimates of the first and second moments and correlations in the heterogeneity distribution, all of which can be accomplished based on a multivariate normal prior, will fail to translate into useful market simulators when the true distribution is highly non-normal, e.g., highly asymmetric.

Various semi-parametric formulations have been advanced to overcome the often unrealistic assumptions about higher moments inherent in the multivariate normal prior (e.g., Lenk and DeSarbo 2000, Li and Ansari 2014, and Rossi 2014). The additional flexibility afforded by semi-parametric formulations is an important step towards more faithful prior population formulations. However, if, as is usual, the parametric component of a semi-parametric model provides full prior support for all coefficients, the semi-parametric model should still be considered atheoretical and thus mis-specified from an economic point of view. For example, a mixture of normals a priori supports positive price coefficients, and this support vanishes a posteriori only in limiting cases of little practical relevance.

The problem with standard, statistically motivated prior population distributions has long been recognized in the academic literature (see the pioneering contribution by Boatwright et al. 1999), but no general solution has emerged. Recently, Allenby et al. (2014) introduced an informative subjective prior specification for log-normal hierarchical priors. These priors are easier to implement than the truncated normal in Boatwright et al. (1999), but require the analyst to depart from the standard weakly informative subjective prior settings in hierarchical models (e.g., Rossi et al. 2005). In the common situation where the heterogeneity distribution comprises both constrained and unconstrained coefficients (e.g., brand and price coefficients), the choice of subjective prior parameters is an unresolved problem for which this paper proposes a solution.

The contribution of this paper is a marginal-conditional decomposition of the population distribution that allows researchers to be informative about constrained parameters, on a logarithmic scale, while retaining maximal flexibility regarding the (conditional) hierarchical prior of unconstrained coefficients. The suggested specification is easily implemented and the additional computational effort is minimal.

Our specification becomes essential whenever the heterogeneity distribution comprises both constrained and unconstrained coefficients, such as in heterogeneous or mixed choice models that feature brand coefficients and a price coefficient. Finally, we show how to tune individual level proposal densities for numerically efficient MCMC inference in the presence of sign- and order-constraints. This generalization of pre-tuned proposal densities (Rossi et al. 2005) is particularly important in high dimensional models that feature a multiplicity of constraints.

We thus overcome the choice between a mis-specified heterogeneity distribution and the common ad hoc use of the collection of individual level posterior means, which fails to measure heterogeneity consistently. The marginal-conditional decomposition developed in this paper facilitates the formulation of more economically faithful heterogeneity distributions based on prior constraints, broadening the applicability of hierarchically formulated choice and demand models in marketing and economics.

An aspect of the subjective prior for order-constrained coefficients that we have not explored in this paper, but plan to investigate in future research, is that of prior scale differences and dependence between the coefficients of an ordinally constrained attribute. It is easy to verify by simulation that prior scale differences and dependence can be used to express structured beliefs about heterogeneity in ordinal preferences. For example, the population could be heterogeneous in its valuation of a lower level of an ordinal attribute but relatively homogeneous in its incremental preference for the next higher level. Alternatively, the population could exhibit substantial heterogeneity in the incremental valuation of the next higher level. Finally, the amount of heterogeneity in the increment could be correlated with the valuation of the lower level, such that low, medium, or high valuations of the lower level co-occur with relatively more heterogeneity in the incremental valuation of the higher level.

Last but not least, it could be interesting to compare (a mixture of) multivariate truncated normal distributions to the log-normal prior formulation used in this paper. The recently proposed exchange algorithm can handle the “double intractability” caused by the intractable normalization constant of the truncated multivariate normal (Møller et al. 2006; Murray et al. 2006; see Kosyakova et al. 2020 for a recent adaptation of the exchange algorithm in marketing).