1 Introduction

Non-response is a major problem for anyone collecting and processing data, such as National Statistical Institutes (NSIs). When left untreated, non-response can lead to biased estimates or invalid results from statistical analyses. Non-response can be subdivided into item non-response, where some values from otherwise observed units are missing, and unit non-response, where entire units are not observed.

A commonly used technique to deal with missing data is imputation (see, e.g., Rubin 1987, Schafer 1997, Little and Rubin 2002, De Waal et al. 2011 and Van Buuren 2012). In imputation, missing values are estimated and filled into the dataset. Imputation is used particularly often for item non-response. It is sometimes also used for unit non-response, although weighting is a more common technique for correcting for unit non-response.

Imputation can become challenging if the variable to be imputed has to comply with a known total. This situation often occurs when the total has been published before. Deviation from an earlier published total is deemed undesirable by many NSIs, as it may lead to confusion among users due to conflicting results for the same phenomenon. Even more challenging is the case where several variables in the same dataset need to be imputed. In addition to the known totals, there can be so-called edit rules (or edits for short) that have to be satisfied by the data. An example of an edit is that a baby cannot have completed primary school.

We illustrate the problem with an example. Statistics Netherlands publishes information on the highest educational level attained. In the Netherlands, the first results on the highest educational level attained are based on weighting the so-called Educational Attainment File (EAF). Later, information on the highest educational level attained is combined with other data sources to construct a virtual population census, i.e. a population census that is mainly based on administrative data covering the entire population. In the case of the Dutch Population Census, all variables except highest educational level attained and occupation are based on administrative data with full coverage. After construction of the virtual population census, we can break down information on the highest educational level attained into detailed groups of the population by using the background information available in the census. To facilitate the estimation process, highest educational level attained is mass imputed in the virtual population census, i.e. it is imputed for all population units for which no value has been observed. In that way, a complete dataset for the entire Dutch population is constructed, which can be used for multiple estimation purposes. Daalmans (2017) proposes a method for mass imputation of highest educational level attained based on logistic regression that can be used for the Dutch Population Census. However, the results for highest educational level attained based on the population census will deviate from the earlier published results based on weighting the EAF if standard imputation techniques, such as logistic regression, are used. Besides highest educational level attained, other variables, in particular occupation, have missing values and need to be imputed.

As far as we are aware, only one method has thus far been proposed in the literature that allows one to impute categorical data with missing values for multiple variables such that previously published totals are preserved and specified edits are satisfied (De Waal et al. 2017). However, that method is very time-consuming, and can only be applied to relatively small problem instances. In some cases, the method also has to “backtrack”, i.e. a previously imputed variable may need to be imputed in a different way. As noted by De Waal et al. (2017), this would lead to an even more time-consuming and extremely complicated process.

In the current paper, we propose an imputation approach that can be used for a broad class of imputation methods for multivariate categorical data such that previously published totals are preserved by the imputed data while edits are satisfied, and that can handle much larger problem instances than the approach by De Waal et al. (2017). This imputation approach is based on adding a calibration step to standard imputation techniques. It can be used in combination with any imputation model that estimates imputation probabilities, i.e. the probability that imputation of a certain category for a variable in a certain unit leads to the correct value for this variable and unit. Examples of imputation models that estimate such imputation probabilities are multinomial models and logistic regression models.

Our proposed imputation approach generalizes an approach by Favre et al. (2005). In their approach, only one categorical variable is to be imputed subject to edits and known totals. We generalize this to the multivariate case, where multiple categorical variables have to be imputed subject to edits and known totals. We achieve this by adopting a fully conditional specification approach (see Subsect. 3.2) that takes the previously published totals into account, in combination with a Fellegi–Holt approach to satisfy all edits (see Fellegi and Holt 1976 and Subsect. 3.1 of the current paper). Whereas Favre et al. (2005) approximate imputation probabilities given that known totals have to be preserved in only one way, we also examine two alternative approximations (see Subsect. 3.3). In this paper, we will assume that all population units will be imputed, i.e. that mass imputation is used, and that therefore no weighting is necessary to obtain estimates for population totals.

Section 2 of this paper first discusses the approach developed by Favre et al. (2005) for a single categorical variable with missing data. Section 3 discusses our proposed generalization to multivariate categorical missing data. Section 4 describes the evaluation study that we carried out to assess the proposed generalization, while Sect. 5 examines the results of this study. Section 6 examines the estimation of imputation variance by means of a pseudo-population bootstrap approach. Section 7 concludes the paper with a short discussion.

2 Approach by Favre, Matei and Tillé for univariate missing data

The approach of Favre et al. (2005) for imputation of univariate missing categorical data subject to edits and known totals consists of four steps. In the first step, user-specified edits are used to find structural zeroes for the variable to be imputed, i.e. for each record in the dataset the categories that are not allowed according to the observed values in combination with the specified edits are determined. In the second step, the imputation probabilities of the allowed categories are estimated for each record. In the implementation of Favre et al. (2005) this is done by assuming a multinomial logistic model, taking the structural zeroes into account. In the third step, these probabilities are calibrated so that, for each category, they sum up to the corresponding known total and, for each record, to one. This is achieved by using iterative proportional fitting (IPF). In the fourth step, Cox's controlled rounding algorithm (see Cox 1987) is used to fix, per record, one of the probabilities for the allowed categories to one and the probabilities of the other allowed categories to zero. The category for which the probability is set to one in a certain record is imputed in that record.

We illustrate the basic ideas of the approach by Favre et al. (2005) by means of an example. In this example one variable with three categories is to be imputed in eight units. In the first step of the approach, the observed values for the other variables are filled into the edits to find the structural zeroes for the variable to be imputed. For convenience, we assume that this step has already been carried out for all units in the dataset.

The units to be imputed are given in Table 1. The known totals are given in the last row.

Table 1 Units to be imputed and totals per category

The cells with a “*” are allowed to be imputed. The zeroes in units one, three, five, six and eight in Table 1 denote values that are not allowed to be imputed due to the specified edits, i.e. the structural zeroes. As explained above, imputation means that in each unit one of the “*” cells is replaced by a one, and the other “*” cells by zeroes.

The value to be imputed in a certain unit is essentially found by randomly drawing one of the allowed values using the imputation probabilities obtained from the imputation model. We assume that, taking into account that certain categories are not allowed in certain units, the imputation probabilities obtained by the imputation model are as in Table 2 (Step 2).

Table 2 Imputation probabilities

In Table 2, we see, for instance, that, according to the assumed imputation model, the probability that the actual value of the variable to be imputed in unit one equals \({c}_{2}\) is 0.4. In Step 3 of their approach, Favre et al. (2005) apply IPF to the imputation probabilities in Table 2 so that each row sums up to one and each column to the corresponding total. This step is needed to take the known totals into account. The adjusted probabilities are given in Table 3, together with the totals per category in the last row.

Table 3 Adjusted imputation probabilities and totals per category

Using IPF to adjust the imputation probabilities actually only leads to an approximation for the exact imputation probabilities that take known totals and edits into account. In general, it is very complicated and/or time-consuming to compute the exact probabilities. We will return to this point in Subsect. 3.3.
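To make Step 3 concrete, the following is a minimal sketch of the IPF calibration in R (the language of our implementation, see Sect. 4.1). The probability matrix and totals below are illustrative values, not the actual entries of Table 2; structural zeroes are encoded as 0 and remain 0 throughout. The sketch assumes that the known totals sum to the number of records to be imputed.

```r
# Calibrate a matrix P of imputation probabilities (rows: records, columns:
# categories) so that each row sums to one and each column to its known total.
ipf_calibrate <- function(P, col_totals, tol = 1e-10, max_iter = 1000) {
  for (iter in seq_len(max_iter)) {
    P_old <- P
    P <- P / rowSums(P)                              # rows: scale to one
    P <- sweep(P, 2, col_totals / colSums(P), `*`)   # columns: scale to totals
    if (max(abs(P - P_old)) < tol) break
  }
  P
}

# Illustrative example with three records and known totals (1, 1, 1);
# the 0 in the first row is a structural zero and stays zero.
P <- rbind(c(0.6, 0.4, 0.0),
           c(0.2, 0.5, 0.3),
           c(0.3, 0.3, 0.4))
round(ipf_calibrate(P, col_totals = c(1, 1, 1)), 3)
```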

In Step 4 of their approach, Favre et al. (2005) apply Cox's controlled rounding algorithm (Cox 1987) to ensure that each record with a missing value for the target variable is imputed and the known totals are preserved. Cox's controlled rounding algorithm is a stochastic procedure. In the rounding algorithm, a rounding base \(b\) is specified. All entries in the table, including the marginal totals, are rounded to integer multiples of \(b\). In the approach by Favre et al. (2005), the rounding base \(b\) equals 1. Since all internal entries are (adjusted) imputation probabilities that lie between zero and one, all internal entries are rounded to either zero or one. Entries that already are multiples of \(b\) are not changed by Cox's algorithm. This implies that the marginal totals, which are non-negative integers, are not changed by Cox's method. An appealing property of Cox's controlled rounding algorithm is that it preserves additivity of the table. That is, the rounded internal entries sum up to the rounded marginal totals. This implies that, since the entries in each row sum up to one, exactly one entry per row will be rounded to one, and will hence be imputed. All other entries in a row will be set to zero. Another appealing property of Cox's controlled rounding algorithm is that it is unbiased. That is, if we were to repeat the rounding algorithm an infinite number of times, the averages of the rounded cell values over the repetitions would be equal to the original unrounded cell values, i.e. the adjusted imputation probabilities given in Table 3. In other words, the categories that are imputed are drawn according to the adjusted imputation probabilities in Table 3. If we apply Cox's controlled rounding algorithm to Table 3, we may, for instance, obtain Table 4.

Table 4 Imputed units

3 Generalization to multiple variables with missing data

Our generalization of the approach of Favre et al. (2005) to multiple categorical variables with missing data consists of two different phases:

  • A start-up phase where we impute all variables for the first time. This phase serves two different purposes. First, it gives us “rough” imputations, which are later improved upon in the second phase. Second, and more importantly, the imputed dataset after this phase satisfies all edits. A dataset that satisfies all edits is a prerequisite for the method in the second phase, which achieves consistency with the totals. In the start-up phase we do not preserve known totals yet.

  • The actual imputation phase where we iteratively re-impute all data that were originally missing. This phase also serves two purposes. First, in this phase we improve the imputations after the first phase. Second, in this phase we preserve known totals by calibrating the imputations to these totals.

3.1 The start-up phase

In the start-up phase we use sequential imputation, i.e. we impute each record in turn and within each record we impute each variable in turn. For each record with missing values, we apply the following steps for each variable to be imputed.

1S. For the current variable to be imputed, we use a – usually rather simple – imputation model, for instance, a multinomial imputation model involving only a few auxiliary variables, to estimate imputation probabilities for its categories.

2S. For the current variable to be imputed, we derive all allowable categories, given the observed and already imputed variables in the record under consideration and the edits.

3S. We impute the current variable to be imputed by drawing categories using the imputation probabilities from Step 1S until we draw an allowable category.
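As an illustration, the following R sketch shows Steps 1S–3S for a single record. The helper allowed_categories() is hypothetical and stands for the Fellegi–Holt elimination procedure of Step 2S, which is discussed in detail below.

```r
# Steps 1S-3S for one record and one target variable:
#   start_probs: named vector of marginal imputation probabilities (Step 1S);
#   allowed_categories(): hypothetical helper implementing Step 2S.
impute_startup <- function(record, target, start_probs, edits) {
  allowed <- allowed_categories(record, target, edits)        # Step 2S
  repeat {                                                    # Step 3S: draw until
    draw <- sample(names(start_probs), 1, prob = start_probs) # an allowable
    if (draw %in% allowed) break                              # category comes up
  }
  record[[target]] <- draw
  record
}
```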

Step 2S is a fundamental step as it ensures that after the start-up phase we will have a fully imputed dataset satisfying all specified edits, which we need as a starting point for the actual imputation phase. It is also a non-trivial step, and may in fact be the most complicated step of our entire imputation approach from a technical point of view.

We first illustrate the problem with sequentially imputing data that have to satisfy edits. Suppose we have a dataset with three variables: Marital status, Age and Relation to head of household. The possible values of Marital status are “Married”, “Unmarried”, “Divorced” and “Widowed”, of Age “< 16 years” and “≥ 16 years”, and of Relation to head of household “Spouse”, “Child” and “Other”. Suppose we have two user-specified edits: the first edit says that someone who is less than 16 years old cannot be married, and the second that someone who is not married cannot be the spouse of the head of household. The first edit involves the variables Age and Marital status, and the second edit the variables Marital status and Relation to head of household. Now suppose that both Marital status and Age are missing in a certain record, the observed value of Relation to head of household in that record is “Spouse”, and Age is the current variable to be imputed. Neither of the two user-specified edits involves both the observed variable, Relation to head of household, and the current variable to be imputed, Age. That is, neither of these two edits prevents us from imputing the value “< 16 years” for Age. However, if we were to do that, we would later notice – while trying to impute Marital status – that there is no value for Marital status that satisfies both edits, since a person younger than 16 years cannot be married, while someone who is a spouse of the head of household has to be married. So, the problem is that, while imputing the current variable, we have to take into account edits involving variables to be imputed later.

We will now sketch the procedure in Step 2S to overcome this problem and then illustrate this procedure by means of the above example. First of all, we fill in the values of the observed and already imputed variables (if any) in the record under consideration into the user-specified edits. This leads to a set of edits for the remaining variables to be imputed. The main idea of the approach in Step 2S is to eliminate all remaining variables to be imputed except the current variable to be imputed from this set of edits by means of the Fellegi–Holt elimination method (see Appendix A). When a variable is eliminated by means of the Fellegi–Holt elimination method, a new set of edits is obtained that has to be satisfied by all remaining variables to be imputed. By repeated elimination, we obtain a set of edits that only involves the current variable to be imputed. A set of edits for a single categorical variable simply defines a set of allowed values for that variable. So, from the derived set of edits for the current variable to be imputed we can immediately see the allowable values for that variable.

The Fellegi–Holt elimination method has the nice and very important property that the new set of edits for the remaining variables to be imputed, obtained after elimination of a variable, can be satisfied if and only if a value for the eliminated variable exists such that the set of edits before elimination can also be satisfied (for details and proofs see Fellegi and Holt 1976 and De Waal and Quere 2003). By repeated application of this property, we find that if and only if the current variable to be imputed is imputed such that the set of edits obtained after elimination of all other variables remaining to be imputed is satisfied, all eliminated variables can be imputed such that all user-specified edits are satisfied (again see De Waal and Quere 2003). In turn this implies that we can impute the variables with missing values in a record sequentially, i.e. one at a time, while still ensuring that all edits can be satisfied.

In mathematical logic, the Fellegi–Holt elimination method is known as (multivalent) resolution (see, e.g., Chandru and Hooker 1999 and Hooker 2000). Resolution can be used to check whether a set of propositions (the edits in our case) can be satisfied.

We now briefly illustrate the procedure in Step 2S by means of our example. We first introduce some notation. We denote the number of variables by \(n\). In the case of categorical data, an edit \(k\) is usually written in so-called normal form, i.e. as a Cartesian product \({F}_{1}^{k}\times {F}_{2}^{k}\times \dots \times {F}_{n}^{k}\) of non-empty sets \({F}_{s}^{k}\) \((s=1,2,\dots,n)\), meaning that if for a record with values \(\left({v}_{1},{v}_{2},\dots ,{v}_{n}\right)\) we have \({v}_{s}\in {F}_{s}^{k}\) for all \(s=1,2,\dots,n\), then the record fails edit \(k\); otherwise the record satisfies edit \(k\).

In normal form the edit saying that someone who is less than 16 years cannot be married can be written as

$$\left\{\text{Married}\right\}\times \left\{<16\;\text{years}\right\}\times \left\{\text{Spouse},\;\text{Child},\;\text{Other}\right\},$$
(1)

and the edit saying that someone who is not married cannot be the spouse of the head of household as

$$\left\{\text{Unmarried},\;\text{Divorced},\;\text{Widowed}\right\}\times \left\{<16\;\text{years},\;\ge 16\;\text{years}\right\}\times \left\{\text{Spouse}\right\}.$$
(2)

We fill in the value “Spouse” for Relation to head of household into edits (1) and (2) and obtain the edits

$$\left\{\text{Married}\right\}\times \left\{<16\;\text{years}\right\}$$
(3)

and

$$\left\{\text{Unmarried},\;\text{Divorced},\;\text{Widowed}\right\}\times \left\{<16\;\text{years},\;\ge 16\;\text{years}\right\}$$
(4)

for the variables to be imputed, Marital status and Age. (For notational convenience, in edits, we do not mention variables whose values have been substituted into a set of edits nor variables that have been eliminated).

Since Age is the current variable to be imputed in our example, we have to eliminate variable Marital status from (3) and (4). We obtain the edit

$$\left\{<16\;\text{years}\right\}$$
(5)

for variable Age (see Appendix A).

Edit (5) implies that only the value “ ≥ 16 years” is allowed to be imputed for variable Age. Indeed, if we later impute the value “Married” for Marital status, we obtain an imputed record satisfying both user-specified edits (1) and (2).
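The elimination step above can be written out in a few lines of R. The sketch below encodes edits (3) and (4) as named lists of failing value sets and, under the simplifying assumption that every edit explicitly lists a set for every remaining variable, generates the implied edits obtained by eliminating Marital status; it reproduces edit (5). For simplicity it scans all subsets of edits rather than only the minimal covering subsets, so dominated implied edits may also be produced.

```r
# Domains of the remaining variables and edits (3) and (4) in normal form;
# each edit lists, per variable, the set of values for which the edit fails.
domains <- list(Marital = c("Married", "Unmarried", "Divorced", "Widowed"),
                Age     = c("<16 years", ">=16 years"))
edits <- list(list(Marital = "Married", Age = "<16 years"),             # edit (3)
              list(Marital = c("Unmarried", "Divorced", "Widowed"),
                   Age     = c("<16 years", ">=16 years")))             # edit (4)

# Fellegi-Holt elimination of variable `var`: every subset of edits whose
# sets for `var` jointly cover its domain yields an implied edit, formed by
# intersecting the sets of the other variables; empty edits are discarded.
eliminate <- function(edits, var, domains) {
  implied <- list()
  n <- length(edits)
  for (bits in 1:(2^n - 1)) {                 # every non-empty subset of edits
    sel <- edits[as.logical(intToBits(bits)[1:n])]
    if (!all(domains[[var]] %in% unlist(lapply(sel, `[[`, var)))) next
    rest <- setdiff(names(domains), var)
    new_edit <- lapply(rest, function(v) Reduce(intersect, lapply(sel, `[[`, v)))
    names(new_edit) <- rest
    if (all(lengths(new_edit) > 0)) implied <- c(implied, list(new_edit))
  }
  implied
}

eliminate(edits, "Marital", domains)
# yields one implied edit, list(Age = "<16 years"), i.e. edit (5): only the
# value ">=16 years" may be imputed for Age.
```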

3.2 The actual imputation phase

In the actual imputation phase we iteratively (re-)impute all data that were originally missing, using the other (partly imputed) data as auxiliary data. We essentially apply a fully conditional specification (FCS) approach for imputation of multivariate missing data (see, e.g., Raghunathan et al. 2001, Van Buuren and Groothuis-Oudshoorn 2011 and Rubin 2003). In FCS, one specifies a separate conditional imputation model for each variable to be imputed. In principle, all other variables can be used as auxiliary variables in the imputation model for a certain variable. In the estimation process of the model parameters, as well as in the actual imputation process, previously imputed values of the auxiliary variables are used.

FCS is usually applied for Bayesian versions of multiple imputation, where one specifies a prior distribution for each imputation model, derives the posterior distribution given the prior distribution and the observed data, and then draws multiple imputations from the posterior distribution. In this paper, we will apply FCS in a frequentist context, i.e. without specifying prior distributions for the imputation models. Our approach is similar to the approach by Siddique and Belin (2008). However, whereas Siddique and Belin (2008) use a hot-deck imputation method without an explicit imputation model for each variable to be imputed, we will use multinomial imputation models in our simulation study.

In the actual imputation phase we start with the fully imputed dataset obtained after the start-up phase. We iteratively (re-)impute the data that were originally missing, where we impute each variable in turn. For each variable to be imputed we apply the following steps.

1I. For the current variable to be imputed, we use an imputation model, for instance, a multinomial imputation model with, in principle, all other variables as auxiliary variables, to estimate imputation probabilities for its categories.

2I. For each record in which the value of the current variable to be imputed was originally missing, we fill in the other data in that record (either observed or imputed) into the edits. This may lead to some structural zeroes, i.e. categories that are not allowed to be imputed, in some records for the current variable to be imputed.

3I. For the current variable to be imputed, we approximate the correct imputation probabilities per record, i.e. we use the probabilities from the imputation model of Step 1I and adjust them, taking structural zeroes and – if applicable – known totals into account. For each record to be imputed, taking structural zeroes into account simply amounts to setting imputation probabilities for those structural zeroes to zero and rescaling the other imputation probabilities so they sum up to one. Subsect. 3.3 discusses how to adjust imputation probabilities so they take known totals into account.

4I. If the totals of the current variable to be imputed are not known, we use the adjusted imputation probabilities from Step 3I to draw imputations. If the totals of the current variable to be imputed are known, we use Cox's controlled rounding algorithm to find the imputations.

We keep on iterating the above steps 1I to 4I for all variables until the distribution of the imputed data has converged to a final distribution.

Note that, in contrast to Step 2S of the start-up phase, Step 2I is quite simple. The reason why Step 2I is so simple from a technical point of view is that Step 2S of the start-up phase guarantees that after the start-up phase we have an imputed dataset satisfying all edits. Due to this, Step 2I in the actual imputation phase only has to ensure that we never impute a value for the current variable to be imputed that is not allowed according to the other data in that record and the edits.
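A compact sketch of one pass of the actual imputation phase may help to fix ideas. All helper functions here are hypothetical stand-ins for the components described above: fit_multinom() for Step 1I, structural_zeroes() for Step 2I, adjust_probs() for one of the approximations of Subsect. 3.3 (Step 3I) and cox_round() for Cox's controlled rounding algorithm (Step 4I); miss records which cells were originally missing.

```r
# One iteration over all variables to be imputed (Steps 1I-4I); in our study
# this loop is repeated ten times.
for (target in vars_to_impute) {
  P <- fit_multinom(data, target, rows = miss[[target]])   # Step 1I
  P[structural_zeroes(data, target)] <- 0                  # Step 2I
  P <- P / rowSums(P)                                      # rescale rows to one
  if (target %in% names(known_totals)) {
    P <- adjust_probs(P, known_totals[[target]])           # Step 3I
    data[miss[[target]], target] <- cox_round(P)           # Step 4I: rounding
  } else {
    data[miss[[target]], target] <-                        # Step 4I: random draws
      apply(P, 1, function(p) sample(colnames(P), 1, prob = p))
  }
}
```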

3.3 Adjusting the imputation probabilities

As we already mentioned in Sect. 2, using IPF to adjust the imputation probabilities leads to an approximation for the exact imputation probabilities that take edits and known totals into account. In this section we discuss two other approximations. We assume that the imputation probabilities from the posited imputation model already take edits into account.

In principle, the exact imputation probabilities taking known totals into account can be found by enumerating all possibilities. To illustrate how, we consider Table 5 below. We suppose that we want to impute a variable \(x\) with three categories (\({c}_{1}\), \({c}_{2}\) and \({c}_{3}\)) in four units. We assume that there are no structural zeroes, that we know the totals of the categories, and that we have estimated imputation probabilities for the four units.

Table 5 Estimated imputation probabilities and known totals

The exact adjusted imputation probabilities are given by \(\mathrm{Pr}\left({x}_{i}=k|\text{known totals}\right)\), where \({x}_{i}\) is the value of \(x\) in unit \(i\) \((i=1,\dots ,4)\) and \(k={c}_{1}\), \({c}_{2}\) or \({c}_{3}\). If we were to disregard the totals, there would be \(3\times 3\times 3\times 3=81\) possible outcomes, and the exact imputation probabilities would be given by the values in Table 5. However, if we do take the known totals into account, there are only 12 possible outcomes. These are given in Table 6 below, together with their corresponding probabilities according to Table 5.

Table 6 Possible outcomes, given known totals

For instance, \(\mathrm{Pr}\left({\text{outcome}}=1\right)=\mathrm{Pr}\left({x}_{1}={c}_{1}\right)\times \mathrm{Pr}\left({x}_{2}={c}_{1}\right)\times \mathrm{Pr}\left({x}_{3}={c}_{2}\right)\times \mathrm{Pr}\left({x}_{4}={c}_{3}\right)\approx 0.0313\) according to Table 5. Using Table 6 we can easily calculate \(\mathrm{Pr}\left({x}_{i}=k|\text{known totals}\right)\) for \(i=1,\dots ,4\) and \(k={c}_{1}\), \({c}_{2}\) or \({c}_{3}\). For example,

$$\mathrm{Pr}\left({x}_{1}={c}_{1}|\text{known totals}\right)=\frac{0.0313+0.0078+0.0313+0.0156+0.0078+0.0156}{0.1445}\approx 0.7568,$$

where 0.1445 is the sum of all the probabilities over all 12 possible outcomes.
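Since Table 5 itself is not reproduced here, the following R sketch carries out the enumeration with illustrative probabilities. It generates all \(3^{4}=81\) assignments, keeps the 12 outcomes that match the known totals (two units imputed with \({c}_{1}\), one with \({c}_{2}\) and one with \({c}_{3}\), as in Table 6), and computes the exact conditional probability for unit 1.

```r
# Illustrative imputation probabilities (not the actual Table 5 entries);
# row i gives the probabilities of categories c1, c2, c3 for unit i.
P <- rbind(c(0.50, 0.25, 0.25),
           c(0.50, 0.25, 0.25),
           c(0.25, 0.50, 0.25),
           c(0.25, 0.25, 0.50))
totals <- c(2, 1, 1)                # known totals of c1, c2 and c3

grid  <- as.matrix(expand.grid(rep(list(1:3), 4)))   # all 81 assignments
valid <- apply(grid, 1, function(o) all(tabulate(o, 3) == totals))
grid  <- grid[valid, ]                               # the 12 feasible outcomes
w     <- apply(grid, 1, function(o) prod(P[cbind(1:4, o)]))

sum(w[grid[, 1] == 1]) / sum(w)     # exact Pr(x_1 = c_1 | known totals)
```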

In this very small example we can enumerate all possible outcomes. In general, enumeration is infeasible, and we have to resort to approximating the adjusted imputation probabilities. Besides the IPF approach as used by Favre et al. (2005), we have tested two alternative approximations in our simulation study.

The underlying idea of the alternative approximations is that in each unit in which the value of the current variable to be imputed is missing, we have to decide whether we are going to impute category \(c\) or not.

If we only consider a certain category \(c\) and assume that we can neglect the other categories, we would have to draw from a so-called Conditional Binomial (\({\text{CB}}\)) distribution (see, e.g., Chen and Liu 1997 and Chen 1998). The \({\text{CB}}\) distribution is closely related to the Poisson-Binomial (\({\text{PB}}\)) distribution. Suppose we have a dataset \(S\) with \(\left|S\right|\) units and let \({\varvec{Z}}=\left({Z}_{1},\dots ,{Z}_{\left|S\right|}\right)\) be the outcomes of \(\left|S\right|\) independent Bernoulli trials with probabilities \({p}_{1},\dots ,{p}_{\left|S\right|}\). The probability of a total of \(k\) successes in the \(\left|S\right|\) trials is then given by a \({\text{PB}}\) distribution, which we denote as \({\text{PB}}\left(k;\, {{\varvec{p}}}_{S}\right)\), where \({{\varvec{p}}}_{S}=\left({p}_{1},\dots ,{p}_{\left|S\right|}\right)\) is the vector of parameters of the \({\text{PB}}\) distribution. The conditional distribution of \({\varvec{Z}}\) given that \(\sum_{i\in S}{Z}_{i}=k\) is a \({\text{CB}}\) distribution with parameters \(k\) and \({{\varvec{p}}}_{S}\).

We will approximate the \({\text{CB}}\) distribution for our situation. Suppose there are \({m}_{x}\) units (numbered 1 to \({m}_{x}\)) in which the value of the variable \(x\) under consideration is missing, and we have to impute category \(c\) exactly \({k}_{c}\) times. Together these \({m}_{x}\) units form a set \({S}_{x}\). We consider a unit \(i\) in which the value of the variable under consideration is missing. The exact adjusted probability for imputing category \(c\) in unit \(i\) is given by

$${p}_{ic}^{*}=\frac{{p}_{ic}\times \mathrm{Pr}\left(\text{category } c \text{ is selected } ({k}_{c}-1) \text{ times in the remaining } ({m}_{x}-1) \text{ units}\right)}{\mathrm{Pr}\left(\text{category } c \text{ is selected } {k}_{c} \text{ times in } {m}_{x} \text{ units}\right)}$$

if \({k}_{c}\ne 0\), and \({p}_{ic}^{*}=0\) if \({k}_{c}=0\), where the \({p}_{ic}\) are the imputation probabilities without taking the known totals into account and the \({p}_{ic}^{*}\) are the adjusted imputation probabilities that do take the known totals into account, i.e. the \({p}_{ic}^{*}\) are the probabilities of our \({\text{CB}}\) distribution. That is,

$$\begin{aligned}{p}_{ic}^{*} &= \frac{{p}_{ic}\,\mathrm{PB}\left({k}_{c}-1;\,{{\varvec{p}}}_{{S}_{x}\backslash \{i\}}\right)}{{p}_{ic}\,\mathrm{PB}\left({k}_{c}-1;\,{{\varvec{p}}}_{{S}_{x}\backslash \{i\}}\right)+\left(1-{p}_{ic}\right)\mathrm{PB}\left({k}_{c};\,{{\varvec{p}}}_{{S}_{x}\backslash \{i\}}\right)}\\ &= \frac{{p}_{ic}}{{p}_{ic}+\left(1-{p}_{ic}\right)\dfrac{\mathrm{PB}\left({k}_{c};\,{{\varvec{p}}}_{{S}_{x}\backslash \{i\}}\right)}{\mathrm{PB}\left({k}_{c}-1;\,{{\varvec{p}}}_{{S}_{x}\backslash \{i\}}\right)}}\end{aligned}$$
(6)

In this paper we use simple approximations for the \({\text{PB}}\) distribution. As a reviewer pointed out, a \({\text{PB}}\) distribution can be computed exactly quite efficiently by means of a Fast Fourier Transform (Hong 2013), which is implemented in the “poisbinom” package. We have not used this exact approach for two reasons. A pragmatic reason is that even with an efficient approach to calculate a \({\text{PB}}\) distribution exactly, our simulation study – where we had to calculate or approximate a \({\text{PB}}\) distribution millions of times (see Sect. 4) – would likely take much computing time. Even with our simple approximations our simulation study took several months. A more important theoretical reason is that using a \({\text{CB}}\) distribution – and implicitly a \({\text{PB}}\) distribution – is itself an approximation, since the assumption that we can neglect the other categories when drawing a category is not completely correct: the estimated imputation probabilities and known totals for the other categories affect the adjusted imputation probabilities for the category \(c\) under consideration. For instance, using the exact \({\text{PB}}\) distribution in Eq. (6) gives \(\mathrm{Pr}\left({x}_{1}={c}_{1}|\text{known totals}\right)=0.7596\) for Table 5, which differs slightly from the correct value of 0.7568 obtained by enumeration above.

Our first simple approximation of \(\text{PB}({k}_{c};\, {{\varvec{p}}}_{{S}_{x}\backslash \{i\}})\) is by a Poisson distribution with mean \({\lambda }_{{S}_{x}\backslash \{i\}}=\sum_{t\in {S}_{x}\backslash \{i\}}{p}_{tc}\) (see, e.g., Chen et al. 1994, Chen and Liu 1997, Chen 1998 and Chen 2000). Using this approximation when \({k}_{c}\ne 0\), we get \(\frac{\text{PB}({k}_{c};\, {{\varvec{p}}}_{{S}_{x}\backslash \{i\}})}{\text{PB}({k}_{c}-1;\, {{\varvec{p}}}_{{S}_{x}\backslash \{i\}})}\approx \frac{{\lambda }_{{S}_{x}\backslash \{i\}}}{{k}_{c}}\).

A second, alternative approximation of \(\text{PB}({k}_{c};\, {{\varvec{p}}}_{{S}_{x}\backslash \{i\}})\) is by a binomial distribution with success probability \({p}_{{S}_{x}\backslash \{i\}}(c)=\sum_{t\in {S}_{x}\backslash \{i\}}{p}_{tc}/|{S}_{x}\backslash \{i\}|=\sum_{t\in {S}_{x}\backslash \{i\}}{p}_{tc}/({m}_{x}-1)\) for category \(c\) (see, e.g., Chen and Liu 1997). Using this approximation when \({k}_{c}\ne 0\), we get \(\frac{\text{PB}({k}_{c};\, {{\varvec{p}}}_{{S}_{x}\backslash \{i\}})}{\text{PB}({k}_{c}-1;\, {{\varvec{p}}}_{{S}_{x}\backslash \{i\}})}\approx \frac{{m}_{x}-({k}_{c}-1)}{{k}_{c}}\cdot \frac{{p}_{{S}_{x}\backslash \{i\}}(c)}{1-{p}_{{S}_{x}\backslash \{i\}}(c)}\).
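The following R sketch computes the adjusted probability \({p}_{ic}^{*}\) of Eq. (6), with the \({\text{PB}}\) ratio replaced by the Poisson or binomial approximation just described, or computed exactly with the density function from the “poisbinom” package mentioned above (assuming its dpoisbinom(x, pp) interface).

```r
# Adjusted probability p*_ic of Eq. (6): p is the vector of unadjusted
# probabilities p_1c, ..., p_mc for category c, i the unit at hand, and k_c
# the number of times category c has to be imputed.
adjust_prob <- function(p, i, k_c, method = c("poisson", "binomial", "exact")) {
  if (k_c == 0) return(0)
  method <- match.arg(method)
  p_rest <- p[-i]
  ratio <- switch(method,   # approximates PB(k_c; p_rest) / PB(k_c - 1; p_rest)
    poisson  = sum(p_rest) / k_c,
    binomial = {
      pbar <- mean(p_rest)
      (length(p) - (k_c - 1)) / k_c * pbar / (1 - pbar)
    },
    exact    = poisbinom::dpoisbinom(k_c, p_rest) /
               poisbinom::dpoisbinom(k_c - 1, p_rest))
  p[i] / (p[i] + (1 - p[i]) * ratio)
}
```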

4 Evaluation study

4.1 Implementation

We implemented our code in R (see R Core Team 2020). In the start-up phase, we simply used the observed fractions of the categories of each variable to be imputed as imputation probabilities, unadjusted for known totals. For instance, if 49% of the observed people are female and 51% male, then we impute “Female” with probability 0.49 and “Male” with probability 0.51 in variable Gender. In the imputation phase, we have used a more complicated multinomial model. For each variable we have used all other (partly imputed) variables as auxiliary variables while estimating the imputation probabilities according to this model. We have implemented the estimation of the imputation probabilities using the function “multinom” from the R package “nnet”. This function uses neural networks to fit multinomial models given a set of auxiliary variables to the available complete data, and thus obtains estimates for the imputation probabilities for a target variable (for more details, see Ripley 2020). In our imputation approach we need to handle edits. That is, we need to substitute values into edits and we need to derive implied edits by eliminating variables (see Sect. 3.2). For this we have used the R package “editrules” (see De Jonge and Van der Loo 2012). R code for Cox’s controlled rounding algorithm was kindly provided to us by our colleague Sander Scholtus (Statistics Netherlands). We have gratefully used this code in our implementation. Our R code is available on GitHub: https://github.com/tonwaal/Calibrated-Imputation-for-Multivariate-Categorical-Data.
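As an illustration of the estimation step, a call to “multinom” may look as follows (with hypothetical variable and object names; at this point in the FCS iteration all auxiliary variables are complete, either observed or imputed):

```r
library(nnet)

# Fit a multinomial model for Occupation on the records where it is not
# missing, using all other variables as auxiliary variables.
fit <- multinom(Occupation ~ ., data = dat[!is.na(dat$Occupation), ],
                trace = FALSE)

# Estimated imputation probabilities for the records to be imputed.
probs <- predict(fit, newdata = dat[is.na(dat$Occupation), ], type = "probs")
```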

4.2 Dataset

The dataset used for our simulation study is a small part of the Dutch Population Census 2001 published by Statistics Netherlands. The dataset contains 3,784 randomly drawn individuals with 12 categorical variables. The variables and the corresponding numbers of categories are: Gender (2 categories), Age (17 categories), Position in the household (8 categories), Size of the household (6 categories), Residential area last year (3 categories), Nationality (3 categories), Country of birth (3 categories), Educational level (7 categories), Economic status (8 categories), Occupation (10 categories), NACE code (13 categories) and Marital status (4 categories). Descriptions of the categories of these variables can be found in Appendix B. Appendix C contains a description of the user-specified edits. The dataset used is complete and does not contain any missing values for any of the units. We treated this dataset as our population. In our simulation study we introduced missingness into the data and mass imputed the missing values in the population.

In our study, we focused on the variables Educational level and Occupation, since these are the only variables in the Dutch Population Census that are (partly) based on surveys, rather than on administrative data covering (almost) the entire population, as is the case for the other variables. In the simulation study, we assumed that only the categories of the variable Educational level have to sum up to known totals for our population. This mimics the situation for the Dutch Population Census, where estimated totals for Educational level are known from the Educational Attainment File and mass imputation is planned to be used. The true totals of Educational level and Occupation are given in Tables 7 and 8, respectively.

Table 7 True totals of Educational level
Table 8 True totals of Occupation

4.3 The simulation study

For our simulation study we introduced missingness in individual data items in the dataset described in Subsect. 4.2. In general, three kinds of missing data mechanisms are distinguished in the literature: Missing Completely At Random (MCAR), Missing At Random (MAR) and Not Missing At Random (NMAR) (Rubin 1976). Roughly speaking, a missing data mechanism is MCAR, if the reason for a value being missing does not depend on the value itself, nor on values of background variables. A missing data mechanism is MAR, if the reason for a value being missing does not depend on the value itself, but does depend on values of background variables. A missing data mechanism is NMAR, if the reason for a value being missing depends on the value itself, even after correcting for the background variables. MCAR is the simplest case to deal with. Many imputation methods are able to correct for this situation. MAR is more complicated than MCAR. One can correct for a MAR mechanism by taking appropriate background variables into account in the imputation process. NMAR is the most difficult case by far. One can only correct for this case by relying on assumptions that cannot be tested from the dataset with missings itself.

In this paper we focus on MAR mechanisms, since these are the most important missingness mechanisms in practice. Sometimes we will also refer to results for MCAR mechanisms. Those results are given in the Supplementary Material to this paper. For more results for MCAR and NMAR mechanisms we refer to De Waal and Daalmans (2019).

In our simulation study, we examined two different fractions of missingness (referred to as Low and High) and three approximation methods for the correct imputation probabilities (the binomial approximation, the Poisson approximation and the IPF approximation).

In a preliminary study we found that the number of iterations in the actual imputation phase (see Sect. 3.2) affects the results as expected, and that after ten iterations our imputation approach has converged to (near) optimal results (see also De Waal and Daalmans 2019). Setting the number of iterations to ten appears to be a good trade-off between quality of the obtained results and the required computing time for our data. In our simulation study we therefore set the number of iterations in the actual imputation phase to ten.

We took into account that in practice values for some variables will be missing (far) more often than for other variables. For instance, values for Educational level and Occupation will be missing quite often in the Dutch Population Census, whereas values of, for example, Gender will be missing only very rarely.

The stochastic process to create missingness in a variable is independent of the missingness process for any of the other variables. For the MAR mechanisms, we examined situations where the missingness of Educational level or the missingness of Occupation depends on the age class of the person. In particular, we assumed that Educational level or Occupation is observed more often for people in a younger age class than for people in an older age class. For Educational level this reflects the current situation in the Netherlands, where the Educational level of younger people is more frequently available in administrative datasets than that of older people. We defined a “young” class consisting of people up to 29 years (categories 1 to 6 of Age), a “middle” class consisting of people from 30 up to 54 years (categories 7 to 11 of Age), and an “old” class consisting of people of 55 years and older (categories 12 to 17 of Age). The “young” class consists of 1,285 persons, the “middle” class of 1,714 persons, and the “old” class of 785 persons.

For the MAR mechanisms for Educational level, we assumed that the missing data mechanism for each of the other variables, including Occupation, is MCAR. Similarly, for the MAR mechanisms for Occupation, we assumed that the missing data mechanism for each of the other variables, including Educational level, is MCAR. The numbers of missings that we created for the MAR mechanism for Educational level for each variable in all three age classes for the Low missingness scenario are given in Table 9. For Educational level the missingness percentages are: 12.5% for the “young” class, 27.5% for the “middle” class and 40% for the “old” class. For High missingness, the numbers of missings are twice as high as in Table 9.

Table 9 Numbers of missings for the MAR mechanism for educational level and Low missingness

The numbers of missings that we created for the MAR mechanism for Occupation are the same as in Table 9, except that the number of missings for Educational level and Occupation are interchanged, i.e. the numbers of missings for Educational level are 321 for the “young” class, 429 for the “middle” class and 196 for the “old” class, and for Occupation 161 (12.5%) for the “young” class, 471 (27.5%) for the “middle” class and 314 (40%) for the “old” class. Again, for High missingness, the numbers of missings are twice as high.
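As an illustration, an age-dependent MAR pattern of this kind can be generated along the following lines (hypothetical object names; the rates are the Low missingness percentages given above):

```r
# MAR mechanism for Occupation, Low missingness: the probability of a value
# being missing depends on the age class, not on Occupation itself.
rates <- c(young = 0.125, middle = 0.275, old = 0.40)
missing <- runif(nrow(dat)) < rates[dat$age_class]  # age_class: "young"/"middle"/"old"
dat$Occupation[missing] <- NA
```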

For all four MAR missingness mechanisms (MAR for Educational level with Low and High missingness and MAR for Occupation with Low and High missingness) we generated 250 missing data patterns and deleted the corresponding values from our dataset.

As mentioned above, we will refer to some results for MCAR mechanisms in order to compare them to the results of the MAR missingness mechanisms. In our MCAR missingness mechanisms we created the number of missings given in the last column of Table 9 (“Total”) for the Low scenario and twice those numbers for the High scenario. The difference with the MAR missingness mechanisms is that for the MCAR missingness mechanisms missingness is not influenced by the value of Age.

4.4 Quality measures

Since we introduced missingness in a dataset with known values ourselves, we are able to compare the imputed values to the actual values. The main interest of NSIs is the production of high-quality descriptive statistics, such as totals and means. To measure the quality of our imputation approach, we therefore examine to what extent totals for Occupation are preserved. We also examine to what extent cell totals are preserved for the cross-table of Educational level and Occupation.

We denote the categories of a variable \(x\) by \(1,2,\dots ,{C}_{x}\), where \({C}_{x}\) is the number of categories of variable \(x\). For each category \(c\) \((c=1,2,\dots ,{C}_{x})\) of variable \(x\), we calculate two quality measures, \({B}_{x}(c)\) and \({M}_{x}(c)\), which are defined by

$${B}_{x}\left(c\right)=\frac{\sum_{s=1}^{S}\left({T}_{s\text{,imp}}\left(c;x\right)-{T}_{\text{true}}\left(c;x\right)\right)}{S}$$

and

$${M}_{x}\left(c\right)=\frac{\sum_{s=1}^{S}{\left({T}_{s,{\text{imp}}}\left(c;x\right)-{T}_{\text{true}}\left(c;x\right)\right)}^{2}}{S}$$

where \(S\) is the number of generated missing data patterns (250 in our case), \({T}_{s,\text{ imp}}\left(c;\, x\right)\) is the total of category \(c\) of variable \(x\) in the \(s\)-th imputed dataset \(\left(s=1,\dots ,S\right)\), and \({T}_{\text{true}}\left(c;\, x\right)\) is the corresponding total in the original, complete dataset. \({B}_{x}\left(c\right)\) is the empirical bias of the imputed total of category \(c\) of variable \(x\), and \({M}_{x}\left(c\right)\) its empirical mean squared error.

Similarly, for each combination of categories \(c\) \((c=1,2,\dots ,{C}_{x})\) and \(c^{\prime}\) \((c^{\prime}=1,2,\dots ,{C}_{y}\), with \({C}_{y}\) the number of categories of a variable \(y)\) of variables \(x\) and \(y\), respectively, we calculate two measures, \({B}_{x,y}(c,c^{\prime})\) and \({M}_{x,y}(c,c^{\prime})\), which are defined by

$${B}_{x,y}\left(c,c^{\prime}\right)=\frac{\sum_{s=1}^{S}\left({T}_{s,{\text{imp}}}\left(c,{c}^{\prime};\, x,y\right)-{T}_{\text{true}}\left(c,{c}^{\prime};\, x,y\right)\right)}{S}$$

and

$${M}_{x,y}\left(c,c^{\prime}\right)=\frac{\sum_{s=1}^{S}{\left({T}_{s,{\text{imp}}}\left(c,{c}^{\prime};\, x,y\right)-{T}_{\text{true}}\left(c,{c}^{\prime};\, x,y\right)\right)}^{2}}{S}$$

where \({T}_{s,{\text{imp}}}\left(c,{c}^{\prime};\, x,y\right)\) is the number of times that the combination of category \(c\) of variable \(x\) and category \(c^{\prime}\) of variable \(y\) occurs in the \(s\)-th imputed dataset, and \({T}_{\text{true}}\left(c,{c}^{\prime};\, x,y\right)\) is the number of times that the combination of category \(c\) of variable \(x\) and category \(c^{\prime}\) of variable \(y\) occurs in the original, complete dataset. We summarize \({B}_{x,y}\left(c,c^{\prime}\right)\) and \({M}_{x,y}\left(c,c^{\prime}\right)\) into two quality measures \({B}_{x,y}^{+}\) and \({M}_{x,y}^{+}\) defined by

$${B}_{x,y}^{+}=\frac{\sum_{c=1}^{{C}_{x}}\sum_{{c}^{\prime}=1}^{{C}_{y}}\left|{B}_{x,y}\left(c,c^{\prime}\right)\right|}{{C}_{x}{C}_{y}}$$

and

$${M}_{x,y}^{+}=\frac{\sum_{c=1}^{{C}_{x}}\sum_{{c}^{\prime}=1}^{{C}_{y}}{M}_{x,y}\left(c,c^{\prime}\right)}{{C}_{x}{C}_{y}}$$

In the summation for \({B}_{x,y}^{+}\), we take the absolute values of \({B}_{x,y}\left(c,c^{\prime}\right)\), since we do not want positive and negative values to cancel out. \({B}_{x,y}^{+}\) is the average absolute empirical bias over all cells in the cross-table of \(x\) and \(y\), and \({M}_{x,y}^{+}\) is the average empirical mean squared error over these cells.

Although generally considered to be less important by NSIs, we also look at how often the correct category is imputed for the missing values, i.e. the prediction accuracy of individual values. That is, for variable \(x\), we calculate

$${D}_{x}=\frac{\sum_{s=1}^{S}\sum_{i\in {\text{Miss}}_{x;\, s}}I({x}_{i,s,{\text{imp}}}={x}_{i,{\text{true}}})}{S}$$

where \({x}_{i,s,{\text{imp}}}\) is the value of variable \(x\) in unit \(i\) in the \(s\)-th imputed dataset, \({x}_{i,{\text{true}}}\) is the corresponding value in the original, complete dataset, \({\text{Miss}}_{x;\, s}\) is the set of units for which the value of variable \(x\) is missing in the \(s\)-th sample, and \(I\) is the indicator function, i.e. \(I\left({x}_{i,s,{\text{imp}}}={x}_{i,{\text{true}}}\right)=1\) if \({x}_{i,s,{\text{imp}}}={x}_{i,{\text{true}}}\) and \(I\left({x}_{i,s,{\text{imp}}}={x}_{i,{\text{true}}}\right)=0\) otherwise. We will summarize the results for \({D}_{x}\) into a single number \({D}^{+}\) defined by

$${D}^{+}=\sum_{x}{D}_{x}$$

where the summation runs over all 12 variables. For \({D}^{+}\), the higher its value, the better the quality of the imputations. For the other quality measures, the smaller their values, the better the quality of the imputations.
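Given the imputed totals over the \(S\) replications, these measures are straightforward to compute; a sketch follows, with T_imp a hypothetical \(S\times {C}_{x}\) matrix of imputed totals and T_true the vector of true totals.

```r
# Empirical bias and mean squared error per category of variable x.
diffs <- sweep(T_imp, 2, T_true)   # subtracts T_true(c) from each column c
B_x   <- colMeans(diffs)           # B_x(c)
M_x   <- colMeans(diffs^2)         # M_x(c)
sd_x  <- sqrt(M_x - B_x^2)         # standard deviation used in Sect. 5.1
```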

If using the binomial or the Poisson approximation for the exact imputation probabilities taking known totals has any effect on the quality of the imputations, this should most clearly be reflected in the \({D}^{+}\) measure, since the better the approximations of the imputation probabilities, the more missing values should be imputed correctly.

We have also compared estimated standard errors of the estimated totals for the categories of Occupation obtained by a bootstrap approach to their corresponding (approximate) true standard deviations. The bootstrap approach used to do this and the results thereof are described in Sect. 6.

5 Results

By design, our proposed imputation approach preserves totals for the categories of Educational level in all scenarios. Likewise, by design, our proposed imputation approach also satisfies specified edits.

5.1 Univariate results

We give univariate results for Occupation for the two MAR mechanisms – a MAR mechanism for Educational level and a MAR mechanism for Occupation – with Low missingness. The conclusions that we can draw for High missingness are similar to those for Low missingness. For other variables, the results are similar to those of Occupation (see De Waal and Daalmans 2019).

Table 10 presents univariate results for Occupation for the MAR mechanism in Educational level as well as for the MAR mechanism in Occupation for Low missingness.

Table 10 \({B}_{x}\left(c\right)\) and \({M}_{x}\left(c\right)\) for Occupation for MAR mechanisms with Low missingness

In Table 10, we see that \({B}_{x}\left(c\right)\) is quite low compared to the totals of the categories of Occupation in the complete dataset (see Table 8). Also, the standard deviations \({sd}_{x}(c)\), computed as \({sd}_{x}\left(c\right)=\sqrt{{M}_{x}\left(c\right)-{B}_{x}^{2}(c)}\), are quite low in Table 10, i.e. close to zero and much smaller than the totals of the categories of Occupation. So, univariate results for Occupation are preserved quite well for both MAR mechanisms.

5.2 Cross-tables

Table 11 gives the results for \({B}_{x,y}^{+}\) and \({M}_{x,y}^{+}\) for the cross-table of Educational level and Occupation for all four MAR mechanisms. In Appendix D we give results for \({B}_{x,y}\left(c,c^{\prime}\right)\) and \({M}_{x,y}\left(c,c^{\prime}\right)\) for the case of Low Missingness for each cell in the cross-table of Educational level and Occupation for our MCAR missingness mechanism and for the MAR missingness mechanisms for both Educational level and Occupation.

Table 11 Cross-table of educational level and occupation: \({B}_{x,y}^{+}\) and \({M}_{x,y}^{+}\) for MAR mechanisms

If using the binomial or Poisson approximation instead of the IPF approximation for the exact imputation probabilities taking known totals into account had any effect, we might see it in Table 11. However, we do not see such an effect, suggesting that all three approximations work about equally well.

The results for \({B}_{x,y}^{+}\) and \({M}_{x,y}^{+}\) in Table 11 for the MAR mechanisms are clearly higher than for MCAR mechanisms with the same number of missing values (see the Supplementary Material to this paper). This shows that cross-tables are clearly less well estimated for MAR mechanisms than for MCAR mechanisms. The results in Table 11 show that estimates for the cross-table of Educational level and Occupation are biased for our MAR mechanisms (see also Appendix D).

5.3 Number of correct imputations

As already mentioned, any effects of using the binomial or Poisson approximation instead of the IPF approximation should most clearly be reflected in the \({D}^{+}\) measure. However, we see no such effect in Table 12. This confirms the earlier finding that the IPF approximation works just as well as the binomial or Poisson approximation for our dataset.

Table 12 Results for \({D}^{+}\) for MAR mechanisms; in brackets the percentage of correct imputations for the total number of missings

The results for \({D}^{+}\) in Table 12 for the MAR mechanisms are clearly worse than for MCAR mechanisms with the same number of missing values (see the Supplementary Material to this paper). This shows that prediction accuracy for individual values decreases substantially for a MAR mechanism in comparison to an MCAR mechanism.

6 Variance estimation

Often an estimate – or at least a good indicator – for the variance of an estimator is considered very important, and that certainly holds true for estimates for a population census. In this section we therefore compare estimated standard errors for the estimated totals of the categories of Occupation obtained by a bootstrap approach to their (approximate) true standard deviations. Since we are dealing with a finite population that we mass impute we have used a pseudo-population bootstrap approach to estimate the standard errors.

As in the rest of this paper, in our pseudo-population bootstrap approach we assume that all units in the population are (at least partly) observed, so the inclusion probability of each population unit is one. However, in each record the values of some variables may be missing. That missingness is caused by a random missingness process, which in this section – for computational reasons – is based on the MCAR mechanism for the Low missingness scenario rather than on the MAR mechanisms used in the rest of this paper.

Our situation differs from the usual situation considered in the literature on estimating the variance of estimators based on imputed data in the sense that missingness occurs in all our variables, whereas in the literature missingness is usually assumed to occur in only one variable. Little seems to be known about applying a pseudo-population bootstrap approach when several variables contain missingness. This means that (the application of) our pseudo-population bootstrap approach is somewhat experimental.

Our pseudo-population bootstrap approach is similar to the approach in Scholtus and Daalmans (2021), and is as follows:

  • The starting point is our complete population dataset without any missing values.

  • For \(s=1,\dots ,{N}_{mis}\) we do the following:

    1. Introduce missingness in our population dataset by means of the MCAR missingness procedure for the Low missingness scenario, and thus create a version, \({Pop}_{mis,s}\), of our population dataset that contains missing values.

    2. Impute the missing values by means of our mass imputation approach. This leads to an estimated total \({T}_{s\text{,imp}}\left(c;\, {\text{Occupation}}\right)\) for each category \(c\) of Occupation.

    3. Select the set of completely observed units \({Pop}_{mis,com,s}\) in \({Pop}_{mis,s}\).

    4. Create one complete pseudo-population of the same size (3,784 units) as our original population. Suppose that \({Pop}_{mis,com,s}\) consists of \({n}_{s}\) completely observed units. Since we have used an MCAR missingness mechanism, each unit is equally likely to be completely observed. We therefore start by creating \(\left\lfloor {\frac{3784 }{{n_{s} }}} \right\rfloor\) copies of each unit in \({Pop}_{mis,com,s}\), where \(\left\lfloor q \right\rfloor\) denotes the largest integer less than or equal to \(q\). The total number of units created in this way is very likely to be less than 3,784. If so, we then randomly draw some extra units from \({Pop}_{mis,com,s}\), where each unit may be drawn at most once more, until we have created a pseudo-population \({Pseudo}_{s}\) with 3,784 complete units.

    5. For \(b=1,\dots ,B\) we do the following:

       a. Introduce missingness in \({Pseudo}_{s}\) by means of our MCAR missingness procedure, so we obtain a pseudo-population \({Pseudo}_{mis,s}\) with missing values.

       b. Impute \({Pseudo}_{mis,s}\) by means of our mass imputation approach.

       c. Calculate the total \({T}_{s\text{,pseudo}}\left(c;\, {\text{Occupation}}\right)\) for each category \(c\) of Occupation in the imputed version of \({Pseudo}_{mis,s}\).

    6. For each category \(c\) of Occupation, calculate the bootstrap variance \({\text{var}}_{s,boot}\left(c\right)={\left(B-1\right)}^{-1}\sum_{b=1}^{B}{\left({T}_{s\text{,pseudo}}\left(c;\, {\text{Occupation}}\right)-{\overline{T} }_{s\text{,pseudo}}\left(c;\, {\text{Occupation}}\right)\right)}^{2}\), with \({\overline{T} }_{s\text{,pseudo}}\left(c;\, {\text{Occupation}}\right)={B}^{-1}\sum_{b=1}^{B}{T}_{s\text{,pseudo}}\left(c;\, {\text{Occupation}}\right)\).

    7. For each category \(c\) of Occupation, estimate the standard error of the estimated total of category \(c\) of Occupation by \({se}_{s,boot}\left(c\right)=\sqrt{{\text{var}}_{s,boot}\left(c\right)}\).

In our simulation study we have used \({N}_{mis}=250\) and \(B=200\). As noted in the literature, for variance estimation, \(B=200\) bootstrap replicates are often considered sufficient (see Efron and Tibshirani 1993, Sect. 6.4).
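A condensed R sketch of one outer replication \(s\) of this procedure is given below; add_missingness(), mass_impute() and occupation_totals() are hypothetical stand-ins for our MCAR missingness procedure, our mass imputation approach and the computation of the totals of the categories of Occupation.

```r
# One outer replication s of the pseudo-population bootstrap (Steps 1-7).
boot_one_rep <- function(pop, B = 200) {
  pop_mis  <- add_missingness(pop)                      # Step 1
  T_imp    <- occupation_totals(mass_impute(pop_mis))   # Step 2
  complete <- pop_mis[complete.cases(pop_mis), ]        # Step 3
  n_s      <- nrow(complete)
  pseudo   <- complete[rep(seq_len(n_s), nrow(pop) %/% n_s), ]  # Step 4: copies
  extra    <- sample(n_s, nrow(pop) - nrow(pseudo))     # extras, each at most once
  pseudo   <- rbind(pseudo, complete[extra, ])
  T_boot   <- replicate(B, occupation_totals(           # Step 5: inner loop
    mass_impute(add_missingness(pseudo))))
  se_boot  <- apply(T_boot, 1, sd)                      # Steps 6-7: sd() uses the
  list(T_imp = T_imp, se = se_boot)                     # (B - 1)^-1 denominator
}
```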

In Step 4 we create only one pseudo-population. In principle, it may be better to construct several pseudo-populations. However, previous results in Chauvet (2007) and Kuijvenhoven and Scholtus (2011) suggest that creating several pseudo-populations instead of only one hardly affects the estimated variances. In other words, creating a single pseudo-population as we did often leads to variance estimates of similar accuracy as creating several pseudo-populations.

We approximated true standard deviations for the categories of Occupation by introducing missingness into our original population by means of our MCAR procedure for the Low missingness scenario 2,500 times. To each of these 2,500 datasets with missing values we applied our mass imputation approach, calculated the totals for the categories of Occupation in the imputed datasets, and estimated the true standard deviations over the 2,500 datasets.

Our pseudo-population bootstrap approach to estimate the standard errors of the totals of Occupation is very time-consuming since we applied our iterative imputation method \({N}_{mis}\times B=250\times 200=\mathrm{50,000}\) times. Since we use ten iterations in our imputation method, we had to apply our imputation method \(\mathrm{500,000}\) times in total. For this reason we have used a MCAR missingness mechanism, instead of MAR missingness mechanisms as we did in the rest of our paper. Even with a MCAR missingness procedure and using parallelized R code on a PC with four computing cores, it took more than two months to run this simulation study.

We note that even with a Low missingness scenario the expected percentage of units with one or more missing values is about 48.66%, so only about 51.33% of the units, i.e. \(51.33\%\times 3784\approx 1942\) units, are expected to be completely observed and can be used to construct a pseudo-population.

Table 13 gives the averages of estimated standard errors over the \({N}_{mis}\) populations with missing values divided by the true (approximate) standard deviations for the estimated totals of the categories of Occupation. We see that in most cases the average of the estimated standard error divided by the corresponding true (approximate) standard deviation is close to, but a bit less than, one. This shows that in most cases the estimates for the variances based on our pseudo-population bootstrap approach are close to the true variances of the imputation approach. Exceptions are category “6” – the category with the smallest true total – for which the ratio is clearly less than one, and category “999” – the category with the highest true total – for which it is a bit larger than one.

Table 13 Averages of the estimated standard errors over the \({N}_{mis}=250\) populations with missing values divided by true (approximate) standard deviations of the categories of Occupation

Table 14 gives coverage rates of the estimated \(95\%\)-confidence intervals for the categories of Occupation. These \(95\%\)-confidence intervals are computed as

Table 14 Coverage rates of the estimated confidence intervals of the categories of Occupation
$$\left({T}_{s\text{,imp}}\left(c;\, {\text{Occupation}}\right)-1.96\times {se}_{s,boot}\left(c\right),{T}_{s\text{,imp}}\left(c;\, {\text{Occupation}}\right)+1.96\times {se}_{s,boot}\left(c\right)\right).$$

Note that the estimated \(95\%\)-confidence intervals are centered around \({T}_{s\text{,imp}}\left(c;\, {\text{Occupation}}\right)\), i.e. around totals obtained by applying our mass imputation procedure directly to \({Pop}_{mis,s}\) \(\left(s=1,\dots ,{N}_{mis}\right)\) and before constructing a pseudo-population. This way of computing the confidence intervals was suggested by a reviewer.

The coverage rates of the estimated \(95\%\)-confidence intervals should be close to the nominal rate of 95%, which is indeed the case for all categories except category “6”, as can be seen in Table 14.

7 Discussion

In this paper we have generalized the imputation approach of Favre et al. (2005) to multiple variables to be imputed. We have also tested three approximations for the imputation probabilities taking known totals into account. We have carried out a simulation study to examine the properties of our proposed methodology in several different situations.

The first conclusion that we can draw is that our proposed methodology does work in the sense that it allows us to impute multivariate missing data such that edits are satisfied and known totals are exactly preserved.

For MCAR missing data mechanisms (see the Supplementary Material to this paper), the univariate results for individual variables and results for two-dimensional cross-tables are (nearly) unbiased. For MAR missing data mechanisms, univariate results are also (nearly) unbiased. However, whereas results for cross-tables are (nearly) unbiased for MCAR data mechanisms, they are biased for MAR mechanisms. Also, the prediction accuracy for individual values is much lower for MAR mechanisms than for MCAR mechanisms. In order to preserve cross-tables better for MAR mechanisms one should use imputation models that capture statistical relations between variables better than our relatively simple multinomial imputation models do. For instance, for our MAR mechanisms we could build different multinomial imputation models for each of our three age classes (see Sect. 4.3). Testing such more advanced imputation models that better capture statistical relations between variables is a point for future research.

In Sect. 6 we examined the use of a pseudo-population bootstrap approach to estimate the standard errors of the totals of Occupation. For most categories of Occupation, the averages of the estimated standard errors obtained by the bootstrap approach were quite close to the corresponding (approximate) true standard deviations. Also, coverage rates of the estimated confidence intervals of the estimated totals for the categories of Occupation were for most categories quite close to the nominal rate of 95%. This suggests that our pseudo-population bootstrap approach generally gives reasonable estimates for the imputation variance.

In this paper, we focused on mass imputation as this is the most relevant situation for the Dutch Population Census. However, a modified version of the proposed imputation approach also seems useful for cases where a sample of the population is imputed and weighted sums of the imputed values have to sum up to known population totals for some variables. Favre et al. (2004, 2005) developed a variant of Cox's controlled rounding algorithm that is able to handle this situation. This variant can be included in our imputation approach by replacing the original version of Cox's controlled rounding algorithm with the variant developed by Favre et al. (2004). We leave this extension of our imputation approach to future work.