1 Introduction

Non-response is a major problem for anyone collecting and processing data, such as National Statistical Institutes (NSIs). When left untreated, non-response can lead to biased estimates or invalid results from statistical analyses. Non-response can be subdivided into item non-response, where some values from otherwise observed units are missing, and unit non-response, where entire units are not observed.

A commonly used technique to deal with missing data is imputation (see, e.g., Rubin 1987, Schafer 1997, Little and Rubin 2002, De Waal et al. 2011 and Van Buuren 2012). In imputation, missing values are estimated and filled into the dataset. Imputation is used particularly often for item non-response. It is sometimes also used for unit non-response, although weighting is a more common technique for correcting for unit non-response.

Imputation can become challenging if the variable to be imputed has to comply with a known total. This situation often occurs when the total has been published before. Deviation from an earlier published total is deemed undesirable by many NSIs, as it may lead to confusion among users due to conflicting results for the same phenomenon. Even more challenging is the case where several variables in the same dataset need to be imputed. In addition to the known totals, there can be so-called edit rules (or edits for short) that have to be satisfied by the data. An example of an edit is that a baby cannot have completed primary school.

We illustrate the problem with an example. Statistics Netherlands publishes information on the highest educational level attained. In the Netherlands, the first results on the highest educational level attained are based on weighting the so-called Educational Attainment File (EAF). Later, information on the highest educational level attained is combined with other data sources to construct a virtual population census, i.e. a population census that is mainly based on administrative data covering the entire population. In the case of the Dutch Population Census, all variables except highest educational level attained and occupation are based on administrative data with full coverage. After construction of the virtual population census, we can break down information on the highest educational level attained into detailed groups of the population by using the background information available in the census. To facilitate the estimation process, highest educational level attained is mass imputed in the virtual population census, i.e. it is imputed for all population units for which no value has been observed. In that way, a complete dataset for the entire Dutch population is constructed, which can be used for multiple estimation purposes. Daalmans (2017) proposes a method for mass imputation of highest educational level attained based on logistic regression that can be used for the Dutch Population Census. However, the results for highest educational level attained based on the population census will deviate from the earlier published results based on weighting the EAF if standard imputation techniques, such as logistic regression, are used. Besides highest educational level attained, other variables, in particular occupation, have missing values and need to be imputed.

As far as we are aware, only one method has thus far been proposed in the literature that allows one to impute categorical data with missing values for multiple variables such that previously published totals are preserved and specified edits are satisfied (De Waal et al. 2017). However, that method is very time-consuming, and can only be applied to relatively small problem instances. In some cases, the method also has to “backtrack”, i.e. a previously imputed variable may need to be imputed in a different way. As noted by De Waal et al. (2017), this would lead to an even more time-consuming and extremely complicated process.

In the current paper, we propose an imputation approach that can be used for a broad class of imputation methods for multivariate categorical data such that previously published totals are preserved by the imputed data while edits are satisfied, and that can handle much larger problem instances than the approach by De Waal et al. (2017). This imputation approach is based on adding a calibration step to standard imputation techniques. It can be used in combination with any imputation model that estimates imputation probabilities, i.e. the probability that imputation of a certain category for a variable in a certain unit leads to the correct value for this variable and unit. Examples of imputation models that estimate such imputation probabilities are multinomial models and logistic regression models.

Our proposed imputation approach generalizes an approach by Favre et al. (2005). In their approach, only one categorical variable is to be imputed subject to edits and known totals. We generalize this to the multivariate case, where multiple categorical variables have to be imputed subject to edits and known totals. We achieve this by adopting a fully conditional specification approach (see Subsect. 3.2) that takes the previously published totals into account, in combination with a Fellegi–Holt approach to satisfy all edits (see Fellegi and Holt 1976 and Subsect. 3.1 of the current paper). Whereas Favre et al. (2005) approximate imputation probabilities given that known totals have to be preserved in only one way, we also examine two alternative approximations (see Subsect. 3.3). In this paper, we will assume that all population units will be imputed, i.e. that mass imputation is used, and that therefore no weighting is necessary to obtain estimates for population totals.

Section 2 of this paper first discusses the approach developed by Favre et al. (2005) for a single categorical variable with missing data. Section 3 discusses our proposed generalization to multivariate categorical missing data. Section 4 describes the evaluation study that we carried out to assess the proposed generalization, while Sect. 5 examines the results of this study. Section 6 examines the estimation of imputation variance by means of a pseudo-population bootstrap approach. Section 7 concludes the paper with a short discussion.

2 Approach by Favre, Matei and Tillé for univariate missing data

The approach of Favre et al. (2005) for imputation of univariate missing categorical data subject to edits and known totals consists of four steps. In the first step, user-specified edits are used to find structural zeroes for the variable to be imputed, i.e. for each record in the dataset the categories that are not allowed according to the observed values in combination with the specified edits are determined. In the second step, the imputation probabilities of the allowed categories are estimated for each record. In the implementation of Favre et al. (2005) this is done by assuming a multinomial logistic model, taking the structural zeroes into account. In the third step, these probabilities are calibrated so that, for each category, they sum up to the corresponding known total and, for each record, to one. This is achieved by using iterative proportional fitting (IPF). In the fourth step, Cox's controlled rounding algorithm (see Cox 1987) is used to fix, per record, one of the probabilities for the allowed categories to one and the probabilities of the other allowed categories to zero. The category for which the probability is set to one in a certain record is imputed in that record.

We illustrate the basic ideas of the approach by Favre et al. (2005) by means of an example. In this example one variable with three categories is to be imputed in eight units. In the first step of the approach, the observed values for the other variables are filled into the edits to find the structural zeroes for the variable to be imputed. For convenience, we assume that this step has already been carried out for all units in the dataset.

The units to be imputed are given in Table 1. The known totals are given in the last row.

Table 1 Units to be imputed and totals per category

The cells with a “*” are allowed to be imputed. The zeroes in units one, three, five, six and eight in Table 1 denote values that are not allowed to be imputed due to the specified edits, i.e. the structural zeroes. As explained above, imputation means that in each unit one of the “*” cells is replaced by a one, and the other “*” cells by zeroes.

The value to be imputed in a certain unit is essentially found by randomly drawing one of the allowed values using the imputation probabilities obtained from the imputation model. We assume that, taking into account that certain categories are not allowed in certain units, the imputation probabilities obtained by the imputation model are as in Table 2 (Step 2).

Table 2 Imputation probabilities

In Table 2, we see, for instance, that, according to the assumed imputation model, the probability that the actual value of the variable to be imputed in unit one equals \({c}_{2}\) is 0.4. In Step 3 of their approach, Favre et al. (2005) apply IPF to the imputation probabilities in Table 2 so that each row sums up to one and each column to the corresponding total. This step is needed to take the known totals into account. The adjusted probabilities are given in Table 3, together with the totals per category in the last row.

Table 3 Adjusted imputation probabilities and totals per category

Using IPF to adjust the imputation probabilities actually only leads to an approximation for the exact imputation probabilities that take known totals and edits into account. In general, it is very complicated and/or time-consuming to compute the exact probabilities. We will return to this point in Subsect. 3.3.
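To make Step 3 concrete, the following is a minimal sketch of the IPF calibration in R (the language of our implementation, see Sect. 4.1). The probability matrix and totals below are illustrative values, not the actual entries of Table 2; structural zeroes are encoded as 0 and remain 0 throughout. The sketch assumes that the known totals sum to the number of records to be imputed.

```r
# Calibrate a matrix P of imputation probabilities (rows: records, columns:
# categories) so that each row sums to one and each column to its known total.
ipf_calibrate <- function(P, col_totals, tol = 1e-10, max_iter = 1000) {
  for (iter in seq_len(max_iter)) {
    P_old <- P
    P <- P / rowSums(P)                              # rows: scale to one
    P <- sweep(P, 2, col_totals / colSums(P), `*`)   # columns: scale to totals
    if (max(abs(P - P_old)) < tol) break
  }
  P
}

# Illustrative example with three records and known totals (1, 1, 1);
# the 0 in the first row is a structural zero and stays zero.
P <- rbind(c(0.6, 0.4, 0.0),
           c(0.2, 0.5, 0.3),
           c(0.3, 0.3, 0.4))
round(ipf_calibrate(P, col_totals = c(1, 1, 1)), 3)
```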

In Step 4 of their approach, Favre et al. (2005) apply Cox's controlled rounding algorithm (Cox 1987) to ensure that each record with a missing value for the target variable is imputed and the known totals are preserved. Cox's controlled rounding algorithm is a stochastic procedure. In the rounding algorithm, a rounding base \(b\) is specified. All entries in the table, including the marginal totals, are rounded to integer multiples of \(b\). In the approach by Favre et al. (2005), the rounding base \(b\) equals 1. Since all internal entries are (adjusted) imputation probabilities that lie between zero and one, all internal entries are rounded to either zero or one. Entries that already are multiples of \(b\) are not changed by Cox's algorithm. This implies that the marginal totals, which are non-negative integers, are not changed by Cox's method. An appealing property of Cox's controlled rounding algorithm is that it preserves additivity of the table. That is, the rounded internal entries sum up to the rounded marginal totals. This implies that, since the entries in each row sum up to one, exactly one entry per row will be rounded to one, and will hence be imputed. All other entries in a row will be set to zero. Another appealing property of Cox's controlled rounding algorithm is that it is unbiased. That is, if we were to repeat the rounding algorithm an infinite number of times, the averages of the rounded cell values over the repetitions would be equal to the original unrounded cell values, i.e. the adjusted imputation probabilities given in Table 3. In other words, the categories that are imputed are drawn according to the adjusted imputation probabilities in Table 3. If we apply Cox's controlled rounding algorithm to Table 3, we may, for instance, obtain Table 4.

Table 4 Imputed units

3 Generalization to multiple variables with missing data

Our generalization of the approach of Favre et al. (2005) to multiple categorical variables with missing data consists of two different phases:

  • A start-up phase where we impute all variables for the first time. This phase serves two different purposes. First, it gives us “rough” imputations, which are later improved upon in the second phase. Second, and more importantly, the imputed dataset after this phase satisfies all edits. A dataset that satisfies all edits is a prerequisite for the method in the second phase, which achieves consistency with the totals. In the start-up phase we do not preserve known totals yet.

  • The actual imputation phase where we iteratively re-impute all data that were originally missing. This phase also serves two purposes. First, in this phase we improve the imputations after the first phase. Second, in this phase we preserve known totals by calibrating the imputations to these totals.

3.1 The start-up phase

In the start-up phase we use sequential imputation, i.e. we impute each record in turn and within each record we impute each variable in turn. For each record with missing values, we apply the following steps for each variable to be imputed.

1S. For the current variable to be imputed, we use a – usually rather simple – imputation model, for instance, a multinomial imputation model involving only a few auxiliary variables, to estimate imputation probabilities for its categories.

2S. For the current variable to be imputed, we derive all allowable categories, given the observed and already imputed variables in the record under consideration and the edits.

3S. We impute the current variable to be imputed by drawing categories using the imputation probabilities from Step 1S until we draw an allowable category.
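As an illustration, the following R sketch shows Steps 1S–3S for a single record. The helper allowed_categories() is hypothetical and stands for the Fellegi–Holt elimination procedure of Step 2S, which is discussed in detail below.

```r
# Steps 1S-3S for one record and one target variable:
#   start_probs: named vector of marginal imputation probabilities (Step 1S);
#   allowed_categories(): hypothetical helper implementing Step 2S.
impute_startup <- function(record, target, start_probs, edits) {
  allowed <- allowed_categories(record, target, edits)        # Step 2S
  repeat {                                                    # Step 3S: draw until
    draw <- sample(names(start_probs), 1, prob = start_probs) # an allowable
    if (draw %in% allowed) break                              # category comes up
  }
  record[[target]] <- draw
  record
}
```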

Step 2S is a fundamental step as it ensures that after the start-up phase we will have a fully imputed dataset satisfying all specified edits, which we need as a starting point for the actual imputation phase. It is also a non-trivial step, and may in fact be the most complicated step of our entire imputation approach from a technical point of view.

We first illustrate the problem with sequentially imputing data that have to satisfy edits. Suppose we have a dataset with three variables: Marital status, Age and Relation to head of household. The possible values of Marital status are “Married”, “Unmarried”, “Divorced” and “Widowed”, of Age “< 16 years” and “≥ 16 years”, and of Relation to head of household “Spouse”, “Child” and “Other”. Suppose we have two user-specified edits: the first edit says that someone who is less than 16 years old cannot be married, and the second that someone who is not married cannot be the spouse of the head of household. The first edit involves the variables Age and Marital status, and the second edit the variables Marital status and Relation to head of household. Now suppose that both Marital status and Age are missing in a certain record, the observed value of Relation to head of household in that record is “Spouse”, and Age is the current variable to be imputed. Neither of the two user-specified edits involves both the observed variable, Relation to head of household, and the current variable to be imputed, Age. That is, neither of these two edits prevents us from imputing the value “< 16 years” for Age. However, if we were to do that, we would later notice – while trying to impute Marital status – that there is no value for Marital status that satisfies both edits, since a person younger than 16 years cannot be married, while someone who is a spouse of the head of household has to be married. So, the problem is that, while imputing the current variable, we have to take into account edits involving variables to be imputed later.

We will now sketch the procedure in Step 2S to overcome this problem and then illustrate this procedure by means of the above example. First of all, we fill in the values of the observed and already imputed variables (if any) in the record under consideration into the user-specified edits. This leads to a set of edits for the remaining variables to be imputed. The main idea of the approach in Step 2S is to eliminate all remaining variables to be imputed except the current variable to be imputed from this set of edits by means of the Fellegi–Holt elimination method (see Appendix A). When a variable is eliminated by means of the Fellegi–Holt elimination method, a new set of edits is obtained that has to be satisfied by all remaining variables to be imputed. By repeated elimination, we obtain a set of edits that only involves the current variable to be imputed. A set of edits for a single categorical variable simply defines a set of allowed values for that variable. So, from the derived set of edits for the current variable to be imputed we can immediately see the allowable values for that variable.

The Fellegi–Holt elimination method has the nice and very important property that the new set of edits for the remaining variables to be imputed, obtained after elimination of a variable, can be satisfied if and only if a value for the eliminated variable exists such that the set of edits before elimination can also be satisfied (for details and proofs see Fellegi and Holt 1976 and De Waal and Quere 2003). By repeated application of this property, we find that if and only if the current variable to be imputed is imputed such that the set of edits obtained after elimination of all other variables remaining to be imputed is satisfied, all eliminated variables can be imputed such that all user-specified edits are satisfied (again see De Waal and Quere 2003). In turn this implies that we can impute the variables with missing values in a record sequentially, i.e. one at a time, while still ensuring that all edits can be satisfied.

In mathematical logic, the Fellegi–Holt elimination method is known as (multivalent) resolution (see, e.g., Chandru and Hooker 1999 and Hooker 2000). Resolution can be used to check whether a set of propositions (the edits in our case) can be satisfied.

We now briefly illustrate the procedure in Step 2S by means of our example. We first introduce some notation. We denote the number of variables by \(n\). In the case of categorical data, an edit \(k\) is usually written in so-called normal form, i.e. as a Cartesian product \({F}_{1}^{k}\times {F}_{2}^{k}\times \dots \times {F}_{n}^{k}\) of non-empty sets \({F}_{s}^{k}\) \((s=1,2,\dots,n)\), meaning that if for a record with values \(\left({v}_{1},{v}_{2},\dots ,{v}_{n}\right)\) we have \({v}_{s}\in {F}_{s}^{k}\) for all \(s=1,2,\dots,n\), then the record fails edit \(k\); otherwise the record satisfies edit \(k\).

In normal form the edit saying that someone who is less than 16 years cannot be married can be written as

$$\left\{\text{Married}\right\}\times \left\{<16\;\text{years}\right\}\times \left\{\text{Spouse},\;\text{Child},\;\text{Other}\right\},$$
(1)

and the edit saying that someone who is not married cannot be the spouse of the head of household as

$$\left\{\text{Unmarried},\;\text{Divorced},\;\text{Widowed}\right\}\times \left\{<16\;\text{years},\;\ge 16\;\text{years}\right\}\times \left\{\text{Spouse}\right\}.$$
(2)

We fill in the value “Spouse” for Relation to head of household into edits (1) and (2) and obtain the edits

$$\left\{\text{Married}\right\}\times \left\{<16\;\text{years}\right\}$$
(3)

and

$$\left\{\text{Unmarried},\;\text{Divorced},\;\text{Widowed}\right\}\times \left\{<16\;\text{years},\;\ge 16\;\text{years}\right\}$$
(4)

for the variables to be imputed, Marital status and Age. (For notational convenience, in edits, we do not mention variables whose values have been substituted into a set of edits nor variables that have been eliminated).

Since Age is the current variable to be imputed in our example, we have to eliminate variable Marital status from (3) and (4). We obtain the edit

$$\left\{<16\;\text{years}\right\}$$
(5)

for variable Age (see Appendix A).

Edit (5) implies that only the value “ ≥ 16 years” is allowed to be imputed for variable Age. Indeed, if we later impute the value “Married” for Marital status, we obtain an imputed record satisfying both user-specified edits (1) and (2).
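The elimination step above can be written out in a few lines of R. The sketch below encodes edits (3) and (4) as named lists of failing value sets and, under the simplifying assumption that every edit explicitly lists a set for every remaining variable, generates the implied edits obtained by eliminating Marital status; it reproduces edit (5). For simplicity it scans all subsets of edits rather than only the minimal covering subsets, so dominated implied edits may also be produced.

```r
# Domains of the remaining variables and edits (3) and (4) in normal form;
# each edit lists, per variable, the set of values for which the edit fails.
domains <- list(Marital = c("Married", "Unmarried", "Divorced", "Widowed"),
                Age     = c("<16 years", ">=16 years"))
edits <- list(list(Marital = "Married", Age = "<16 years"),             # edit (3)
              list(Marital = c("Unmarried", "Divorced", "Widowed"),
                   Age     = c("<16 years", ">=16 years")))             # edit (4)

# Fellegi-Holt elimination of variable `var`: every subset of edits whose
# sets for `var` jointly cover its domain yields an implied edit, formed by
# intersecting the sets of the other variables; empty edits are discarded.
eliminate <- function(edits, var, domains) {
  implied <- list()
  n <- length(edits)
  for (bits in 1:(2^n - 1)) {                 # every non-empty subset of edits
    sel <- edits[as.logical(intToBits(bits)[1:n])]
    if (!all(domains[[var]] %in% unlist(lapply(sel, `[[`, var)))) next
    rest <- setdiff(names(domains), var)
    new_edit <- lapply(rest, function(v) Reduce(intersect, lapply(sel, `[[`, v)))
    names(new_edit) <- rest
    if (all(lengths(new_edit) > 0)) implied <- c(implied, list(new_edit))
  }
  implied
}

eliminate(edits, "Marital", domains)
# yields one implied edit, list(Age = "<16 years"), i.e. edit (5): only the
# value ">=16 years" may be imputed for Age.
```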

3.2 The actual imputation phase

In the actual imputation phase we iteratively (re-)impute all data that were originally missing, using the other (partly imputed) data as auxiliary data. We essentially apply a fully conditional specification (FCS) approach for imputation of multivariate missing data (see, e.g., Raghunathan et al. 2001, Van Buuren and Groothuis-Oudshoorn 2011 and Rubin 2003). In FCS, one specifies a separate conditional imputation model for each variable to be imputed. In principle, all other variables can be used as auxiliary variables in the imputation model for a certain variable. In the estimation process of the model parameters, as well as in the actual imputation process, previously imputed values of the auxiliary variables are used.

FCS is usually applied for Bayesian versions of multiple imputation, where one specifies a prior distribution for each imputation model, derives the posterior distribution given the prior distribution and the observed data, and then draws multiple imputations from the posterior distribution. In this paper, we will apply FCS in a frequentist context, i.e. without specifying prior distributions for the imputation models. Our approach is similar to the approach by Siddique and Belin (2008). However, whereas Siddique and Belin (2008) use a hot-deck imputation method without an explicit imputation model for each variable to be imputed, we will use multinomial imputation models in our simulation study.

In the actual imputation phase we start with the fully imputed dataset obtained after the start-up phase. We iteratively (re-)impute the data that were originally missing, where we impute each variable in turn. For each variable to be imputed we apply the following steps.

1I. For the current variable to be imputed, we use an imputation model, for instance, a multinomial imputation model with, in principle, all other variables as auxiliary variables, to estimate imputation probabilities for its categories.

2I. For each record in which the value of the current variable to be imputed was originally missing, we fill in the other data in that record (either observed or imputed) into the edits. This may lead to some structural zeroes, i.e. categories that are not allowed to be imputed, in some records for the current variable to be imputed.

3I. For the current variable to be imputed, we approximate the correct imputation probabilities per record, i.e. we use the probabilities from the imputation model of Step 1I and adjust them, taking structural zeroes and – if applicable – known totals into account. For each record to be imputed, taking structural zeroes into account simply amounts to setting imputation probabilities for those structural zeroes to zero and rescaling the other imputation probabilities so they sum up to one. Subsect. 3.3 discusses how to adjust imputation probabilities so they take known totals into account.

4I. If the totals of the current variable to be imputed are not known, we use the adjusted imputation probabilities from Step 3I to draw imputations. If the totals of the current variable to be imputed are known, we use Cox's controlled rounding algorithm to find the imputations.

We keep on iterating the above steps 1I to 4I for all variables until the distribution of the imputed data has converged to a final distribution.

Note that, in contrast to Step 2S of the start-up phase, Step 2I is quite simple. The reason why Step 2I is so simple from a technical point of view is that Step 2S of the start-up phase guarantees that after the start-up phase we have an imputed dataset satisfying all edits. Due to this, Step 2I in the actual imputation phase only has to ensure that we never impute a value for the current variable to be imputed that is not allowed according to the other data in that record and the edits.
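A compact sketch of one pass of the actual imputation phase may help to fix ideas. All helper functions here are hypothetical stand-ins for the components described above: fit_multinom() for Step 1I, structural_zeroes() for Step 2I, adjust_probs() for one of the approximations of Subsect. 3.3 (Step 3I) and cox_round() for Cox's controlled rounding algorithm (Step 4I); miss records which cells were originally missing.

```r
# One iteration over all variables to be imputed (Steps 1I-4I); in our study
# this loop is repeated ten times.
for (target in vars_to_impute) {
  P <- fit_multinom(data, target, rows = miss[[target]])   # Step 1I
  P[structural_zeroes(data, target)] <- 0                  # Step 2I
  P <- P / rowSums(P)                                      # rescale rows to one
  if (target %in% names(known_totals)) {
    P <- adjust_probs(P, known_totals[[target]])           # Step 3I
    data[miss[[target]], target] <- cox_round(P)           # Step 4I: rounding
  } else {
    data[miss[[target]], target] <-                        # Step 4I: random draws
      apply(P, 1, function(p) sample(colnames(P), 1, prob = p))
  }
}
```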

3.3 Adjusting the imputation probabilities

As we already mentioned in Sect. 2, using IPF to adjust the imputation probabilities leads to an approximation for the exact imputation probabilities that take edits and known totals into account. In this section we discuss two other approximations. We assume that the imputation probabilities from the posited imputation model already take edits into account.

In principle, the exact imputation probabilities taking known totals into account can be found by enumerating all possibilities. To illustrate how, we consider Table 5 below. We suppose that we want to impute a variable \(x\) with three categories (\({c}_{1}\), \({c}_{2}\) and \({c}_{3}\)) in four units. We assume that there are no structural zeroes, that we know the totals of the categories, and that we have estimated imputation probabilities for the four units.

Table 5 Estimated imputation probabilities and known totals

The exact adjusted imputation probabilities are given by \(\mathrm{Pr}\left({x}_{i}=k|\text{known totals}\right)\), where \({x}_{i}\) is the value of \(x\) in unit \(i\) \((i=1,\dots ,4)\) and \(k={c}_{1}\), \({c}_{2}\) or \({c}_{3}\). If we were to disregard the totals, there would be \(3\times 3\times 3\times 3=81\) possible outcomes, and the exact imputation probabilities would be given by the values in Table 5. However, if we do take the known totals into account, there are only 12 possible outcomes. These are given in Table 6 below, together with their corresponding probabilities according to Table 5.

Table 6 Possible outcomes, given known totals

For instance, \(\mathrm{Pr}\left({\text{outcome}}=1\right)=\mathrm{Pr}\left({x}_{1}={c}_{1}\right)\times \mathrm{Pr}\left({x}_{2}={c}_{1}\right)\times \mathrm{Pr}\left({x}_{3}={c}_{2}\right)\times \mathrm{Pr}\left({x}_{4}={c}_{3}\right)\approx 0.0313\) according to Table 5. Using Table 6 we can easily calculate \(\mathrm{Pr}\left({x}_{i}=k|\text{known totals}\right)\) for \(i=1,\dots ,4\) and \(k={c}_{1}\), \({c}_{2}\) or \({c}_{3}\). For example,

$$\mathrm{Pr}\left({x}_{1}={c}_{1}|\text{known totals}\right)=\frac{0.0313+0.0078+0.0313+0.0156+0.0078+0.0156}{0.1445}\approx 0.7568,$$

where 0.1445 is the sum of all the probabilities over all 12 possible outcomes.
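Since Table 5 itself is not reproduced here, the following R sketch carries out the enumeration with illustrative probabilities. It generates all \(3^{4}=81\) assignments, keeps the 12 outcomes that match the known totals (two units imputed with \({c}_{1}\), one with \({c}_{2}\) and one with \({c}_{3}\), as in Table 6), and computes the exact conditional probability for unit 1.

```r
# Illustrative imputation probabilities (not the actual Table 5 entries);
# row i gives the probabilities of categories c1, c2, c3 for unit i.
P <- rbind(c(0.50, 0.25, 0.25),
           c(0.50, 0.25, 0.25),
           c(0.25, 0.50, 0.25),
           c(0.25, 0.25, 0.50))
totals <- c(2, 1, 1)                # known totals of c1, c2 and c3

grid  <- as.matrix(expand.grid(rep(list(1:3), 4)))   # all 81 assignments
valid <- apply(grid, 1, function(o) all(tabulate(o, 3) == totals))
grid  <- grid[valid, ]                               # the 12 feasible outcomes
w     <- apply(grid, 1, function(o) prod(P[cbind(1:4, o)]))

sum(w[grid[, 1] == 1]) / sum(w)     # exact Pr(x_1 = c_1 | known totals)
```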

In this very small example we can enumerate all possible outcomes. In general, enumeration is infeasible, and we have to resort to approximating the adjusted imputation probabilities. Besides the IPF approach as used by Favre et al. (2005), we have tested two alternative approximations in our simulation study.

The underlying idea of the alternative approximations is that in each unit in which the value of the current variable to be imputed is missing, we have to decide whether we are going to impute category \(c\) or not.

If we only consider a certain category \(c\) and assume that we can neglect the other categories, we would have to draw from a so-called Conditional Binomial (\({\text{CB}}\)) distribution (see, e.g., Chen and Liu 1997 and Chen 1998). The \({\text{CB}}\) distribution is closely related to the Poisson-Binomial (\({\text{PB}}\)) distribution. Suppose we have a dataset \(S\) with \(\left|S\right|\) units and let \({\varvec{Z}}=\left({Z}_{1},\dots ,{Z}_{\left|S\right|}\right)\) be the outcomes of \(\left|S\right|\) independent Bernoulli trials with probabilities \({p}_{1},\dots ,{p}_{\left|S\right|}\). The probability of a total of \(k\) successes in the \(\left|S\right|\) trials is then given by a \({\text{PB}}\) distribution, which we denote as \({\text{PB}}\left(k;\, {{\varvec{p}}}_{S}\right)\), where \({{\varvec{p}}}_{S}=\left({p}_{1},\dots ,{p}_{\left|S\right|}\right)\) is the vector of parameters of the \({\text{PB}}\) distribution. The conditional distribution of \({\varvec{Z}}\) given that \(\sum_{i\in S}{Z}_{i}=k\) is a \({\text{CB}}\) distribution with parameters \(k\) and \({{\varvec{p}}}_{S}\).

We will approximate the \({\text{CB}}\) distribution for our situation. Suppose there are \({m}_{x}\) units (numbered 1 to \({m}_{x}\)) in which the value of the variable \(x\) under consideration is missing, and we have to impute category \(c\) exactly \({k}_{c}\) times. Together these \({m}_{x}\) units form a set \({S}_{x}\). We consider a unit \(i\) in which the value of the variable under consideration is missing. The exact adjusted probability for imputing category \(c\) in unit \(i\) is given by

$${p}_{ic}^{*}=\frac{{p}_{ic}\times \mathrm{Pr}\left(\text{category } c \text{ is selected } ({k}_{c}-1) \text{ times in the remaining } ({m}_{x}-1) \text{ units}\right)}{\mathrm{Pr}\left(\text{category } c \text{ is selected } {k}_{c} \text{ times in } {m}_{x} \text{ units}\right)}$$

if \({k}_{c}\ne 0\), and \({p}_{ic}^{*}=0\) if \({k}_{c}=0\), where the \({p}_{ic}\) are the imputation probabilities without taking the known totals into account and the \({p}_{ic}^{*}\) are the adjusted imputation probabilities that do take the known totals into account, i.e. the \({p}_{ic}^{*}\) are the probabilities of our \({\text{CB}}\) distribution. That is,

$$\begin{aligned}{p}_{ic}^{*} &= \frac{{p}_{ic}\,\mathrm{PB}\left({k}_{c}-1;\,{{\varvec{p}}}_{{S}_{x}\backslash \{i\}}\right)}{{p}_{ic}\,\mathrm{PB}\left({k}_{c}-1;\,{{\varvec{p}}}_{{S}_{x}\backslash \{i\}}\right)+\left(1-{p}_{ic}\right)\mathrm{PB}\left({k}_{c};\,{{\varvec{p}}}_{{S}_{x}\backslash \{i\}}\right)}\\ &= \frac{{p}_{ic}}{{p}_{ic}+\left(1-{p}_{ic}\right)\dfrac{\mathrm{PB}\left({k}_{c};\,{{\varvec{p}}}_{{S}_{x}\backslash \{i\}}\right)}{\mathrm{PB}\left({k}_{c}-1;\,{{\varvec{p}}}_{{S}_{x}\backslash \{i\}}\right)}}\end{aligned}$$
(6)

In this paper we use simple approximations for the \({\text{PB}}\) distribution. As a reviewer pointed out, a \({\text{PB}}\) distribution can be computed exactly quite efficiently by means of a Fast Fourier Transform (Hong 2013), which is implemented in the “poisbinom” package. We have not used this exact approach for two reasons. A pragmatic reason is that even with an efficient approach to calculate a \({\text{PB}}\) distribution exactly, our simulation study – where we had to calculate or approximate a \({\text{PB}}\) distribution millions of times (see Sect. 4) – would likely take much computing time. Even with our simple approximations our simulation study took several months. A more important theoretical reason is that using a \({\text{CB}}\) distribution – and implicitly a \({\text{PB}}\) distribution – is itself an approximation, since the assumption that we can neglect the other categories when drawing a category is not completely correct: the estimated imputation probabilities and known totals for the other categories affect the adjusted imputation probabilities for the category \(c\) under consideration. For instance, using the exact \({\text{PB}}\) distribution in Eq. (6) gives \(\mathrm{Pr}\left({x}_{1}={c}_{1}|\text{known totals}\right)=0.7596\) for Table 5, which differs slightly from the correct value of 0.7568 obtained by enumeration above.

Our first simple approximation of \(\text{PB}({k}_{c};\, {{\varvec{p}}}_{{S}_{x}\backslash \{i\}})\) is by a Poisson distribution with mean \({\lambda }_{{S}_{x}\backslash \{i\}}=\sum_{t\in {S}_{x}\backslash \{i\}}{p}_{tc}\) (see, e.g., Chen et al. 1994, Chen and Liu 1997, Chen 1998 and Chen 2000). Using this approximation when \({k}_{c}\ne 0\), we get \(\frac{\text{PB}({k}_{c};\, {{\varvec{p}}}_{{S}_{x}\backslash \{i\}})}{\text{PB}({k}_{c}-1;\, {{\varvec{p}}}_{{S}_{x}\backslash \{i\}})}\approx \frac{{\lambda }_{{S}_{x}\backslash \{i\}}}{{k}_{c}}\).

A second, alternative approximation of \(\text{PB}({k}_{c};\, {{\varvec{p}}}_{{S}_{x}\backslash \{i\}})\) is by a binomial distribution with success probability \({p}_{{S}_{x}\backslash \{i\}}(c)=\sum_{t\in {S}_{x}\backslash \{i\}}{p}_{tc}/|{S}_{x}\backslash \{i\}|=\sum_{t\in {S}_{x}\backslash \{i\}}{p}_{tc}/({m}_{x}-1)\) for category \(c\) (see, e.g., Chen and Liu 1997). Using this approximation when \({k}_{c}\ne 0\), we get \(\frac{\text{PB}({k}_{c};\, {{\varvec{p}}}_{{S}_{x}\backslash \{i\}})}{\text{PB}({k}_{c}-1;\, {{\varvec{p}}}_{{S}_{x}\backslash \{i\}})}\approx \frac{{m}_{x}-({k}_{c}-1)}{{k}_{c}}\cdot \frac{{p}_{{S}_{x}\backslash \{i\}}(c)}{1-{p}_{{S}_{x}\backslash \{i\}}(c)}\).
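The following R sketch computes the adjusted probability \({p}_{ic}^{*}\) of Eq. (6), with the \({\text{PB}}\) ratio replaced by the Poisson or binomial approximation just described, or computed exactly with the density function from the “poisbinom” package mentioned above (assuming its dpoisbinom(x, pp) interface).

```r
# Adjusted probability p*_ic of Eq. (6): p is the vector of unadjusted
# probabilities p_1c, ..., p_mc for category c, i the unit at hand, and k_c
# the number of times category c has to be imputed.
adjust_prob <- function(p, i, k_c, method = c("poisson", "binomial", "exact")) {
  if (k_c == 0) return(0)
  method <- match.arg(method)
  p_rest <- p[-i]
  ratio <- switch(method,   # approximates PB(k_c; p_rest) / PB(k_c - 1; p_rest)
    poisson  = sum(p_rest) / k_c,
    binomial = {
      pbar <- mean(p_rest)
      (length(p) - (k_c - 1)) / k_c * pbar / (1 - pbar)
    },
    exact    = poisbinom::dpoisbinom(k_c, p_rest) /
               poisbinom::dpoisbinom(k_c - 1, p_rest))
  p[i] / (p[i] + (1 - p[i]) * ratio)
}
```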

4 Evaluation study

4.1 Implementation

We implemented our code in R (see R Core Team 2020). In the start-up phase, we simply used the observed fractions of the categories of each variable to be imputed as imputation probabilities, unadjusted for known totals. For instance, if 49% of the observed people are female and 51% male, then we impute “Female” with probability 0.49 and “Male” with probability 0.51 in variable Gender. In the imputation phase, we have used a more complicated multinomial model. For each variable we have used all other (partly imputed) variables as auxiliary variables while estimating the imputation probabilities according to this model. We have implemented the estimation of the imputation probabilities using the function “multinom” from the R package “nnet”. This function uses neural networks to fit multinomial models given a set of auxiliary variables to the available complete data, and thus obtains estimates for the imputation probabilities for a target variable (for more details, see Ripley 2020). In our imputation approach we need to handle edits. That is, we need to substitute values into edits and we need to derive implied edits by eliminating variables (see Sect. 3.2). For this we have used the R package “editrules” (see De Jonge and Van der Loo 2012). R code for Cox’s controlled rounding algorithm was kindly provided to us by our colleague Sander Scholtus (Statistics Netherlands). We have gratefully used this code in our implementation. Our R code is available on GitHub: https://github.com/tonwaal/Calibrated-Imputation-for-Multivariate-Categorical-Data.
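As an illustration of the estimation step, a call to “multinom” may look as follows (with hypothetical variable and object names; at this point in the FCS iteration all auxiliary variables are complete, either observed or imputed):

```r
library(nnet)

# Fit a multinomial model for Occupation on the records where it is not
# missing, using all other variables as auxiliary variables.
fit <- multinom(Occupation ~ ., data = dat[!is.na(dat$Occupation), ],
                trace = FALSE)

# Estimated imputation probabilities for the records to be imputed.
probs <- predict(fit, newdata = dat[is.na(dat$Occupation), ], type = "probs")
```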

4.2 Dataset

The dataset used for our simulation study is a small part of the Dutch Population Census 2001 published by Statistics Netherlands. The dataset contains 3,784 randomly drawn individuals with 12 categorical variables. The variables and the corresponding numbers of categories are: Gender (2 categories), Age (17 categories), Position in the household (8 categories), Size of the household (6 categories), Residential area last year (3 categories), Nationality (3 categories), Country of birth (3 categories), Educational level (7 categories), Economic status (8 categories), Occupation (10 categories), NACE code (13 categories) and Marital status (4 categories). Descriptions of the categories of these variables can be found in Appendix B. Appendix C contains a description of the user-specified edits. The dataset used is complete and does not contain any missing values for any of the units. We treated this dataset as our population. In our simulation study we introduced missingness into the data and mass imputed the missing values in the population.

In our study, we focused on the variables Educational level and Occupation, since these are the only variables in the Dutch Population Census that are (partly) based on surveys, rather than on administrative data covering (almost) the entire population, as is the case for the other variables. In the simulation study, we assumed that only the categories of the variable Educational level have to sum up to known totals for our population. This mimics the situation for the Dutch Population Census, where estimated totals for Educational level are known from the Educational Attainment File and mass imputation is planned to be used. The true totals of Educational level and Occupation are given in Tables 7 and 8, respectively.

Table 7 True totals of Educational level
Table 8 True totals of Occupation

4.3 The simulation study

For our simulation study we introduced missingness in individual data items in the dataset described in Subsect. 4.2. In general, three kinds of missing data mechanisms are distinguished in the literature: Missing Completely At Random (MCAR), Missing At Random (MAR) and Not Missing At Random (NMAR) (Rubin 1976). Roughly speaking, a missing data mechanism is MCAR, if the reason for a value being missing does not depend on the value itself, nor on values of background variables. A missing data mechanism is MAR, if the reason for a value being missing does not depend on the value itself, but does depend on values of background variables. A missing data mechanism is NMAR, if the reason for a value being missing depends on the value itself, even after correcting for the background variables. MCAR is the simplest case to deal with. Many imputation methods are able to correct for this situation. MAR is more complicated than MCAR. One can correct for a MAR mechanism by taking appropriate background variables into account in the imputation process. NMAR is the most difficult case by far. One can only correct for this case by relying on assumptions that cannot be tested from the dataset with missings itself.

In this paper we focus on MAR mechanisms, since these are the most important missingness mechanisms in practice. Sometimes we will also refer to results for MCAR mechanisms. Those results are given in the Supplementary Material to this paper. For more results for MCAR and NMAR mechanisms we refer to De Waal and Daalmans (2019).

In our simulation study, we examined two different fractions of missingness (referred to as Low and High) and three approximation methods for the correct imputation probabilities (the binomial approximation, the Poisson approximation and the IPF approximation).

In a preliminary study we found that the number of iterations in the actual imputation phase (see Sect. 3.2) affects the results as expected, and that after ten iterations our imputation approach has converged to (near) optimal results (see also De Waal and Daalmans 2019). Setting the number of iterations to ten appears to be a good trade-off between quality of the obtained results and the required computing time for our data. In our simulation study we therefore set the number of iterations in the actual imputation phase to ten.

We took into account that in practice values for some variables will be missing (far) more often than for other variables. For instance, values for Educational level and Occupation will be missing quite often in the Dutch Population Census, whereas values of, for example, Gender will be missing only very rarely.

The stochastic process to create missingness in a variable is independent of the missingness process for any of the other variables. For the MAR mechanisms, we examined situations where the missingness of Educational level or the missingness of Occupation depends on the age class of the person. In particular, we assumed that Educational level or Occupation is observed more often for people in a younger age class than for people in an older age class. For Educational level this reflects the current situation in the Netherlands, where the Educational level of younger people is more frequently available in administrative datasets than that of older people. We defined a “young” class consisting of people up to 29 years (categories 1 to 6 of Age), a “middle” class consisting of people from 30 up to 54 years (categories 7 to 11 of Age), and an “old” class consisting of people of 55 years and older (categories 12 to 17 of Age). The “young” class consists of 1,285 persons, the “middle” class of 1,714 persons, and the “old” class of 785 persons.

For the MAR mechanisms for Educational level, we assumed that the missing data mechanism for each of the other variables, including Occupation, is MCAR. Similarly, for the MAR mechanisms for Occupation, we assumed that the missing data mechanism for each of the other variables, including Educational level, is MCAR. The numbers of missings that we created for the MAR mechanism for Educational level for each variable in all three age classes for the Low missingness scenario are given in Table 9. For Educational level the missingness percentages are: 12.5% for the “young” class, 27.5% for the “middle” class and 40% for the “old” class. For High missingness, the numbers of missings are twice as high as in Table 9.

Table 9 Numbers of missings for the MAR mechanism for educational level and Low missingness

The numbers of missings that we created for the MAR mechanism for Occupation are the same as in Table 9, except that the number of missings for Educational level and Occupation are interchanged, i.e. the numbers of missings for Educational level are 321 for the “young” class, 429 for the “middle” class and 196 for the “old” class, and for Occupation 161 (12.5%) for the “young” class, 471 (27.5%) for the “middle” class and 314 (40%) for the “old” class. Again, for High missingness, the numbers of missings are twice as high.
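As an illustration, an age-dependent MAR pattern of this kind can be generated along the following lines (hypothetical object names; the rates are the Low missingness percentages given above):

```r
# MAR mechanism for Occupation, Low missingness: the probability of a value
# being missing depends on the age class, not on Occupation itself.
rates <- c(young = 0.125, middle = 0.275, old = 0.40)
missing <- runif(nrow(dat)) < rates[dat$age_class]  # age_class: "young"/"middle"/"old"
dat$Occupation[missing] <- NA
```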

For all four MAR missingness mechanisms (MAR for Educational level with Low and High missingness and MAR for Occupation with Low and High missingness) we generated 250 missing data patterns and deleted the corresponding values from our dataset.

As mentioned above, we will refer to some results for MCAR mechanisms in order to compare them to the results of the MAR missingness mechanisms. In our MCAR missingness mechanisms we created the number of missings given in the last column of Table 9 (“Total”) for the Low scenario and twice those numbers for the High scenario. The difference with the MAR missingness mechanisms is that for the MCAR missingness mechanisms missingness is not influenced by the value of Age.

4.4 Quality measures

Since we introduced missingness in a dataset with known values ourselves, we are able to compare the imputed values to the actual values. The main interest of NSIs is the production of high-quality descriptive statistics, such as totals and means. To measure the quality of our imputation approach, we therefore examine to what extent totals for Occupation are preserved. We also examine to what extent cell totals are preserved for the cross-table of Educational level and Occupation.

We denote the categories of a variable \(x\) by \(1,2,\dots ,{C}_{x}\), where \({C}_{x}\) is the number of categories of variable \(x\). For each category \(c\) \((c=1,2,\dots ,{C}_{x})\) of variable \(x\), we calculate two quality measures, \({B}_{x}(c)\) and \({M}_{x}(c)\), which are defined by

$${B}_{x}\left(c\right)=\frac{\sum_{s=1}^{S}\left({T}_{s\text{,imp}}\left(c;x\right)-{T}_{\text{true}}\left(c;x\right)\right)}{S}$$

and

$${M}_{x}\left(c\right)=\frac{\sum_{s=1}^{S}{\left({T}_{s,{\text{imp}}}\left(c;x\right)-{T}_{\text{true}}\left(c;x\right)\right)}^{2}}{S}$$

where \(S\) is the number of generated missing data patterns (250 in our case), \({T}_{s,\text{ imp}}\left(c;\, x\right)\) is the total of category \(c\) of variable \(x\) in the \(s\)-th imputed dataset \(\left(s=1,\dots ,S\right)\), and \({T}_{\text{true}}\left(c;\, x\right)\) is the corresponding total in the original, complete dataset. \({B}_{x}\left(c\right)\) is the empirical bias of the imputed total of category \(c\) of variable \(x\), and \({M}_{x}\left(c\right)\) its empirical mean squared error.

Similarly, for each combination of categories \(c\) \((c=1,2,\dots ,{C}_{x})\) and \(c^{\prime}\) \((c^{\prime}=1,2,\dots ,{C}_{y}\), with \({C}_{y}\) the number of categories of a variable \(y)\) of variables \(x\) and \(y\), respectively, we calculate two measures, \({B}_{x,y}(c,c^{\prime})\) and \({M}_{x,y}(c,c^{\prime})\), which are defined by

$${B}_{x,y}\left(c,c^{\prime}\right)=\frac{\sum_{s=1}^{S}\left({T}_{s,{\text{imp}}}\left(c,{c}^{\prime};\, x,y\right)-{T}_{\text{true}}\left(c,{c}^{\prime};\, x,y\right)\right)}{S}$$

and

$${M}_{x,y}\left(c,c^{\prime}\right)=\frac{\sum_{s=1}^{S}{\left({T}_{s,{\text{imp}}}\left(c,{c}^{\prime};\, x,y\right)-{T}_{\text{true}}\left(c,{c}^{\prime};\, x,y\right)\right)}^{2}}{S}$$

where \({T}_{s,{\text{imp}}}\left(c,{c}^{\prime};\, x,y\right)\) is the number of times that the combination of category \(c\) of variable \(x\) and category \(c^{\prime}\) of variable \(y\) occurs in the \(s\)-th imputed dataset, and \({T}_{\text{true}}\left(c,{c}^{\prime};\, x,y\right)\) is the number of times that the combination of category \(c\) of variable \(x\) and category \(c^{\prime}\) of variable \(y\) occurs in the original, complete dataset. We summarize \({B}_{x,y}\left(c,c^{\prime}\right)\) and \({M}_{x,y}\left(c,c^{\prime}\right)\) into two quality measures \({B}_{x,y}^{+}\) and \({M}_{x,y}^{+}\) defined by

$${B}_{x,y}^{+}=\frac{\sum_{c=1}^{{C}_{x}}\sum_{{c}^{\prime}=1}^{{C}_{y}}\left|{B}_{x,y}\left(c,c^{\prime}\right)\right|}{{C}_{x}{C}_{y}}$$

and

$${M}_{x,y}^{+}=\frac{\sum_{c=1}^{{C}_{x}}\sum_{{c}^{\prime}=1}^{{C}_{y}}{M}_{x,y}\left(c,c^{\prime}\right)}{{C}_{x}{C}_{y}}$$

In the summation for \({B}_{x,y}^{+}\), we take the absolute values of \({B}_{x,y}\left(c,c^{\prime}\right)\), since we do not want positive and negative values to cancel out. \({B}_{x,y}^{+}\) is the average absolute empirical bias over all cells in the cross-table of \(x\) and \(y\), and \({M}_{x,y}^{+}\) is the average empirical mean squared error over these cells.

Although generally considered to be less important by NSIs, we also look at how often the correct category is imputed for the missing values, i.e. the prediction accuracy of individual values. That is, for variable \(x\), we calculate

$${D}_{x}=\frac{\sum_{s=1}^{S}\sum_{i\in {\text{Miss}}_{x;\, s}}I({x}_{i,s,{\text{imp}}}={x}_{i,{\text{true}}})}{S}$$

where \({x}_{i,s,{\text{imp}}}\) is the value of variable \(x\) in unit \(i\) in the \(s\)-th imputed dataset, \({x}_{i,{\text{true}}}\) is the corresponding value in the original, complete dataset, \({\text{Miss}}_{x;\, s}\) is the set of units for which the value of variable \(x\) is missing in the \(s\)-th sample, and \(I\) is the indicator function, i.e. \(I\left({x}_{i,s,{\text{imp}}}={x}_{i,{\text{true}}}\right)=1\) if \({x}_{i,s,{\text{imp}}}={x}_{i,{\text{true}}}\) and \(I\left({x}_{i,s,{\text{imp}}}={x}_{i,{\text{true}}}\right)=0\) otherwise. We will summarize the results for \({D}_{x}\) into a single number \({D}^{+}\) defined by

$${D}^{+}=\sum_{x}{D}_{x}$$

where the summation runs over all 12 variables. For \({D}^{+}\), the higher its value, the better the quality of the imputations. For the other quality measures, the smaller their values, the better the quality of the imputations.
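Given the imputed totals over the \(S\) replications, these measures are straightforward to compute; a sketch follows, with T_imp a hypothetical \(S\times {C}_{x}\) matrix of imputed totals and T_true the vector of true totals.

```r
# Empirical bias and mean squared error per category of variable x.
diffs <- sweep(T_imp, 2, T_true)   # subtracts T_true(c) from each column c
B_x   <- colMeans(diffs)           # B_x(c)
M_x   <- colMeans(diffs^2)         # M_x(c)
sd_x  <- sqrt(M_x - B_x^2)         # standard deviation used in Sect. 5.1
```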

If using the binomial or the Poisson approximation for the exact imputation probabilities taking known totals has any effect on the quality of the imputations, this should most clearly be reflected in the \({D}^{+}\) measure, since the better the approximations of the imputation probabilities, the more missing values should be imputed correctly.

We have also compared estimated standard errors of the estimated totals for the categories of Occupation obtained by a bootstrap approach to their corresponding (approximate) true standard deviations. The bootstrap approach used to do this and the results thereof are described in Sect. 6.

5 Results

By design, our proposed imputation approach preserves totals for the categories of Educational level in all scenarios. Likewise, by design, our proposed imputation approach also satisfies specified edits.

5.1 Univariate results

We give univariate results for Occupation for the two MAR mechanisms – a MAR mechanism for Educational level and a MAR mechanism for Occupation – with Low missingness. The conclusions that we can draw for High missingness are similar to those for Low missingness. For other variables, the results are similar to those of Occupation (see De Waal and Daalmans 2019).

Table 10 presents univariate results for Occupation for the MAR mechanism in Educational level as well as for the MAR mechanism in Occupation for Low missingness.

Table 10 \({B}_{x}\left(c\right)\) and \({M}_{x}\left(c\right)\) for Occupation for MAR mechanisms with Low missingness

In Table 10, we see that \({B}_{x}\left(c\right)\) is quite low compared to the totals of the categories of Occupation in the complete dataset (see Table 8). Also, the standard deviations \({sd}_{x}(c)\), computed as \({sd}_{x}\left(c\right)=\sqrt{{M}_{x}\left(c\right)-{B}_{x}^{2}(c)}\), are quite low in Table 10, i.e. close to zero and much smaller than the totals of the categories of Occupation. So, univariate results for Occupation are preserved quite well for both MAR mechanisms.

5.2 Cross-tables

Table 11 gives the results for \({B}_{x,y}^{+}\) and \({M}_{x,y}^{+}\) for the cross-table of Educational level and Occupation for all four MAR mechanisms. In Appendix D we give results for \({B}_{x,y}\left(c,c^{\prime}\right)\) and \({M}_{x,y}\left(c,c^{\prime}\right)\) for the case of Low Missingness for each cell in the cross-table of Educational level and Occupation for our MCAR missingness mechanism and for the MAR missingness mechanisms for both Educational level and Occupation.

Table 11 Cross-table of educational level and occupation: \({B}_{x,y}^{+}\) and \({M}_{x,y}^{+}\) for MAR mechanisms

If using the binomial or Poisson approximation instead of the IPF approximation for the exact imputation probabilities taking known totals into account had any effect, we might see it in Table 11. However, we do not see such an effect, suggesting that all three approximations work about equally well.

The results for \({B}_{x,y}^{+}\) and \({M}_{x,y}^{+}\) in Table 11 for the MAR mechanisms are clearly higher than for MCAR mechanisms with the same number of missing values (see the Supplementary Material to this paper). This shows that cross-tables are clearly less well estimated for MAR mechanisms than for MCAR mechanisms. The results in Table 11 show that estimates for the cross-table of Educational level and Occupation are biased for our MAR mechanisms (see also Appendix D).

5.3 Number of correct imputations

As already mentioned, any effects of using the binomial or Poisson approximation instead of the IPF approximation should most clearly be reflected in the \({D}^{+}\) measure. However, we see no such effect in Table 12. This confirms the earlier finding that the IPF approximation works just as well as the binomial or Poisson approximation for our dataset.

Table 12 Results for \({D}^{+}\) for MAR mechanisms; in brackets the percentage of correct imputations for the total number of missings

The results for \({D}^{+}\) in Table 12 for the MAR mechanisms are clearly worse than for MCAR mechanisms with the same number of missing values (see the Supplementary Material to this paper). This shows that prediction accuracy for individual values decreases substantially for a MAR mechanism in comparison to an MCAR mechanism.

6 Variance estimation

Often an estimate – or at least a good indicator – for the variance of an estimator is considered very important, and that certainly holds true for estimates for a population census. In this section we therefore compare estimated standard errors for the estimated totals of the categories of Occupation obtained by a bootstrap approach to their (approximate) true standard deviations. Since we are dealing with a finite population that we mass impute we have used a pseudo-population bootstrap approach to estimate the standard errors.

As in the rest of this paper, in our pseudo-population bootstrap approach we assume that all units in the population are (at least partly) observed, so the inclusion probability of each population unit is one. However, in each record the values of some variables may be missing. That missingness is caused by a random missingness process, which in this section – for computational reasons – is based on the MCAR mechanism for the Low missingness scenario rather than on the MAR mechanisms used in the rest of this paper.

Our situation differs from the usual situation considered in the literature on estimating the variance of estimators based on imputed data in the sense that missingness occurs in all our variables, whereas in the literature missingness is usually assumed to occur in only one variable. Little seems to be known about applying a pseudo-population bootstrap approach when several variables contain missingness. This means that (the application of) our pseudo-population bootstrap approach is somewhat experimental.

Our pseudo-population bootstrap approach is similar to the approach in Scholtus and Daalmans (2021), and is as follows:

  • The starting point is our complete population dataset without any missing values.

  • For \(s=1,\dots ,{N}_{mis}\) we do the following:

    1. Introduce missingness in our population dataset by means of the MCAR missingness procedure for the Low missingness scenario, and thus create a version, \({Pop}_{mis,s}\), of our population dataset that contains missing values.

    2. Impute the missing values by means of our mass imputation approach. This leads to an estimated total \({T}_{s\text{,imp}}\left(c;\, {\text{Occupation}}\right)\) for each category \(c\) of Occupation.

    3. Select the set of completely observed units \({Pop}_{mis,com,s}\) in \({Pop}_{mis,s}\).

    4. Create one complete pseudo-population of the same size (3,784 units) as our original population. Suppose that \({Pop}_{mis,com,s}\) consists of \({n}_{s}\) completely observed units. Since we have used an MCAR missingness mechanism, each unit is equally likely to be completely observed. We therefore start by creating \(\left\lfloor {\frac{3784 }{{n_{s} }}} \right\rfloor\) copies of each unit in \({Pop}_{mis,com,s}\), where \(\left\lfloor q \right\rfloor\) denotes the largest integer less than or equal to \(q\). The total number of units created in this way is very likely to be less than 3,784. If so, we then randomly draw some extra units from \({Pop}_{mis,com,s}\), where each unit may be drawn at most once more, until we have created a pseudo-population \({Pseudo}_{s}\) with 3,784 complete units.

    5. For \(b=1,\dots ,B\) we do the following:

       a. Introduce missingness in \({Pseudo}_{s}\) by means of our MCAR missingness procedure, so we obtain a pseudo-population \({Pseudo}_{mis,s}\) with missing values.

       b. Impute \({Pseudo}_{mis,s}\) by means of our mass imputation approach.

       c. Calculate the total \({T}_{s\text{,pseudo}}\left(c;\, {\text{Occupation}}\right)\) for each category \(c\) of Occupation in the imputed version of \({Pseudo}_{mis,s}\).

    6. For each category \(c\) of Occupation, calculate the bootstrap variance \({\text{var}}_{s,boot}\left(c\right)={\left(B-1\right)}^{-1}\sum_{b=1}^{B}{\left({T}_{s\text{,pseudo}}\left(c;\, {\text{Occupation}}\right)-{\overline{T} }_{s\text{,pseudo}}\left(c;\, {\text{Occupation}}\right)\right)}^{2}\), with \({\overline{T} }_{s\text{,pseudo}}\left(c;\, {\text{Occupation}}\right)={B}^{-1}\sum_{b=1}^{B}{T}_{s\text{,pseudo}}\left(c;\, {\text{Occupation}}\right)\).

    7. For each category \(c\) of Occupation, estimate the standard error of the estimated total of category \(c\) of Occupation by \({se}_{s,boot}\left(c\right)=\sqrt{{\text{var}}_{s,boot}\left(c\right)}\).

In our simulation study we have used \({N}_{mis}=250\) and \(B=200\). As noted in the literature, for variance estimation, \(B=200\) bootstrap replicates are often considered sufficient (see Efron and Tibshirani 1993, Sect. 6.4).
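A condensed R sketch of one outer replication \(s\) of this procedure is given below; add_missingness(), mass_impute() and occupation_totals() are hypothetical stand-ins for our MCAR missingness procedure, our mass imputation approach and the computation of the totals of the categories of Occupation.

```r
# One outer replication s of the pseudo-population bootstrap (Steps 1-7).
boot_one_rep <- function(pop, B = 200) {
  pop_mis  <- add_missingness(pop)                      # Step 1
  T_imp    <- occupation_totals(mass_impute(pop_mis))   # Step 2
  complete <- pop_mis[complete.cases(pop_mis), ]        # Step 3
  n_s      <- nrow(complete)
  pseudo   <- complete[rep(seq_len(n_s), nrow(pop) %/% n_s), ]  # Step 4: copies
  extra    <- sample(n_s, nrow(pop) - nrow(pseudo))     # extras, each at most once
  pseudo   <- rbind(pseudo, complete[extra, ])
  T_boot   <- replicate(B, occupation_totals(           # Step 5: inner loop
    mass_impute(add_missingness(pseudo))))
  se_boot  <- apply(T_boot, 1, sd)                      # Steps 6-7: sd() uses the
  list(T_imp = T_imp, se = se_boot)                     # (B - 1)^-1 denominator
}
```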

In Step 4 we create only one pseudo-population. In principle, it may be better to construct several pseudo-populations. However, previous results in Chauvet (2007) and Kuijvenhoven and Scholtus (2011) suggest that creating several pseudo-populations instead of only one hardly affects the estimated variances. In other words, creating a single pseudo-population as we did often leads to variance estimates of similar accuracy as creating several pseudo-populations.

We approximated true standard deviations for the categories of Occupation by introducing missingness into our original population by means of our MCAR procedure for the Low missingness scenario 2,500 times. To each of these 2,500 datasets with missing values we applied our mass imputation approach, calculated the totals for the categories of Occupation in the imputed datasets, and estimated the true standard deviations over the 2,500 datasets.

Our pseudo-population bootstrap approach to estimate the standard errors of the totals of Occupation is very time-consuming since we applied our iterative imputation method \({N}_{mis}\times B=250\times 200=\mathrm{50,000}\) times. Since we use ten iterations in our imputation method, we had to apply our imputation method \(\mathrm{500,000}\) times in total. For this reason we have used a MCAR missingness mechanism, instead of MAR missingness mechanisms as we did in the rest of our paper. Even with a MCAR missingness procedure and using parallelized R code on a PC with four computing cores, it took more than two months to run this simulation study.

We note that even with a Low missingness scenario the expected percentage of units with one or more missing values is about 48.66%, so only about 51.33% of the units, i.e. \(51.33\%\times 3784\approx 1942\) units, are expected to be completely observed and can be used to construct a pseudo-population.

Table 13 gives the averages of estimated standard errors over the \({N}_{mis}\) populations with missing values divided by the true (approximate) standard deviations for the estimated totals of the categories of Occupation. We see that in most cases the average of the estimated standard error divided by the corresponding true (approximate) standard deviation is close to, but a bit less than, one. This shows that in most cases the estimates for the variances based on our pseudo-population bootstrap approach are close to the true variances of the imputation approach. Exceptions are category “6” – the category with the smallest true total – for which the ratio is clearly less than one, and category “999” – the category with the highest true total – for which it is a bit larger than one.

Table 13 Averages of the estimated standard errors over the \({N}_{mis}=250\) populations with missing values divided by true (approximate) standard deviations of the categories of Occupation

Table 14 gives coverage rates of the estimated \(95\%\)-confidence intervals for the categories of Occupation. These \(95\%\)-confidence intervals are computed as

Table 14 Coverage rates of the estimated confidence intervals of the categories of Occupation
$$\left({T}_{s\text{,imp}}\left(c;\, {\text{Occupation}}\right)-1.96\times {se}_{s,boot}\left(c\right),{T}_{s\text{,imp}}\left(c;\, {\text{Occupation}}\right)+1.96\times {se}_{s,boot}\left(c\right)\right).$$

Note that the estimated \(95\%\)-confidence intervals are centered around \({T}_{s\text{,imp}}\left(c;\, {\text{Occupation}}\right)\), i.e. around totals obtained by applying our mass imputation procedure directly to \({Pop}_{mis,s}\) \(\left(s=1,\dots ,{N}_{mis}\right)\) and before constructing a pseudo-population. This way of computing the confidence intervals was suggested by a reviewer.

The coverage rates of the estimated \(95\%\)-confidence intervals should be close to the nominal rate of 95%, which is indeed the case for all categories except category “6”, as can be seen in Table 14.

7 Discussion

In this paper we have generalized the imputation approach of Favre et al. (2005) to multiple variables to be imputed. We have also tested three approximations for the imputation probabilities taking known totals into account. We have carried out a simulation study to examine the properties of our proposed methodology in several different situations.

The first conclusion that we can draw is that our proposed methodology does work in the sense that it allows us to impute multivariate missing data such that edits are satisfied and known totals are exactly preserved.

For MCAR missing data mechanisms (see the Supplementary Material to this paper), the univariate results for individual variables and results for two-dimensional cross-tables are (nearly) unbiased. For MAR missing data mechanisms, univariate results are also (nearly) unbiased. However, whereas results for cross-tables are (nearly) unbiased for MCAR data mechanisms, they are biased for MAR mechanisms. Also, the prediction accuracy for individual values is much lower for MAR mechanisms than for MCAR mechanisms. In order to preserve cross-tables better for MAR mechanisms one should use imputation models that capture statistical relations between variables better than our relatively simple multinomial imputation models do. For instance, for our MAR mechanisms we could build different multinomial imputation models for each of our three age classes (see Sect. 4.3). Testing such more advanced imputation models that better capture statistical relations between variables is a point for future research.

In Sect. 6 we examined the use of a pseudo-population bootstrap approach to estimate the standard errors of the totals of Occupation. For most categories of Occupation, the averages of the estimated standard errors obtained by the bootstrap approach were quite close to the corresponding (approximate) true standard deviations. Also, coverage rates of the estimated confidence intervals of the estimated totals for the categories of Occupation were for most categories quite close to the nominal rate of 95%. This suggests that our pseudo-population bootstrap approach generally gives reasonable estimates for the imputation variance.

In this paper, we focused on mass imputation as this is the most relevant situation for the Dutch Population Census. However, a modified version of the proposed imputation approach also seems useful for cases where a sample of the population is imputed and weighted sums of the imputed values have to sum up to known population totals for some variables. Favre et al. (2004, 2005) developed a variant of Cox's controlled rounding algorithm that is able to handle this situation. This variant can be included in our imputation approach by replacing the original version of Cox's controlled rounding algorithm with the variant developed by Favre et al. (2004). We leave this extension of our imputation approach to future work.