Abstract
Multivariate nonresponse is often encountered in complex survey sampling, and simply ignoring it leads to erroneous inference. In this paper, we propose a new matrix completion method for complex survey sampling. Unlike existing methods that conduct either row-wise or column-wise imputation, the proposed method treats the data matrix as a whole, which allows row and column patterns to be exploited simultaneously. A column-space-decomposition model is adopted, incorporating a low-rank structured matrix for the finite population with easy-to-obtain demographic information as covariates. In addition, we propose a computationally efficient projection strategy to identify the model parameters under complex survey sampling. An augmented inverse probability weighting estimator is then used to estimate the parameter of interest, and the corresponding asymptotic upper bound of the estimation error is derived. Simulation studies show that the proposed estimator has a smaller mean squared error than its competitors, and the corresponding variance estimator performs well. The proposed method is applied to assess the health status of the U.S. population.
References
Alaya, M. Z., Klopp, O. (2019). Collective matrix completion. Journal of Machine Learning Research, 20(148), 1–43.
Andridge, R. R., Little, R. J. (2010). A review of hot deck imputation for survey non-response. International Statistical Review, 78(1), 40–64.
Athreya, K. B., Lahiri, S. N. (2006). Measure theory and probability theory. New York: Springer.
Bi, X., Qu, A., Wang, J., Shen, X. (2017). A group-specific recommender system. Journal of the American Statistical Association, 112(519), 1344–1353.
Cai, T. T., Zhou, W.-X. (2016). Matrix completion via max-norm constrained optimization. Electronic Journal of Statistics, 10(1), 1493–1525.
Candès, E. J., Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6), 717–772.
Carpentier, A., Kim, A. K. (2018). An iterative hard thresholding estimator for low rank matrix recovery with explicit limiting distribution. Statistica Sinica, 28, 1371–1393.
Chang, T., Kott, P. S. (2008). Using calibration weighting to adjust for nonresponse under a plausible model. Biometrika, 95, 555–571.
Chen, J., Shao, J. (2000). Nearest neighbor imputation for survey data. Journal of Official Statistics, 16, 113–131.
Chen, T. C., Clark, J., Riddles, M. K., Mohadjer, L. K., Fakhouri, T. H. I. (2020). National health and nutrition examination survey, 2015–2018: Sample design and estimation procedures. National Center for Health Statistics. Vital Health Stat, 2(184), 1–26.
Chen, Y., Fan, J., Ma, C., Yan, Y. (2019). Inference and uncertainty quantification for noisy matrix completion. Proceedings of the National Academy of Sciences, 116(46), 22931–22937.
Chen, Y., Li, P., Wu, C. (2020). Doubly robust inference with nonprobability survey samples. Journal of the American Statistical Association, 115(532), 2011–2021.
Clogg, C. C., Rubin, D. B., Schenker, N., Schultz, B., Weidman, L. (1991). Multiple imputation of industry and occupation codes in census public-use samples using Bayesian logistic regression. Journal of the American Statistical Association, 86, 68–78.
Davenport, M. A., Romberg, J. (2016). An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4), 608–622.
Davenport, M. A., Plan, Y., van den Berg, E., Wootters, M. (2014). 1-bit matrix completion. Information and Inference, 3(3), 189–223.
Elliott, M. R., Valliant, R. (2017). Inference for nonprobability samples. Statistical Science, 32(2), 249–264.
Fan, J., Gong, W., Zhu, Z. (2019). Generalized high-dimensional trace regression via nuclear norm regularization. Journal of Econometrics, 212(1), 177–202.
Fay, R. E. (1992). When are inferences from multiple imputation valid? Proceedings of the survey research methods section of the American Statistical Association, 227–232. American Statistical Association.
Fletcher Mercaldo, S., Blume, J. D. (2018). Missing data and prediction: The pattern submodel. Biostatistics, 21(2), 236–252.
Foucart, S., Needell, D., Plan, Y., Wootters, M. (2017). De-biasing low-rank projection for matrix completion. Wavelets and sparsity XVII, Vol. 10394, p. 1039417. International Society for Optics and Photonics.
Fuller, W. A. (2009). Sampling statistics. Hoboken, NJ: Wiley.
Fuller, W. A., Kim, J. K. (2005). Hot deck imputation for the response model. Survey Methodology, 31, 139.
Harchaoui, Z., Douze, M., Paulin, M., Dudik, M., Malick, J. (2012). Large-scale image classification with trace-norm regularization. 2012 IEEE conference on computer vision and pattern recognition, 3386–3393. IEEE.
Horvitz, D. G., Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260), 663–685.
Isaki, C. T., Fuller, W. A. (1982). Survey design under the regression superpopulation model. Journal of the American Statistical Association, 77, 89–96.
Keiding, N., Louis, T. A. (2016). Perils and potentials of self-selected entry to epidemiological studies and surveys. Journal of the Royal Statistical Society: Series A (Statistics in Society), 179, 319–376.
Kim, E., Lee, M., Oh, S. (2015). Elastic-net regularization of singular values for robust subspace learning. Proceedings of the IEEE conference on computer vision and pattern recognition, 915–923.
Kim, J. K., Fuller, W. (2004). Fractional hot deck imputation. Biometrika, 91, 559–578.
Kim, J. K., Yu, C. L. (2011). A semiparametric estimation of mean functionals with nonignorable missing data. Journal of the American Statistical Association, 106, 157–165.
Kim, J. K., Brick, J., Fuller, W. A., Kalton, G. (2006). On the bias of the multiple-imputation variance estimator in survey sampling. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68, 509–521.
Koltchinskii, V., Lounici, K., Tsybakov, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Annals of Statistics, 39(5), 2302–2329.
Li, H., Chen, N., Li, L. (2012). Error analysis for matrix elastic-net regularization algorithms. IEEE Transactions on Neural Networks and Learning Systems, 23(5), 737–748.
Liu, W., Mao, X., Wong, R. K. W. (2020). Median matrix completion: From embarrassment to optimality. Proceedings of the 37th International Conference on Machine Learning, Vol. 119, 6294–6304.
Mao, X., Chen, S. X., Wong, R. K. (2019). Matrix completion with covariate information. Journal of the American Statistical Association, 114(525), 198–210.
Mao, X., Wong, R. K., Chen, S. X. (2021). Matrix completion under low-rank missing mechanism. Statistica Sinica, 31(4), 2005–2030.
Mazumder, R., Hastie, T., Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11, 2287–2322.
Meng, X.-L. (1994). Multiple-imputation inferences with uncongenial sources of input. Statistical Science, 9, 538–558.
Molenberghs, G., Michiels, B., Kenward, M. G., Diggle, P. J. (1998). Monotone missing data and pattern-mixture models. Statistica Neerlandica, 52(2), 153–161.
Negahban, S., Wainwright, M. J. (2012). Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13(1), 1665–1697.
Nielsen, S. F. (2003). Proper and improper multiple imputation. International Statistical Review, 71, 593–607.
Pfeffermann, D. (1993). The role of sampling weights when modeling survey data. International Statistical Review, 61, 317–337.
Qin, J., Zhang, B., Leung, D. H. (2017). Efficient augmented inverse probability weighted estimation in missing data problems. Journal of Business & Economic Statistics, 35(1), 86–97.
Rao, J. N. K., Shao, J. (1999). Modified balanced repeated replication for complex survey data. Biometrika, 86(2), 403–415.
Robin, G., Klopp, O., Josse, J., Moulines, É., Tibshirani, R. (2020). Main effects and interactions in mixed and incomplete data frames. Journal of the American Statistical Association, 115(531), 1292–1303.
Robins, J. M., Rotnitzky, A., Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846–866.
Robins, J. M., Rotnitzky, A., Zhao, L. P. (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90, 106–121.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.
Rubin, D. B. (1978). Multiple imputations in sample surveys—a phenomenological Bayesian approach to nonresponse. Proceedings of the survey research methods section of the American Statistical Association, Vol. 1, 20–34. American Statistical Association.
Sengupta, N., Srebro, N., Evans, J. (2021). Simple surveys: Response retrieval inspired by recommendation systems. Social Science Computer Review, 39(1), 105–129.
Sun, T., Zhang, C.-H. (2012). Calibrated elastic regularization in matrix completion. Advances in Neural Information Processing Systems, 25, 863–871.
Sweeting, T. (1980). Uniform asymptotic normality of the maximum likelihood estimator. Annals of Statistics, 8(6), 1375–1381.
Tan, Z. (2013). Simple design-efficient calibration estimators for rejective and high-entropy sampling. Biometrika, 100(2), 399–415.
Tang, G., Little, R. J., Raghunathan, T. E. (2003). Analysis of multivariate missing data with nonignorable nonresponse. Biometrika, 90, 747–764.
van Buuren, S., Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67.
van der Linden, W. J., Hambleton, R. K. (2013). Handbook of modern item response theory. New York, NY: Springer.
Wang, N., Robins, J. M. (1998). Large-sample theory for parametric multiple imputation procedures. Biometrika, 85, 935–948.
Wang, S., Shao, J., Kim, J. K. (2014). An instrument variable approach for identification and estimation with nonignorable nonresponse. Statistica Sinica, 24(3), 1097–1116.
Wang, Z., Peng, L., Kim, J. K. (2022). Bootstrap inference for the finite population mean under complex sampling designs. Journal of the Royal Statistical Society: Series B (Statistical Methodology), Accepted.
Wu, C. (2003). Optimal calibration estimators in survey sampling. Biometrika, 90(4), 937–951.
Yang, S., Kim, J. K. (2016). A note on multiple imputation for method of moments estimation. Biometrika, 103(1), 244–251.
Yang, S., Wang, L., Ding, P. (2019). Causal inference with confounders missing not at random. Biometrika, 106(4), 875–888.
Zhang, C., Taylor, S. J., Cobb, C., Sekhon, J. (2020). Active matrix factorization for surveys. Annals of Applied Statistics, 14(3), 1182–1206.
Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
Acknowledgements
We are grateful to two referees and the Associate Editor for their constructive comments which have greatly improved the paper. Mao is partially supported by NSFC (No.: 12001109, 92046021) and the Science and Technology Commission of Shanghai Municipality grant 20dz1200600. Wang is partially supported by NSFC (No.: 11901487, 72033002) and the Fundamental Scientific Center of National Natural Science Foundation of China Grant No. 71988101. Yang is partly supported by the NSF DMS 1811245, NIH 1R01AG066883 and 1R01ES031651.
Appendices
A Technical conditions
The technical conditions needed for our analysis are given as follows.
C1. (a) The random errors \(\{\epsilon _{ij}:i=1,\ldots ,N;j=1,\ldots ,L\}\) in (2) are independently distributed random variables such that \(E(\epsilon _{ij})=0\) and \(E(\epsilon ^2_{ij})=\sigma _{ij}^2<\infty\) for all i, j. (b) For some finite positive constants \(c_{\sigma }\) and \(\eta\), \({\max }_{i,j}E|\epsilon _{ij}|^{l}\le \frac{1}{2}l!c_{\sigma }^{2}\eta ^{l-2}\) for any positive integer \(l\ge 2\).
C2. The inclusion probability satisfies \(\pi _i\asymp nN^{-1}\) for \(i=1,\ldots ,N\).
C3. The population design matrix \({\varvec{X}}_N\) is of size \(N\times d\) with \(N>d\). Moreover, there exists a positive constant \(a_{x}\) such that \(\Vert {\varvec{X}}_N\Vert _{\infty }\le a_{x}\) and \({\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N\) is invertible, where \({\varvec{D}}_N\) is a diagonal matrix with \(\pi _i\) as its (i, i)th entry. Furthermore, there exists a symmetric matrix \({\varvec{S}}_{{\varvec{X}}}\) with \(\sigma _{\min }({\varvec{S}}_{{\varvec{X}}})\asymp 1\asymp \Vert {\varvec{S}}_{{\varvec{X}}}\Vert\) such that \(n_0^{-1}{\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N\rightarrow {\varvec{S}}_{{\varvec{X}}}\) as \(N\rightarrow \infty\), where \(n_0=\sum _{i=1}^N\pi _i\) is the expected sample size.
C4. There exists a positive constant a such that \(\max \{\Vert {\varvec{X}}_N\varvec{\beta }^{*}\Vert _{\infty },\Vert {\varvec{A}}_N\Vert _{\infty }\}\le a\).
C5. The indicators of observed entries \(\{r_{ij}:i=1,\ldots ,N;j=1,\ldots ,L\}\) are mutually independent, \(r_{ij}\sim \text {Bern}(p_{ij})\) for \(p_{ij}\in (0,1)\), and are independent of \(\{\epsilon _{ij}\}_{i,j=1}^{N,L}\) given \({\varvec{X}}_N\). Furthermore, for \(i=1,\dots ,N\) and \(j=1,\dots ,L\), \(\Pr (r_{ij}=1 | {\varvec{x}}_{i}, y_{ij}) = \Pr (r_{ij}=1 | {\varvec{x}}_{i})\) follows the logistic regression model (7).
C6. There exists a lower bound \(p_{\min }\in (0,1)\) such that \({\min }_{i,j}\{p_{ij}\}\ge p_{\min }>0\), where \(p_{\min }\) is allowed to depend on n and L. The number of questions \(L\le n\).
C7. The sampling design satisfies \(N^{-1}\sum _{i=1}^NI_i\pi _i^{-1}y_i - N^{-1}\sum _{i=1}^Ny_i= O_p(n^{-1/2})\) if \(N^{-1}\sum _{i=1}^Ny_i^{2}\) is asymptotically bounded, where \(I_i\) is the sampling indicator of the ith element.
Condition C1(a) is a common regularity condition on the measurement errors in \(\varvec{\epsilon }_{N}\), and C1(b) is the Bernstein condition (Koltchinskii et al., 2011). Condition C2 is widely used in survey sampling and regulates the inclusion probabilities of a sampling design (Fuller, 2009). In Condition C3, the requirement \(N>d\) is easily met, as the number of questions in a survey is usually fixed and the population size is typically much larger. Since the dimension of \(n_0^{-1}{\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N\) is fixed at \(d\times d\), it is mild to assume that \({\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N\) is invertible and that a symmetric matrix \({\varvec{S}}_{{\varvec{X}}}\) exists as the limit of \(n_0^{-1}{\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N\). Note that we do not assume randomness in generating \({\varvec{X}}_N\); this is a common assumption in the design-based framework. Furthermore, the sample size is often larger than the number of questions, that is, \(n>d\), and together with Condition C2 it is not hard to show that the probability limit of \(n^{-1}{\varvec{X}}_n^{\mathrm{T}}{\varvec{X}}_n\) is also \({\varvec{S}}_{{\varvec{X}}}\) under regularity conditions. That \(\sigma _{\min }({\varvec{S}}_{{\varvec{X}}})\) and \(\Vert {\varvec{S}}_{{\varvec{X}}}\Vert\) are of order 1 follows from \(\Vert {\varvec{X}}_N\Vert _{\infty }<\infty\). Condition C4 is also standard in the matrix completion literature (Cai and Zhou, 2016; Koltchinskii et al., 2011; Negahban and Wainwright, 2012); in particular, it is reasonable to assume that all responses are bounded in survey sampling. Condition C5 describes the independent Bernoulli model for the response indicator of observing \(y_{ij}\), where the probability of observation \(p_{ij}\) follows the logistic model (7). In Condition C6, the lower bound \(p_{\min }\) is allowed to tend to 0 as n and L grow; this is more general than we need for a typical survey, where \(p_{\min }\asymp 1\) suffices. Typically, the number of questions L grows more slowly than the number of participants n in survey sampling, so the assumption \(L\le n\) is quite mild. Condition C7 is a mild restriction on the estimator of the population mean and is satisfied under general sampling designs. To keep the results general, we make no assumption on the asymptotic relationship between the population size N and the sample size n; see Theorem 1 for details. Further assumptions on the sample sizes can guarantee certain convergence properties; see the discussion of Theorem 3.
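As an illustration of Condition C7, the following is a minimal simulation sketch (not from the paper; the toy population, seed, and all variable names are illustrative) showing that under Poisson sampling with \(\pi_i\asymp nN^{-1}\) the Horvitz–Thompson estimator of a population mean has error of order \(n^{-1/2}\):

```python
# Hedged sketch: Horvitz-Thompson estimation under Poisson sampling,
# illustrating the O_p(n^{-1/2}) error rate assumed in Condition C7.
import numpy as np

rng = np.random.default_rng(0)

def ht_mean(y, pi, included):
    """Horvitz-Thompson estimator of the population mean N^{-1} sum_i y_i."""
    N = len(y)
    return np.sum(included * y / pi) / N

N, n_expected = 100_000, 1_000
y = rng.normal(loc=2.0, scale=1.0, size=N)   # outcomes with bounded second moment
pi = np.full(N, n_expected / N)              # pi_i of order n/N (Condition C2)

errors = []
for _ in range(200):
    included = rng.random(N) < pi            # Poisson sampling indicators I_i
    errors.append(ht_mean(y, pi, included) - y.mean())
errors = np.asarray(errors)

# The standard deviation of the error is on the order of n^{-1/2}.
print(errors.std(), n_expected ** -0.5)
```

The estimator is design-unbiased, and the simulated error spread matches the \(n^{-1/2}\) scale.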
B Lemmas
Under the logistic model (7), together with the results in Mao et al. (2019) and Sweeting (1980), it can be shown that for all \(t>d+3\), there exist some positive constants \(C_g\), C and \(C_{d}\) such that \(\Pr \{\sum _{ij}(1/\widehat{p}_{ij}-1/p_{ij})^{2}\ge C_g p_{\min }^{-3}t\}\le C_{d}t\exp \{-t/2\}+L\max _{j}\sup _{t}|\Pr \{\sum _{i}(1/\widehat{p}_{ij}-1/p_{ij})^{2}\ge t\}-\Pr (\chi ^{2}_{d+1}\ge C_g p_{\min }^{-3}t)|\). Then, \(\max _{j}\sup _{t}|\Pr \{\sum _{i}(1/\widehat{p}_{ij}-1/p_{ij})^{2}\ge t\}-\Pr (\chi ^{2}_{d+1}\ge C_g p_{\min }^{-3}t)|\le L^{-2}\) and \(C_{d}t\exp \{-t/2\}\) is a function independent of n and L such that \(\lim _{t\rightarrow \infty }t\exp \{-t/2\}= 0\).
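The bound above concerns the plug-in probabilities \(\widehat{p}_{ij}\) obtained by fitting the logistic response model (7), one fit per question j. A minimal numpy sketch of such a fit (a plain Newton–Raphson on the logistic likelihood; the toy data and all names are illustrative, not the paper's implementation) is:

```python
# Hedged sketch: estimating observation probabilities p_ij by fitting, for each
# question j, a logistic regression of the response indicators r_ij on x_i.
import numpy as np

def fit_logistic(X, r, n_iter=25):
    """Return the MLE of beta in P(r=1|x) = 1/(1+exp(-x'beta)) via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        # Newton step: beta <- beta + (X'WX)^{-1} X'(r - p)
        beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (r - p))
    return beta

rng = np.random.default_rng(1)
n, d, L = 2_000, 3, 5
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=(n, d - 1))])
beta_true = np.array([1.0, -0.5, 0.5])
p_true = 1.0 / (1.0 + np.exp(-X @ beta_true))

p_hat = np.empty((n, L))
for j in range(L):                            # one logistic fit per question j
    r_j = (rng.random(n) < p_true).astype(float)
    beta_hat = fit_logistic(X, r_j)
    p_hat[:, j] = 1.0 / (1.0 + np.exp(-X @ beta_hat))
```

The inverse probabilities \(1/\widehat{p}_{ij}\) then enter the AIPW estimator; the \(\chi^2_{d+1}\) approximation above controls how far they can stray from \(1/p_{ij}\).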
Write \({\varvec{J}}_{ij}={\varvec{e}}_{i}(n_1){\varvec{e}}^\intercal _{j}(n_2)\), where \({\varvec{e}}_{i}(n)\in {\mathbb {R}}^n\) is the standard basis vector with the i-th element being 1 and the rest being 0. Now we present several lemmas.
Lemma 1
Under Conditions C2 and C3 and Poisson sampling, we have \(n^{-1}{\varvec{X}}_n^{\mathrm{T}}{\varvec{X}}_n\rightarrow {\varvec{S}}_{{\varvec{X}}}\) in probability as \(N\rightarrow \infty\).
Proof of Lemma 1
Let \({\varvec{e}}_i\) be a column vector of length d with ith element being 1 and others being 0. Recall that \(n_0 = \sum _{k=1}^N\pi _k\) is the expected sample size. For \(i=1,\ldots ,d\) and \(j=1,\ldots ,d\), consider
where the expectation is taken with respect to the sampling design, and \({\varvec{x}}_i\) is the ith row of \({\varvec{X}}_N\).
Under Poisson sampling, we have
By a similar argument in Wang et al. (2019), we can show that \(n/n_0\rightarrow 1\) in probability as \(N\rightarrow \infty\) under Condition C2. By Condition C3, (16) and (17), we have proved Lemma 1. \(\square\)
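A quick numerical check of the convergence discussed after Condition C3 and established here (an illustrative sketch with a toy design matrix, not from the paper): under Poisson sampling with \(\pi_i\asymp nN^{-1}\), the sample Gram matrix \(n^{-1}{\varvec{X}}_n^{\mathrm{T}}{\varvec{X}}_n\) is close to \(n_0^{-1}{\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N\).

```python
# Hedged sketch: under Poisson sampling, n^{-1} X_n' X_n approximates
# n_0^{-1} X_N' D_N X_N, the quantity appearing in Condition C3.
import numpy as np

rng = np.random.default_rng(3)
N, d, n_expected = 200_000, 4, 5_000
X_N = rng.uniform(-1, 1, size=(N, d))        # bounded covariates, as in C3
pi = np.full(N, n_expected / N)              # inclusion probabilities (C2)

target = (X_N.T * pi) @ X_N / pi.sum()       # n_0^{-1} X_N' D_N X_N
included = rng.random(N) < pi                # Poisson sampling indicators
X_n = X_N[included]
sample_gram = X_n.T @ X_n / included.sum()   # n^{-1} X_n' X_n

print(np.max(np.abs(sample_gram - target)))  # small for large n
```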
Lemma 2
Let \(\Psi ^{(1)}=\sum _{ij}r_{ij}\epsilon _{ij}{\varvec{J}}_{ij}/(nL\widehat{p}_{ij}\pi _{ij}^{1/2})\). Under Conditions C1, C5 and C6 and Poisson sampling, for some positive constants \(C_1\), \(c_{\sigma }\), \(\eta\), \(\delta _{\sigma }\) and all \(t>d+3\), we have
holds with probability at least \(1-1/(n+L)-C_{d}t\exp \{-t/2\}-1/L-12c_{\sigma }^{2}\eta ^{2}\log ^{-\delta _{\sigma }}(n)\).
Lemma 3
Let \(\Psi ^{(2)}=\sum _{ij}a_{ij}(r_{ij}/p_{ij}-1){\varvec{J}}_{ij}/(nL\pi _{ij}^{1/2})\). Under Conditions C4–C6 and Poisson sampling, for some positive constant \(C_2\), we have
holds with probability at least \(1-1/(n+L)\).
Lemma 4
Let \(\Psi ^{(3)}=\sum _{ij}a_{ij}(r_{ij}/{\widehat{p}}_{ij}-r_{ij}/p_{ij}){\varvec{J}}_{ij}/(nL\pi _{ij}^{1/2})\). Under Conditions C4 and C6 and Poisson sampling, for some positive constants \(C_3\), \(\delta _{\sigma }\) and all \(t>d+3\), we have
holds with probability at least \(1-C_{d}t\exp \{-t/2\}-1/L\).
Lemmas 2–4 follow easily from the proofs of Lemmas S4.1–S4.3 in the supplementary material of Mao et al. (2019).
C Proofs
1.1 C.1 Proof of Theorem 1
With the definition of \(\Delta (\delta _{\sigma },t)\) in (12), under Conditions C1–C6 and Poisson sampling, together with Lemmas 2–4, we have, for a positive constant \(C_0\),
with probability at least \(1-2/n-2C_{d}t\exp \{-t/2\}-2/L-12c_{\sigma }^{2}\eta ^{2}\log ^{-\delta _{\sigma }}(n)\).
Let \({\varvec{X}}_n^{\prime }={\varvec{D}}_n^{-1/2}{\varvec{X}}_n\) and \(\eta _{n,L}(\delta _{\sigma },t) = 4/(n+L)+ 4C_{d}t\exp \{-t/2\}+4/L+C\log ^{-\delta _{\sigma }}(n)\) for a positive constant C. Choosing t as in (13), \(\tau _1\asymp N^{-1}nL^{-1}\log ^{-1/2}(n)\Delta (\delta _{\sigma })\) and \(\tau _2\asymp \eta _{g}^{-1/2}N^{-1}n^{1/4}L^{-1/4}\log ^{1/2}(L)\log ^{\delta _{\sigma }/3}(n)\), where \(\Delta (\delta _{\sigma })=N^{1/2}n^{-1}L^{-1}\log ^{1/2}(n)p_{\min }^{-1/2}\) and \(1-\alpha \asymp (nL)^{-1}\) in (8) for any \(\delta _{\sigma }>0\), together with Condition C2 and Poisson sampling, the same argument as in the proof of Corollary 1 of Mao et al. (2019) shows that, for some constants \(C_1\) and \(C_2\), with probability at least \(1-\eta _{n,L}(\delta _{\sigma },t)\),
Thus it is easy to obtain that
under Condition C3. \(\square\)
1.2 C.2 Proof of Theorem 2
By the observations that
together with Theorem 1, it is easy to obtain the result under Condition C2 and Poisson sampling.\(\square\)
1.3 C.3 Proof of Theorem 3
Denote
for \(j=1,\ldots ,L\). The difference between \(\widehat{\theta }_{j,{\textit{AIPW}}}\) in (10) and \({\tilde{\theta }}_{j,{\textit{AIPW}}}\) in (18) is that we use estimators \(\widehat{N}\), \(\widehat{p}_{ij}\) and \(\widehat{a}_{ij}\) for \(\widehat{\theta }_{j,{\textit{AIPW}}}\) but use true values N, \(p_{ij}\) and \(a_{ij}\) for \({\tilde{\theta }}_{j,{\textit{AIPW}}}\). The difference between \(\widehat{\theta }_{j,{\textit{AIPW}}}\) and \({\theta }^\dag _{j,{\textit{AIPW}}}\) in (19) is that we use \(\widehat{N}\) for \(\widehat{\theta }_{j,{\textit{AIPW}}}\), but use N for \({\theta }^\dag _{j,{\textit{AIPW}}}\).
First, we prove
Consider
where the first equality holds due to \(E(r_{ij}) =p_{ij}\). Next, we derive the variance of \({\tilde{\theta }}_{j,{\textit{AIPW}}}\). Specifically, we have
where \(S=\{I_i:i=1,\ldots ,N\}\).
Because \(E(r_{ij}) = p_{ij}\), we have
Thus, we have
where the last equality holds by Conditions C1, C2 and the strong law of large numbers (Athreya and Lahiri, 2006). Notice that
By the models (1)–(2) and Condition C4, we can show that \(N^{-1}\sum _{i=1}^Ny_i^2\) is asymptotically bounded. Thus, by Condition C7, we have
By (21)–(24), we have shown (20).
Next, we show that
Consider
Consider
where the asymptotic order in (28) holds due to Condition C7, and \(N^{-1}\sum _{i=1}^N(y_{ij}-a_{ij})^2\) is asymptotically bounded in probability since \(\{\epsilon _{ij}:i=1,\ldots ,N\}\) are independent and their variances are uniformly bounded. By (27) and (28), we have
Because the response model (7) for \(p_{ij}\) is assumed to be correctly specified, and \(p_{ij}\) is bounded away from 0 by Condition C6, we have \(\widehat{p}_{ij}^{-1}(p_{ij}-\widehat{p}_{ij}) = O_p(1)\) uniformly for \(i=1,\ldots ,N\). Thus, by (29), we have
Since \(\widehat{p}_{ij}=p_{ij}+o_p(1)\), we have
Since \(p_{ij}\ge p_{\mathrm {min}}>0\) by Condition C6, we have
uniformly for \(i=1,\ldots ,N\).
Thus, by Condition C2, (31) and (33), we have
where \(\pi _i\ge C_6^{-1}nN^{-1}\) for \(C_6>0\) by Condition C2, and we have assumed that the first n subjects are sampled. By (20), (34) and Theorem 2 and the fact that \(L\le n\), we have proved (25).
By Condition C4 and the fact that \(E(\epsilon ^2_{ij})<\sigma _0^2\) uniformly, \(\theta _j\) is uniformly bounded for \(j=1,\ldots ,L\) in probability. Thus, by (20) and (25), we conclude that
By Condition C7, we conclude that \(\widehat{N}N^{-1}=1+O_p(n^{-1/2})\). Consider
where the first equality holds since \(\widehat{N}N^{-1}=1+O_p(n^{-1/2})\) uniformly for \(j=1,\ldots ,L\), and the second equality holds by (35). Thus, by (34) and (36), we have proved Theorem 3. \(\square\)
D Plug-in variance estimators
When deriving the plug-in variance estimator, we ignore the variability for estimating \(\widehat{a}_{ij}\). First, we consider the plug-in variance estimator under Poisson sampling. For \(j=1,\ldots ,L\), let
be the estimating function for the AIPW estimator with \(\widehat{a}_{ij}\) replaced by \({a}_{ij}\). Let \({\tilde{\theta }}_{j,{\textit{AIPW}}}\) solve \(g_j(\theta )=0\); we use a variance estimator of \({\tilde{\theta }}_{j,{\textit{AIPW}}}\) to approximate that of \(\widehat{\theta }_{j,{\textit{AIPW}}}\).
It can be shown that \({\tilde{\theta }}_{j,{\textit{AIPW}}}-\theta _j=O_p(n^{-1/2})\), so we have
where \(g_j'(\theta ) = -N^{-1}\sum _{i=1}^NI_i\pi _i^{-1}\) is the derivative of \(g_j(\theta )\). By a similar argument for (27)–(28), we can show that \(g_j'(\theta _j)\rightarrow -1\) in probability. Besides, by (37), we have
Thus, the variance of \({\tilde{\theta }}_{j,{\textit{AIPW}}}\) can be estimated by that of \(g_j(\theta _j)\).
Consider
Since \(E(r_{ij}) = p_{ij}\), we have
Thus, we have
and it can be estimated by
Notice that
Under Poisson sampling,
Thus,
and it can be estimated by
By (38)–(42) and plugging in \(\widehat{a}_{ij}\) and \(\widehat{\theta }_{j,{\textit{AIPW}}}\) for \(a_{ij}\) and \(\theta _j\), respectively, the plug-in variance estimator for \({\widehat{\theta }}_{j,{\textit{AIPW}}}\) is
under Poisson sampling.
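The point estimator underlying this variance estimator can be sketched numerically. The form below, Horvitz–Thompson weighting of \(a_{ij}\) plus an inverse-probability-weighted residual, normalised by \(\widehat{N}=\sum_{i\in S}\pi_i^{-1}\), is an illustrative reconstruction implied by the discussion around (10), (18) and (19), not the paper's exact display; the toy data are likewise assumptions.

```python
# Hedged sketch of the AIPW estimator of theta_j = N^{-1} sum_i y_ij
# from a Poisson sample with missing responses.
import numpy as np

def aipw_mean(y, a, r, p, pi, included):
    """AIPW estimate: N_hat^{-1} sum_{i in S} pi_i^{-1} {a_ij + r_ij (y_ij - a_ij)/p_ij}."""
    w = included / pi                            # design weights I_i / pi_i
    N_hat = w.sum()                              # estimated population size
    return np.sum(w * (a + r * (y - a) / p)) / N_hat

rng = np.random.default_rng(2)
N, n_expected = 50_000, 2_000
a = rng.normal(size=N)                           # working values a_ij
y = a + rng.normal(scale=0.5, size=N)            # y_ij = a_ij + eps_ij
p = np.clip(1.0 / (1.0 + np.exp(-a)), 0.2, 0.9)  # response probabilities p_ij (MAR)
pi = np.full(N, n_expected / N)                  # Poisson inclusion probabilities

included = rng.random(N) < pi                    # sampling indicators I_i
r = (rng.random(N) < p).astype(float)            # response indicators r_ij
theta_hat = aipw_mean(y, a, r, p, pi, included)
print(theta_hat, y.mean())
```

In practice \(p_{ij}\) and \(a_{ij}\) are replaced by \(\widehat{p}_{ij}\) and \(\widehat{a}_{ij}\), as in the plug-in variance estimator above.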
Using a similar argument, we can show that the plug-in variance estimator is
under simple random sampling, while the corresponding estimator is
under probability-proportional-to-size sampling, where S is the index set of the sample, and \(q_i\) is the selection probability of the ith element.
E Balanced repeated replication method
Consider a stratified multi-stage sampling design with two clusters selected per stratum in the first stage. Denote by \(w_{hik}\) the survey weight associated with \(y_{hik}\), the kth sample element in the ith cluster of the hth stratum. The basic idea of the modified balanced repeated replication is to apply the same estimation method with "reconstructed" survey weights, that is,
where \(\epsilon \in (0,1)\) is a predefined constant, and \(\delta _{rh}=1\) or \(\delta _{rh} = -1\) for the rth repetition. A set of R repetitions is said to be balanced if \(\sum _{r=1}^R\delta _{rh}\delta _{rh'}=0\) for \(h\ne h'\). The \(R\times H\) matrix \((\delta _{rh})_{R\times H}\) can be obtained from a Hadamard matrix; see Rao and Shao (1999) for details on the modified balanced repeated replication method. In the simulation study and real data application, we choose \(\epsilon = 1/2\) and \(R =H\).
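The replicate-weight construction above can be sketched as follows. The scaling \(w_{hik}(1+\epsilon \delta_{rh})\) for the first cluster and \(w_{hik}(1-\epsilon \delta_{rh})\) for the second is the usual Rao–Shao form and is an assumption here, since the paper's display is not reproduced above; the toy sample layout is likewise illustrative.

```python
# Hedged sketch: modified BRR replicate weights built from a Hadamard matrix.
import numpy as np

def sylvester_hadamard(k):
    """Hadamard matrix of order 2**k via the Sylvester construction."""
    H = np.array([[1]])
    for _ in range(k):
        H = np.block([[H, H], [H, -H]])
    return H

def brr_weights(w, stratum, cluster, delta, eps=0.5):
    """Return an (R, n) array of replicate weights w^{(r)}_{hik}."""
    sign = np.where(cluster == 1, 1.0, -1.0)     # +1 for cluster 1, -1 for cluster 2
    # w^{(r)}_{hik} = w_{hik} * (1 + eps * delta_{rh} * sign_i)
    return w * (1.0 + eps * delta[:, stratum] * sign)

H_strata, eps = 8, 0.5
delta = sylvester_hadamard(3)                    # R = H = 8 repetitions
# balance: sum_r delta_rh delta_rh' = 0 for h != h'
assert (delta.T @ delta == 8 * np.eye(8)).all()

# toy sample: one element per cluster, two clusters per stratum
stratum = np.repeat(np.arange(H_strata), 2)
cluster = np.tile([1, 2], H_strata)
w = np.full(stratum.size, 100.0)                 # base survey weights
w_rep = brr_weights(w, stratum, cluster, delta, eps)
print(w_rep.shape)
```

Each row of `w_rep` is one replicate's weight vector; re-running the point estimator with each row and combining the replicate estimates yields the BRR variance estimate.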
Mao, X., Wang, Z. & Yang, S. Matrix completion under complex survey sampling. Ann Inst Stat Math 75, 463–492 (2023). https://doi.org/10.1007/s10463-022-00851-5