
Matrix completion under complex survey sampling


Abstract

Multivariate nonresponse is often encountered in complex survey sampling, and simply ignoring it leads to erroneous inference. In this paper, we propose a new matrix completion method for complex survey sampling. Unlike existing works that conduct either row-wise or column-wise imputation, we treat the data matrix as a whole, which allows row and column patterns to be exploited simultaneously. A column-space-decomposition model is adopted, incorporating a low-rank structured matrix for the finite population with easy-to-obtain demographic information as covariates. In addition, we propose a computationally efficient projection strategy to identify the model parameters under complex survey sampling. An augmented inverse probability weighting estimator is then used to estimate the parameter of interest, and the corresponding asymptotic upper bound on the estimation error is derived. Simulation studies show that the proposed estimator has a smaller mean squared error than its competitors, and the corresponding variance estimator performs well. The proposed method is applied to assess the health status of the U.S. population.


References

  • Alaya, M. Z., Klopp, O. (2019). Collective matrix completion. Journal of Machine Learning Research, 20(148), 1–43.

  • Andridge, R. R., Little, R. J. (2010). A review of hot deck imputation for survey non-response. International Statistical Review, 78(1), 40–64.

  • Athreya, K. B., Lahiri, S. N. (2006). Measure theory and probability theory. New York: Springer.

  • Bi, X., Qu, A., Wang, J., Shen, X. (2017). A group-specific recommender system. Journal of the American Statistical Association, 112(519), 1344–1353.

  • Cai, T. T., Zhou, W.-X. (2016). Matrix completion via max-norm constrained optimization. Electronic Journal of Statistics, 10(1), 1493–1525.

  • Candès, E. J., Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6), 717–772.

  • Carpentier, A., Kim, A. K. (2018). An iterative hard thresholding estimator for low rank matrix recovery with explicit limiting distribution. Statistica Sinica, 28, 1371–1393.

  • Chang, T., Kott, P. S. (2008). Using calibration weighting to adjust for nonresponse under a plausible model. Biometrika, 95, 555–571.

  • Chen, J., Shao, J. (2000). Nearest neighbor imputation for survey data. Journal of Official Statistics, 16, 113–131.

  • Chen, T. C., Clark, J., Riddles, M. K., Mohadjer, L. K., Fakhouri, T. H. I. (2020). National Health and Nutrition Examination Survey, 2015–2018: Sample design and estimation procedures. National Center for Health Statistics. Vital and Health Statistics, 2(184), 1–26.

  • Chen, Y., Fan, J., Ma, C., Yan, Y. (2019). Inference and uncertainty quantification for noisy matrix completion. Proceedings of the National Academy of Sciences, 116(46), 22931–22937.

  • Chen, Y., Li, P., Wu, C. (2020). Doubly robust inference with nonprobability survey samples. Journal of the American Statistical Association, 115(532), 2011–2021.

  • Clogg, C. C., Rubin, D. B., Schenker, N., Schultz, B., Weidman, L. (1991). Multiple imputation of industry and occupation codes in census public-use samples using Bayesian logistic regression. Journal of the American Statistical Association, 86, 68–78.

  • Davenport, M. A., Romberg, J. (2016). An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4), 608–622.

  • Davenport, M. A., Plan, Y., van den Berg, E., Wootters, M. (2014). 1-bit matrix completion. Information and Inference, 3(3), 189–223.

  • Elliott, M. R., Valliant, R. (2017). Inference for nonprobability samples. Statistical Science, 32(2), 249–264.

  • Fan, J., Gong, W., Zhu, Z. (2019). Generalized high-dimensional trace regression via nuclear norm regularization. Journal of Econometrics, 212(1), 177–202.

  • Fay, R. E. (1992). When are inferences from multiple imputation valid? Proceedings of the Survey Research Methods Section of the American Statistical Association, 227–232. American Statistical Association.

  • Fletcher Mercaldo, S., Blume, J. D. (2018). Missing data and prediction: The pattern submodel. Biostatistics, 21(2), 236–252.

  • Foucart, S., Needell, D., Plan, Y., Wootters, M. (2017). De-biasing low-rank projection for matrix completion. Wavelets and Sparsity XVII, Vol. 10394, p. 1039417. International Society for Optics and Photonics.

  • Fuller, W. A. (2009). Sampling statistics. Hoboken, NJ: Wiley.

  • Fuller, W. A., Kim, J. K. (2005). Hot deck imputation for the response model. Survey Methodology, 31, 139.

  • Harchaoui, Z., Douze, M., Paulin, M., Dudik, M., Malick, J. (2012). Large-scale image classification with trace-norm regularization. 2012 IEEE Conference on Computer Vision and Pattern Recognition, 3386–3393. IEEE.

  • Horvitz, D. G., Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260), 663–685.

  • Isaki, C. T., Fuller, W. A. (1982). Survey design under the regression superpopulation model. Journal of the American Statistical Association, 77, 89–96.

  • Keiding, N., Louis, T. A. (2016). Perils and potentials of self-selected entry to epidemiological studies and surveys. Journal of the Royal Statistical Society: Series A (Statistics in Society), 179, 319–376.

  • Kim, E., Lee, M., Oh, S. (2015). Elastic-net regularization of singular values for robust subspace learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 915–923.

  • Kim, J. K., Fuller, W. (2004). Fractional hot deck imputation. Biometrika, 91, 559–578.

  • Kim, J. K., Yu, C. L. (2011). A semiparametric estimation of mean functionals with nonignorable missing data. Journal of the American Statistical Association, 106, 157–165.

  • Kim, J. K., Brick, J., Fuller, W. A., Kalton, G. (2006). On the bias of the multiple-imputation variance estimator in survey sampling. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68, 509–521.

  • Koltchinskii, V., Lounici, K., Tsybakov, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Annals of Statistics, 39(5), 2302–2329.

  • Li, H., Chen, N., Li, L. (2012). Error analysis for matrix elastic-net regularization algorithms. IEEE Transactions on Neural Networks and Learning Systems, 23(5), 737–748.

  • Liu, W., Mao, X., Wong, R. K. W. (2020). Median matrix completion: From embarrassment to optimality. Proceedings of the 37th International Conference on Machine Learning, Vol. 119, 6294–6304.

  • Mao, X., Chen, S. X., Wong, R. K. (2019). Matrix completion with covariate information. Journal of the American Statistical Association, 114(525), 198–210.

  • Mao, X., Wong, R. K., Chen, S. X. (2021). Matrix completion under low-rank missing mechanism. Statistica Sinica, 31(4), 2005–2030.

  • Mazumder, R., Hastie, T., Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11, 2287–2322.

  • Meng, X.-L. (1994). Multiple-imputation inferences with uncongenial sources of input. Statistical Science, 9, 538–558.

  • Molenberghs, G., Michiels, B., Kenward, M. G., Diggle, P. J. (1998). Monotone missing data and pattern-mixture models. Statistica Neerlandica, 52(2), 153–161.

  • Negahban, S., Wainwright, M. J. (2012). Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13(1), 1665–1697.

  • Nielsen, S. F. (2003). Proper and improper multiple imputation. International Statistical Review, 71, 593–607.

  • Pfeffermann, D. (1993). The role of sampling weights when modeling survey data. International Statistical Review, 61, 317–337.

  • Qin, J., Zhang, B., Leung, D. H. (2017). Efficient augmented inverse probability weighted estimation in missing data problems. Journal of Business & Economic Statistics, 35(1), 86–97.

  • Rao, J. N. K., Shao, J. (1999). Modified balanced repeated replication for complex survey data. Biometrika, 86(2), 403–415.

  • Robin, G., Klopp, O., Josse, J., Moulines, É., Tibshirani, R. (2020). Main effects and interactions in mixed and incomplete data frames. Journal of the American Statistical Association, 115(531), 1292–1303.

  • Robins, J. M., Rotnitzky, A., Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846–866.

  • Robins, J. M., Rotnitzky, A., Zhao, L. P. (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90, 106–121.

  • Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.

  • Rubin, D. B. (1978). Multiple imputations in sample surveys—a phenomenological Bayesian approach to nonresponse. Proceedings of the Survey Research Methods Section of the American Statistical Association, Vol. 1, 20–34. American Statistical Association.

  • Sengupta, N., Srebro, N., Evans, J. (2021). Simple surveys: Response retrieval inspired by recommendation systems. Social Science Computer Review, 39(1), 105–129.

  • Sun, T., Zhang, C.-H. (2012). Calibrated elastic regularization in matrix completion. Advances in Neural Information Processing Systems, 25, 863–871.

  • Sweeting, T. (1980). Uniform asymptotic normality of the maximum likelihood estimator. Annals of Statistics, 8, 1375–1381.

  • Tan, Z. (2013). Simple design-efficient calibration estimators for rejective and high-entropy sampling. Biometrika, 100(2), 399–415.

  • Tang, G., Little, R. J., Raghunathan, T. E. (2003). Analysis of multivariate missing data with nonignorable nonresponse. Biometrika, 90, 747–764.

  • van Buuren, S., Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67.

  • van der Linden, W. J., Hambleton, R. K. (2013). Handbook of modern item response theory. New York, NY: Springer.

  • Wang, N., Robins, J. M. (1998). Large-sample theory for parametric multiple imputation procedures. Biometrika, 85, 935–948.

  • Wang, S., Shao, J., Kim, J. K. (2014). An instrument variable approach for identification and estimation with nonignorable nonresponse. Statistica Sinica, 24(3), 1097–1116.

  • Wang, Z., Peng, L., Kim, J. K. (2022). Bootstrap inference for the finite population mean under complex sampling designs. Journal of the Royal Statistical Society: Series B (Statistical Methodology), accepted.

  • Wu, C. (2003). Optimal calibration estimators in survey sampling. Biometrika, 90(4), 937–951.

  • Yang, S., Kim, J. K. (2016). A note on multiple imputation for method of moments estimation. Biometrika, 103(1), 244–251.

  • Yang, S., Wang, L., Ding, P. (2019). Causal inference with confounders missing not at random. Biometrika, 106(4), 875–888.

  • Zhang, C., Taylor, S. J., Cobb, C., Sekhon, J. (2020). Active matrix factorization for surveys. Annals of Applied Statistics, 14(3), 1182–1206.

  • Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.


Acknowledgements

We are grateful to two referees and the Associate Editor for their constructive comments which have greatly improved the paper. Mao is partially supported by NSFC (No.: 12001109, 92046021) and the Science and Technology Commission of Shanghai Municipality grant 20dz1200600. Wang is partially supported by NSFC (No.: 11901487, 72033002) and the Fundamental Scientific Center of National Natural Science Foundation of China Grant No. 71988101. Yang is partly supported by the NSF DMS 1811245, NIH 1R01AG066883 and 1R01ES031651.

Author information

Corresponding author

Correspondence to Zhonglei Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 218 KB)

Appendices

Appendix

A Technical conditions

The technical conditions needed for our analysis are given as follows.

  1. C1

(a) The random errors \(\{\epsilon _{ij}:i=1,\ldots ,N;j=1,\ldots ,L\}\) in (2) are independently distributed random variables such that \(E(\epsilon _{ij})=0\) and \(E(\epsilon ^2_{ij})=\sigma _{ij}^2<\infty\) for all \(i\) and \(j\). (b) For some finite positive constants \(c_{\sigma }\) and \(\eta\), \({\max }_{i,j}E|\epsilon _{ij}|^{l}\le \frac{1}{2}l!c_{\sigma }^{2}\eta ^{l-2}\) for any positive integer \(l\ge 2\).

  2. C2

    The inclusion probability satisfies \(\pi _i\asymp nN^{-1}\) for \(i=1,\ldots ,N\).

  3. C3

The population design matrix \({\varvec{X}}_N\) is of size \(N\times d\) such that \(N>d\). Moreover, there exists a positive constant \(a_{x}\) such that \(\Vert {\varvec{X}}_N\Vert _{\infty }\le a_{x}\) and \({\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N\) is invertible, where \({\varvec{D}}_N\) is a diagonal matrix with \(\pi _i\) as its \((i,i)\)th entry. Furthermore, there exists a symmetric matrix \({\varvec{S}}_{{\varvec{X}}}\) with \(\sigma _{\min }({\varvec{S}}_{{\varvec{X}}})\asymp 1\asymp \Vert {\varvec{S}}_{{\varvec{X}}}\Vert\) such that \(n_0^{-1}{\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N\rightarrow {\varvec{S}}_{{\varvec{X}}}\) as \(N\rightarrow \infty\), where \(n_0=\sum _{i=1}^N\pi _i\) is the expected sample size.

  4. C4

    There exists a positive constant a such that \(\max \{\Vert {\varvec{X}}_N\varvec{\beta }^{*}\Vert _{\infty },\Vert {\varvec{A}}_N\Vert _{\infty }\}\le a\).

  5. C5

The indicators of observed entries \(\{r_{ij}:i=1,\ldots ,N;j=1,\ldots ,L\}\) are mutually independent with \(r_{ij}\sim \text {Bern}(p_{ij})\) for \(p_{ij}\in (0,1)\), and they are independent of \(\{\epsilon _{ij}\}_{i,j=1}^{N,L}\) given \({\varvec{X}}_N\). Furthermore, for \(i=1,\dots ,N\) and \(j=1,\dots ,L\), \(\Pr (r_{ij}=1 | {\varvec{x}}_{i}, y_{ij}) = \Pr (r_{ij}=1 | {\varvec{x}}_{i})\) follows the logistic regression model (7).

  6. C6

There exists a lower bound \(p_{\min }\in (0,1)\) such that \({\min }_{i,j}\{p_{ij}\}\ge p_{\min }>0\), where \(p_{\min }\) is allowed to depend on n and L. The number of questions satisfies \(L\le n\).

  7. C7

The sampling design satisfies \(N^{-1}\sum _{i=1}^NI_iy_i\pi _i^{-1} - N^{-1}\sum _{i=1}^Ny_i = O_p(n^{-1/2})\) if \(N^{-1}\sum _{i=1}^Ny_i^{2}\) is asymptotically bounded.

Condition C1(a) is a common regularity condition on the measurement errors in \(\varvec{\epsilon }_{N}\), and C1(b) is the Bernstein condition (Koltchinskii et al., 2011). Condition C2 is widely used in survey sampling and regulates the inclusion probabilities of a sampling design (Fuller, 2009). In Condition C3, the requirement \(N>d\) is easily met, as the number of questions in a survey is usually fixed and the population size is typically larger. Since the dimension of \(n_0^{-1}{\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N\) is fixed at \(d\times d\), it is mild to assume that \({\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N\) is invertible and that a symmetric matrix \({\varvec{S}}_{{\varvec{X}}}\) exists as the limit of \(n_0^{-1}{\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N\). Note that we do not assume randomness in generating \({\varvec{X}}_N\), which is a common convention in the design-based framework. Furthermore, the sample size is often larger than the number of questions, that is, \(n>d\), and together with Condition C2 it is not hard to show that the probability limit of \(n^{-1}{\varvec{X}}_n^{\mathrm{T}}{\varvec{X}}_n\) is also \({\varvec{S}}_{{\varvec{X}}}\) under regularity conditions. That \(\sigma _{\min }({\varvec{S}}_{{\varvec{X}}})\) and \(\Vert {\varvec{S}}_{{\varvec{X}}}\Vert\) are of order 1 follows from \(\Vert {\varvec{X}}_N\Vert _{\infty }<\infty\). Condition C4 is also standard in the matrix completion literature (Cai and Zhou, 2016; Koltchinskii et al., 2011; Negahban and Wainwright, 2012); in particular, it is reasonable to assume that all the responses are bounded in survey sampling. Condition C5 describes the independent Bernoulli model for the indicator of observing \(y_{ij}\), where the probability of observation \(p_{ij}\) follows the logistic model (7). In Condition C6, the lower bound \(p_{\min }\) is allowed to go to 0 as n and L grow. This condition is more general than needed for a typical survey, where \(p_{\min }\asymp 1\) suffices. Typically, the number of questions L grows more slowly than the number of participants n in survey sampling, so the assumption \(L\le n\) is quite mild. Condition C7 is a mild restriction on the estimator of the population mean, and it is satisfied under general sampling designs. To obtain general results, we make no assumption on the asymptotic relationship between the population size N and the sample size n; see Theorem 1 for details. Further assumptions on the sample sizes can guarantee certain convergence properties; see the discussion of Theorem 3.
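As a side note, the logistic response model (7) in Condition C5 is fitted column by column from the observed indicators. The following minimal numpy sketch (our own illustration, not the authors' code; the function name and the plain Newton/IRLS choice are assumptions) shows one way such a fit can be carried out:

```python
import numpy as np

def fit_logistic_irls(X, r, n_iter=25, tol=1e-8):
    """Fit Pr(r = 1 | x) = expit(x^T phi) by iteratively reweighted least squares.

    X : (n, d) covariate matrix for the sampled units
    r : (n,) response indicators in {0, 1} for one question
    """
    n, d = X.shape
    phi = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ phi)))   # current fitted probabilities
        grad = X.T @ (r - p)                   # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])  # Fisher information
        step = np.linalg.solve(hess, grad)     # Newton step
        phi += step
        if np.max(np.abs(step)) < tol:
            break
    return phi

# Estimated response probabilities for question j:
# p_hat_j = 1 / (1 + np.exp(-(X @ fit_logistic_irls(X, r_j))))
```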

B Lemmas

Under the logistic model (7), together with the results in Mao et al. (2019) and Sweeting (1980), it can be shown that for all \(t>d+3\), there exist some positive constants \(C_g\), C and \(C_{d}\) such that \(\Pr \{\sum _{ij}(1/\widehat{p}_{ij}-1/p_{ij})^{2}\ge C_g p_{\min }^{-3}t\}\le C_{d}t\exp \{-t/2\}+L\max _{j}\sup _{t}|\Pr \{\sum _{i}(1/\widehat{p}_{ij}-1/p_{ij})^{2}\ge t\}-\Pr (\chi ^{2}_{d+1}\ge C_g p_{\min }^{-3}t)|\). Then, \(\max _{j}\sup _{t}|\Pr \{\sum _{i}(1/\widehat{p}_{ij}-1/p_{ij})^{2}\ge t\}-\Pr (\chi ^{2}_{d+1}\ge C_g p_{\min }^{-3}t)|\le L^{-2}\) and \(C_{d}t\exp \{-t/2\}\) is a function independent of n and L such that \(\lim _{t\rightarrow \infty }t\exp \{-t/2\}= 0\).

Write \({\varvec{J}}_{ij}={\varvec{e}}_{i}(n_1){\varvec{e}}^\intercal _{j}(n_2)\), where \({\varvec{e}}_{i}(n)\in {\mathbb {R}}^n\) is the standard basis vector with the i-th element being 1 and the rest being 0. Now we present several lemmas.

Lemma 1

Under Conditions C2 and C3 and Poisson sampling, we have

$$\begin{aligned} n^{-1}{\varvec{X}}_n^{\mathrm{T}}{\varvec{X}}_n={\varvec{S}}_{{\varvec{X}}}+ o_p(1). \end{aligned}$$
(15)

Proof of Lemma 1

Let \({\varvec{e}}_i\) be the column vector of length d with the ith element being 1 and the others being 0. Recall that \(n_0 = \sum _{k=1}^N\pi _k\) is the expected sample size. For \(i=1,\ldots ,d\) and \(j=1,\ldots ,d\), consider

$$\begin{aligned} E( n_0^{-1}{\varvec{e}}_i^{\mathrm{T}}{\varvec{X}}_n^{\mathrm{T}}{\varvec{X}}_n{\varvec{e}}_j)&= n_0^{-1}E\left( \sum _{k=1}^NI_kx_{ki}x_{kj}\right) = n_0^{-1}\sum _{k=1}^N\pi _kx_{ki}x_{kj} \nonumber \\&= n_0^{-1}{\varvec{e}}_i^{\mathrm{T}}{\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N{\varvec{e}}_j, \end{aligned}$$
(16)

where the expectation is taken with respect to the sampling design, and \(x_{ki}\) is the \((k,i)\)th entry of \({\varvec{X}}_N\).

Under Poisson sampling, we have

$$\begin{aligned} \mathrm {var}( n_0^{-1}{\varvec{e}}_i^{\mathrm{T}}{\varvec{X}}_n^{\mathrm{T}}{\varvec{X}}_n{\varvec{e}}_j)&= n_0^{-2}\sum _{k=1}^N\pi _k(1-\pi _k)x_{k,i}x_{k,j}\nonumber \\ &< n_0^{-2}\sum _{k=1}^N\pi _kx_{k,i}x_{k,j}\nonumber \\ &= n_0^{-1}\left( n_0^{-1}{\varvec{e}}_i^{\mathrm{T}}{\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N{\varvec{e}}_j\right) . \end{aligned}$$
(17)

By an argument similar to that of Wang et al. (2019), we can show that \(n/n_0\rightarrow 1\) in probability as \(N\rightarrow \infty\) under Condition C2. Lemma 1 then follows from Condition C3, (16) and (17). \(\square\)
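Lemma 1 can also be checked numerically. The sketch below is our own illustration with a hypothetical population: the uniform covariates and the scaling of the inclusion probabilities are assumptions chosen to satisfy Conditions C2 and C3, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, n0 = 100_000, 3, 2_000

# Hypothetical finite population: bounded covariates (Condition C3) and
# inclusion probabilities of order n/N (Condition C2).
X_N = rng.uniform(-1.0, 1.0, size=(N, d))
pi = rng.uniform(0.5, 1.5, size=N) * n0 / N
pi *= n0 / pi.sum()                        # enforce sum(pi) = n0 exactly

S_X = (X_N * pi[:, None]).T @ X_N / n0     # n0^{-1} X_N^T D_N X_N

I = rng.random(N) < pi                     # Poisson sampling: independent draws
X_n = X_N[I]
n = I.sum()
print(np.abs(X_n.T @ X_n / n - S_X).max())  # small, consistent with (15)
```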

Lemma 2

Let \(\Psi ^{(1)}=\sum _{ij}r_{ij}\epsilon _{ij}{\varvec{J}}_{ij}/(nL\widehat{p}_{ij}\pi _{ij}^{1/2})\). Under Conditions C1, C5 and C6 and Poisson sampling, for some positive constants \(C_1\), \(c_{\sigma }\), \(\eta\), \(\delta _{\sigma }\) and all \(t>d+3\), we have

$$\begin{aligned}&\left\Vert \Psi ^{(1)}\right\Vert \\&\quad \le C_1\max \left\{ N^{1/2}n^{-1}L^{-1}\log ^{1/2}\left( n\right) p_{\min }^{-1/2},N^{1/2}n^{-5/4}L^{-1/4}\log ^{1/2}\left( L\right) \log ^{\delta _{\sigma }/4}\left( n\right) t^{1/2}p_{\min }^{-3/2}\right\} \end{aligned}$$

holds with probability at least \(1-1/(n+L)-C_{d}t\exp \{-t/2\}-1/L-12c_{\sigma }^{2}\eta ^{2}\log ^{-\delta _{\sigma }}(n)\).

Lemma 3

Let \(\Psi ^{(2)}=\sum _{ij}a_{ij}(r_{ij}/p_{ij}-1){\varvec{J}}_{ij}/(nL\pi _{ij}^{1/2})\). Under Conditions C4–C6 and Poisson sampling, for some positive constant \(C_2\), we have

$$\begin{aligned} \left\Vert \Psi ^{(2)}\right\Vert \le C_2N^{1/2}n^{-1}L^{-1}\log ^{1/2}\left( n\right) p_{\min }^{-1/2} \end{aligned}$$

holds with probability at least \(1-1/(n+L)\).

Lemma 4

Let \(\Psi ^{(3)}=\sum _{ij}a_{ij}(r_{ij}/{\widehat{p}}_{ij}-r_{ij}/p_{ij}){\varvec{J}}_{ij}/(nL\pi _{ij}^{1/2})\). Under Conditions C4 and C6 and Poisson sampling, for some positive constants \(C_3\), \(\delta _{\sigma }\) and all \(t>d+3\), we have

$$\begin{aligned} \left\Vert \Psi ^{(3)}\right\Vert \le C_3N^{1/2}n^{-5/4}L^{-1/4}\log ^{1/2}\left( L\right) \log ^{\delta _{\sigma }/4}\left( n\right) t^{1/2}p_{\min }^{-3/2} \end{aligned}$$

holds with probability at least \(1-C_{d}t\exp \{-t/2\}-1/L\).

Lemmas 2–4 follow easily from the proofs of Lemmas S4.1–S4.3 in the supplementary material of Mao et al. (2019).

C Proofs

C.1 Proof of Theorem 1

With the definition of \(\Delta (\delta _{\sigma },t)\) in (12), under Conditions C1–C6 and Poisson sampling, together with Lemmas 2–4, we have, for a positive constant \(C_0\),

$$\begin{aligned} \left\Vert \Psi ^{(1)}\right\Vert +\Vert \Psi ^{(2)}\Vert +\left\Vert \Psi ^{(3)}\right\Vert \le C_{0} \Delta (\delta _{\sigma },t), \end{aligned}$$

with probability at least \(1-2/n-2C_{d}t\exp \{-t/2\}-2/L-12c_{\sigma }^{2}\eta ^{2}\log ^{-\delta _{\sigma }}(n)\).

Let \({\varvec{X}}_n^{\prime }={\varvec{D}}_n^{-1/2}{\varvec{X}}_n\) and \(\eta _{n,L}(\delta _{\sigma },t) = 4/(n+L)+ 4C_{d}t\exp \{-t/2\}+4/L+C\log ^{-\delta _{\sigma }}(n)\) for a positive constant C. Choose t as in (13), \(\tau _1\asymp N^{-1}nL^{-1}\log ^{-1/2}(n)\Delta (\delta _{\sigma })\) and \(\tau _2\asymp \eta _{g}^{-1/2}N^{-1}n^{1/4}L^{-1/4}\log ^{1/2}(L)\log ^{\delta _{\sigma }/3}(n)\), where \(\Delta (\delta _{\sigma })=N^{1/2}n^{-1}L^{-1}\log ^{1/2}(n)p_{\min }^{-1/2}\) and \(1-\alpha \asymp (nL)^{-1}\) in (8) for any \(\delta _{\sigma }>0\). Together with Condition C2 and Poisson sampling, the same argument as in the proof of Corollary 1 in Mao et al. (2019) shows that, for some constants \(C_1\) and \(C_2\), with probability at least \(1-\eta _{n,L}(\delta _{\sigma },t)\),

$$\begin{aligned} \text {both}\quad \frac{1}{nL}\left\Vert {\varvec{X}}_n{\widehat{\varvec{\beta }}}^{\prime }-{\varvec{X}}_n\varvec{\beta }^{*\prime }\right\Vert _F^{2} \quad \text {and}\quad \frac{1}{nL}\left\Vert {\widehat{{\varvec{B}}}}_n^{\prime }-{\varvec{B}}_n^{*\prime }\right\Vert _F^{2}\le C_1r_{{\varvec{B}}_N} Nn^{-1}L^{-1}\log \left( n\right) p_{\min }^{-1}. \end{aligned}$$

Thus it is easy to obtain that

$$\begin{aligned} \frac{1}{mL}\left\Vert {\widehat{\varvec{\beta }}}^{\prime }-\varvec{\beta }^{*\prime }\right\Vert _F^{2}\le C_2r_{{\varvec{B}}_N} L^{-1}\log \left( n\right) p_{\min }^{-1}, \end{aligned}$$

under Condition C3. \(\square\)

C.2 Proof of Theorem 2

By the observation that

$$\begin{aligned} \left\Vert {\widehat{{\varvec{A}}}}_n-{\varvec{A}}_n\right\Vert _{F}^2\le \left\Vert {\varvec{X}}_n{\widehat{\varvec{\beta }}}^{\prime }-{\varvec{X}}_n\varvec{\beta }^{*\prime }\right\Vert _F^{2}+\left\Vert {\varvec{D}}_n^{-1/2}{\widehat{{\varvec{B}}}}_n^{\prime }-{\varvec{D}}_n^{-1/2}{\varvec{B}}_n^{*\prime }\right\Vert _F^{2}, \end{aligned}$$

together with Theorem 1, the result follows under Condition C2 and Poisson sampling. \(\square\)

C.3 Proof of Theorem 3

Denote

$$\begin{aligned}&{\tilde{\theta }}_{j,{\textit{AIPW}}} = N^{-1}\sum _{i=1}^N\frac{I_i}{\pi _i}\left\{ \frac{r_{ij}(y_{ij} - {a}_{ij})}{{p}_{ij}} + {a}_{ij}\right\} , \end{aligned}$$
(18)
$$\begin{aligned}&{\theta }^\dag _{j,{\textit{AIPW}}} = N^{-1}\sum _{i=1}^N\frac{I_i}{\pi _i}\left\{ \frac{r_{ij}(y_{ij} - \widehat{a}_{ij})}{{p}_{ij}} + \widehat{a}_{ij}\right\} \end{aligned}$$
(19)

for \(j=1,\ldots ,L\). The difference between \(\widehat{\theta }_{j,{\textit{AIPW}}}\) in (10) and \({\tilde{\theta }}_{j,{\textit{AIPW}}}\) in (18) is that we use estimators \(\widehat{N}\), \(\widehat{p}_{ij}\) and \(\widehat{a}_{ij}\) for \(\widehat{\theta }_{j,{\textit{AIPW}}}\) but use true values N, \(p_{ij}\) and \(a_{ij}\) for \({\tilde{\theta }}_{j,{\textit{AIPW}}}\). The difference between \(\widehat{\theta }_{j,{\textit{AIPW}}}\) and \({\theta }^\dag _{j,{\textit{AIPW}}}\) in (19) is that we use \(\widehat{N}\) for \(\widehat{\theta }_{j,{\textit{AIPW}}}\), but use N for \({\theta }^\dag _{j,{\textit{AIPW}}}\).
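For concreteness, a minimal numpy sketch of the point estimator \(\widehat{\theta }_{j,{\textit{AIPW}}}\) is given below. It is our own illustration: the function name is hypothetical, and taking \(\widehat{N}=\sum _{i\in S}\pi _i^{-1}\), the Horvitz–Thompson estimator of N, is an assumption about the form of (10).

```python
import numpy as np

def aipw_mean(y_j, r_j, a_hat_j, p_hat_j, pi):
    """AIPW estimate of theta_j computed from the n sampled units.

    y_j     : responses to question j (arbitrary, e.g. NaN, where r_j == 0)
    r_j     : response indicators r_{ij} in {0, 1}
    a_hat_j : fitted values a_hat_{ij} from the matrix-completion step
    p_hat_j : estimated response probabilities p_hat_{ij} from model (7)
    pi      : first-order inclusion probabilities pi_i
    """
    N_hat = np.sum(1.0 / pi)                         # assumed form of N_hat
    resid = np.where(r_j == 1, y_j - a_hat_j, 0.0)   # r_{ij}(y_{ij} - a_hat_{ij})
    return np.sum((resid / p_hat_j + a_hat_j) / pi) / N_hat
```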

First, we prove

$$\begin{aligned} {\tilde{\theta }}_{j,{\textit{AIPW}}} - \theta _j = O_p(n^{-1/2}). \end{aligned}$$
(20)

Consider

$$\begin{aligned} E( {\tilde{\theta }}_{j,{\textit{AIPW}}}) = E\{E({\tilde{\theta }}_{j,{\textit{AIPW}}}\mid \{I_i\})\} = N^{-1}\sum _{i=1}^N\frac{E(I_i)}{\pi _i}y_{ij} = \theta _j, \end{aligned}$$
(21)

where the second equality holds because \(E(r_{ij}) =p_{ij}\), so the inner conditional expectation reduces to \(N^{-1}\sum _{i=1}^NI_i\pi _i^{-1}y_{ij}\), and the last equality uses \(E(I_i)=\pi _i\). Next, we derive the variance of \({\tilde{\theta }}_{j,{\textit{AIPW}}}\). Specifically, we have

$$\begin{aligned} \mathrm {var}({\tilde{\theta }}_{j,{\textit{AIPW}}})&= \frac{1}{N^2}E\left( \mathrm {var}\left[ \sum _{i=1}^N\frac{I_i}{\pi _i}\left\{ \frac{r_{ij}(y_{ij}-a_{ij})}{p_{ij}} + a_{ij}\right\} \mid S\right] \right) \nonumber \\&\quad + \frac{1}{N^2}\mathrm {var}\left( E\left[ \sum _{i=1}^N\frac{I_i}{\pi _i}\left\{ \frac{r_{ij}(y_{ij}-a_{ij})}{p_{ij}} + a_{ij}\right\} \mid S\right] \right) = V_{1,j} + V_{2,j}, \end{aligned}$$
(22)

where \(S=\{I_i:i=1,\ldots ,N\}\).

Because \(E(r_{ij}) = p_{ij}\), we have

$$\begin{aligned} \mathrm {var}\left[ \sum _{i=1}^N\frac{I_i}{\pi _i}\left\{ \frac{r_{ij}(y_{ij}-a_{ij})}{p_{ij}} + a_{ij}\right\} \mid S\right] = \sum _{i=1}^N\frac{I_i(1-p_{ij})}{\pi _i^2p_{ij}}(y_{ij}-a_{ij})^2. \end{aligned}$$

Thus, we have

$$\begin{aligned} V_{1,j} = \frac{1}{N^2}\sum _{i=1}^N\frac{1-p_{ij}}{\pi _{i}p_{ij}}(y_{ij}-a_{ij})^2=O_p(n^{-1}), \end{aligned}$$
(23)

where the last equality holds by Conditions C1, C2 and the strong law of large numbers (Athreya and Lahiri, 2006). Notice that

$$\begin{aligned} E\left[ \sum _{i=1}^N\frac{I_i}{\pi _i}\left\{ \frac{r_{ij}(y_{ij}-a_{ij})}{p_{ij}} + a_{ij}\right\} \mid S\right] =\sum _{i=1}^N\frac{I_iy_{ij}}{\pi _i}. \end{aligned}$$

By the models (1)–(2) and Condition C4, we can show that \(N^{-1}\sum _{i=1}^Ny_i^2\) is asymptotically bounded. Thus, by Condition C7, we have

$$\begin{aligned} V_{2,j}=O_p(n^{-1}) \end{aligned}$$
(24)

By (21)–(24), we have shown (20).

Next, we show that

$$\begin{aligned} \frac{1}{L}\sum _{j=1}^L({\theta }^\dag _{j,{\textit{AIPW}}} - {\tilde{\theta }}_{ j,{\textit{AIPW}}})^2=O_p\{r_{{\varvec{B}}_N}L^{-1}\log \left( n\right) \}. \end{aligned}$$
(25)

Consider

$$\begin{aligned} {\theta }^\dag _{j,{\textit{AIPW}}} - {\tilde{\theta }}_{ j,{\textit{AIPW}}} = \frac{1}{N}\sum _{i=1}^N\frac{I_i}{\pi _{i}}\left\{ \frac{r_{ij}(y_{ij}-a_{ij})(p_{ij}-\widehat{p}_{ij})}{p_{ij}\widehat{p}_{ij}} + \frac{(r_{ij}-\widehat{p}_{ij})(a_{ij}-\widehat{a}_{ij})}{\widehat{p}_{ij}}\right\} . \end{aligned}$$
(26)

Consider

$$\begin{aligned} E\left\{ \frac{1}{N}\sum _{i\in S}\pi _{i}^{-1}\frac{r_{ij}(y_{ij}-a_{ij})}{p_{ij}} \right\}&= \frac{1}{N}\sum _{i=1}^N(y_{ij}-a_{ij})=O_p(N^{-1/2}), \end{aligned}$$
(27)
$$\begin{aligned} \mathrm {var}\left\{ \frac{1}{N}\sum _{i\in S}\pi _{i}^{-1}\frac{r_{ij}(y_{ij}-a_{ij})}{p_{ij}} \right\}&= O_p(n^{-1}), \end{aligned}$$
(28)

where the asymptotic order in (28) holds by Condition C7 and the fact that \(N^{-1}\sum _{i=1}^N(y_{ij}-a_{ij})^2\) is asymptotically bounded in probability, since \(\{\epsilon _{ij}:i=1,\ldots ,N\}\) are independent and their variances are uniformly bounded. By (27) and (28), we have

$$\begin{aligned} \frac{1}{N}\sum _{i=1}^N\frac{I_i}{\pi _{i}}\frac{r_{ij}(y_{ij}-a_{ij})}{p_{ij}} = O_p(n^{-1/2} ). \end{aligned}$$
(29)

Because the response model (7) for \(p_{ij}\) is assumed to be correctly specified, and \(p_{ij}\) is bounded away from 0 by Condition C6, we have \(\widehat{p}_{ij}^{-1}(p_{ij}-\widehat{p}_{ij}) = O_p(1)\) uniformly for \(i=1,\ldots ,N\). Thus, by (29), we have

$$\begin{aligned} N^{-1}\sum _{i=1}^N\frac{I_i}{\pi _{i}}\frac{r_{ij}(y_{ij}-a_{ij})(p_{ij}-\widehat{p}_{ij})}{p_{ij}\widehat{p}_{ij}} =O_p(n^{-1/2}). \end{aligned}$$
(30)

By (26) and (30), we have

$$\begin{aligned} {\theta }^\dag _{j,{\textit{AIPW}}} - {\tilde{\theta }}_{ j,{\textit{AIPW}}} = O_p(n^{-1/2}) + \frac{1}{N}\sum _{i=1}^N\frac{I_i}{\pi _i}\frac{(r_{ij}-\widehat{p}_{ij})(a_{ij}-\widehat{a}_{ij})}{\widehat{p}_{ij}}. \end{aligned}$$
(31)

Since \(\widehat{p}_{ij}=p_{ij}+o_p(1)\), we have

$$\begin{aligned} \frac{r_{ij}-\widehat{p}_{ij}}{\widehat{p}_{ij}}=\{1+o_p(1)\}\frac{r_{ij}-p_{ij}}{p_{ij}}. \end{aligned}$$
(32)

Since \(p_{ij}\ge p_{\mathrm {min}}>0\) by Condition C6, we have

$$\begin{aligned} \frac{r_{ij}-\widehat{p}_{ij}}{\widehat{p}_{ij}}= O_p(1) \end{aligned}$$
(33)

uniformly for \(i=1,\ldots ,N\).

Thus, by Condition C2, (31) and (33), we have

$$\begin{aligned} \frac{1}{L}\sum _{j=1}^L({\theta }^\dag _{j,{\textit{AIPW}}} - {\tilde{\theta }}_{ j,{\textit{AIPW}}})^2&\le O_p(n^{-1})+ \frac{C_6 O_p(1)}{Ln^2}\sum _{j=1}^L\left\{ \sum _{i=1}^n(a_{ij} - \widehat{a}_{ij}) \right\} ^2,\nonumber \\ &\le O_p(n^{-1})+ \frac{ O_p(1)}{Ln}\sum _{j=1}^L\sum _{i=1}^n(a_{ij} - \widehat{a}_{ij})^2\nonumber \\ &=O_p(n^{-1})+ \frac{O_p(1)}{nL}\left\Vert \widehat{A}_n-A_n\right\Vert _F^2, \end{aligned}$$
(34)

where \(\pi _i\ge C_6^{-1}nN^{-1}\) for some \(C_6>0\) by Condition C2, and without loss of generality we have assumed that the first n subjects are sampled. By (20), (34), Theorem 2 and the fact that \(L\le n\), we have proved (25).

By Condition C4 and the fact that \(E(\epsilon ^2_{ij})<\sigma _0^2\) uniformly, \(\theta _j\) is uniformly bounded for \(j=1,\ldots ,L\) in probability. Thus, by (20) and (25), we conclude that

$$\begin{aligned} \frac{1}{L}\sum _{j=1}^L({\theta }^\dag _{j,{\textit{AIPW}}})^2 = O_p\{r_{{\varvec{B}}_N}L^{-1}\log \left( n\right) \} \end{aligned}$$
(35)

By Condition C7, we conclude that \(\widehat{N}N^{-1}=1+O_p(n^{-1/2})\). Consider

$$\begin{aligned} \frac{1}{L}\sum _{j=1}^L(\widehat{\theta }_{j,{\textit{AIPW}}} - {\theta }^\dag _{ j,{\textit{AIPW}}})^2 = \frac{O_p(n^{-1})}{L}\sum _{j=1}^L({\theta }^\dag _{ j,{\textit{AIPW}}})^2 = o_p\{r_{{\varvec{B}}_N}L^{-1}\log \left( n\right) \}, \end{aligned}$$
(36)

where the first equality holds since \(\widehat{N}N^{-1}=1+O_p(n^{-1/2})\) uniformly for \(j=1,\ldots ,L\), and the second equality holds by (35). Thus, by (20), (25) and (36), we have proved Theorem 3. \(\square\)

D Plug-in variance estimators

When deriving the plug-in variance estimator, we ignore the variability from estimating \(\widehat{a}_{ij}\). First, we consider the plug-in variance estimator under Poisson sampling. For \(j=1,\ldots ,L\), let

$$\begin{aligned} g_j(\theta ) = \frac{1}{N}\sum _{i=1}^{N}\frac{I_i}{\pi _i}\left\{ \frac{r_{ij}(y_{ij}-{a}_{ij})}{{p}_{ij}}+{a}_{ij}-\theta \right\} \end{aligned}$$

be the estimating function for the AIPW estimator with \(\widehat{a}_{ij}\) replaced by \({a}_{ij}\). Let \({\tilde{\theta }}_{j,{\textit{AIPW}}}\) solve \(g_j(\theta )=0\); we use a variance estimator of \({\tilde{\theta }}_{j,{\textit{AIPW}}}\) to approximate that of \(\widehat{\theta }_{j,{\textit{AIPW}}}\).

It can be shown that \({\tilde{\theta }}_{j,{\textit{AIPW}}}-\theta _j=O_p(n^{-1/2})\), so we have

$$\begin{aligned} 0 = g_j({\tilde{\theta }}_{j,{\textit{AIPW}}}) = g_j(\theta _j) + g_j'(\theta _j) ({\tilde{\theta }}_{j,{\textit{AIPW}}}-\theta _j) + o_p(n^{-1/2}), \end{aligned}$$
(37)

where \(g_j'(\theta ) = -N^{-1}\sum _{i=1}^NI_i\pi _i^{-1}\) is the derivative of \(g_j(\theta )\). By an argument similar to that for (27)–(28), we can show that \(g_j'(\theta _j)\rightarrow -1\) in probability. Then, by (37), we have

$$\begin{aligned} ({\tilde{\theta }}_{j,{\textit{AIPW}}}-\theta _j) = -\{g_j'(\theta _j)\}^{-1}g_j(\theta _j) + o_p(n^{-1/2}) = g_j(\theta _j)+ o_p(n^{-1/2}). \end{aligned}$$

Thus, the variance of \({\tilde{\theta }}_{j,{\textit{AIPW}}}\) can be estimated by that of \(g_j(\theta _j)\).

Consider

$$\begin{aligned} \mathrm {var}\{g_j(\theta _j)\}&= N^{-2}E\left( \mathrm {var}\left[ \sum _{i=1}^N\frac{I_i}{\pi _i}\left\{ \frac{r_{ij}(y_{ij}-a_{ij})}{p_{ij}} + a_{ij}-\theta _j\right\} \mid S\right] \right) \nonumber \\&\quad + N^{-2}\mathrm {var}\left( E\left[ \sum _{i=1}^N\frac{I_i}{\pi _i}\left\{ \frac{r_{ij}(y_{ij}-a_{ij})}{p_{ij}} + a_{ij}-\theta _j\right\} \mid S\right] \right) \nonumber \\&= V_{1,j} + V_{2,j}. \end{aligned}$$
(38)

Since \(E(r_{ij}) = p_{ij}\), we have

$$\begin{aligned} \mathrm {var}\left[ \sum _{i=1}^N\frac{I_i}{\pi _i}\left\{ \frac{r_{ij}(y_{ij}-a_{ij})}{p_{ij}} + a_{ij}-\theta _j\right\} \mid S\right] = \sum _{i=1}^N\frac{I_i(1-p_{ij})}{\pi _i^2p_{ij}}(y_{ij}-a_{ij})^2. \end{aligned}$$

Thus, we have

$$\begin{aligned} V_{1,j} = N^{-2}\sum _{i=1}^N\frac{1-p_{ij}}{\pi _ip_{ij}}(y_{ij}-a_{ij})^2, \end{aligned}$$
(39)

and it can be estimated by

$$\begin{aligned} \widehat{V}_{1,j} = N^{-2}\sum _{i=1}^N\frac{r_{ij}(1-p_{ij})}{\pi _i^2p_{ij}^2}(y_{ij}-a_{ij})^2. \end{aligned}$$
(40)

Notice that

$$\begin{aligned} E\left[ \sum _{i=1}^N\frac{I_i}{\pi _i}\left\{ \frac{r_{ij}(y_{ij}-a_{ij})}{p_{ij}} + a_{ij}-\theta _j\right\} \mid S\right] =\sum _{i=1}^N\frac{I_i}{\pi _i}(y_{ij}-\theta _j). \end{aligned}$$

Under Poisson sampling,

$$\begin{aligned} \mathrm {var}\left( E\left[ \sum _{i=1}^N\frac{I_i}{\pi _i}\left\{ \frac{r_{ij}(y_{ij}-a_{ij})}{p_{ij}} + a_{ij}-\theta _j\right\} \mid S\right] \right) = \sum _{i=1}^N\frac{1-\pi _i}{\pi _i}(y_{ij}-\theta _j)^2. \end{aligned}$$

Thus,

$$\begin{aligned} V_{2,j} = N^{-2} \sum _{i=1}^N\frac{1-\pi _i}{\pi _i}(y_{ij}-\theta _j)^2, \end{aligned}$$
(41)

and it can be estimated by

$$\begin{aligned} \widehat{V}_{2,j} = N^{-2}\sum _{i=1}^N\frac{I_ir_{ij}(1-\pi _i)}{p_{ij}\pi _i^2}(y_{ij}-\theta _j)^2. \end{aligned}$$
(42)

By (38)–(42) and plugging in \(\widehat{a}_{ij}\) and \(\widehat{\theta }_{j,{\textit{AIPW}}}\) for \(a_{ij}\) and \(\theta _j\), respectively, the plug-in variance estimator for \({\widehat{\theta }}_{j,{\textit{AIPW}}}\) is

$$\begin{aligned} \widehat{V}_{j,poi}&= N^{-2}\sum _{i=1}^N\frac{I_ir_{ij}(1-p_{ij})}{\pi _i^2p_{ij}^2}(y_{ij}-\widehat{a}_{ij})^2\\&\quad + N^{-2}\sum _{i=1}^N\frac{I_ir_{ij}(1-\pi _i)}{p_{ij}\pi _i^2}(y_{ij}-\widehat{\theta }_{j,{\textit{AIPW}}})^2 \end{aligned}$$

under Poisson sampling.

Using a similar argument, we can show that the plug-in variance estimator is

$$\begin{aligned} {\hat{V}}_{ j,srs}&= n^{-2}\sum _{i=1}^n\frac{I_ir_{ij}(1-{\hat{p}}_{ij})}{{\hat{p}}_{ij}^2}(y_{ij}-{\widehat{a}}_{ij})^2+ n^{-1}(1-nN^{-1})\\&\quad \left[ n^{-1}\sum _{i=1}^N\frac{I_ir_{ij}(y_{ij}-\widehat{\theta }_{j,{\textit{AIPW}}})^2}{{\hat{p}}_{ij}} - \left( n^{-1}\sum _{i=1}^N\frac{I_ir_{ij}(y_{ij}-\widehat{\theta }_{j,{\textit{AIPW}}})}{{\hat{p}}_{ij}}\right) ^2 \right] \end{aligned}$$

under simple random sampling, and it is

$$\begin{aligned} {\hat{V}}_{ j,pps}&= N^{-2}\sum _{i\in S}\frac{r_{ij}(1-{\hat{p}}_{ij})}{(nq_i)^2{\hat{p}}_{ij}^2}(y_{ij}-{\widehat{a}}_{ij})^2 + \{n(n-1)\}^{-1}\\&\quad \left\{ \sum _{i\in S}\frac{r_{ij}(y_{ij}-\widehat{\theta }_{j,{\textit{AIPW}}})^2}{{\hat{p}}_{ij}q_i^2} - n^{-1}\left( \sum _{i\in S}\frac{r_{ij}(y_{ij}-\widehat{\theta }_{j,{\textit{AIPW}}})}{{\hat{p}}_{ij}q_i}\right) ^2\right\} \end{aligned}$$

under probability-proportional-to-size sampling, where S is the index set of the sample, and \(q_i\) is the selection probability of the ith element.
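For reference, the Poisson-sampling estimator \(\widehat{V}_{j,poi}\) above can be transcribed directly into code. The sketch below is our own illustration, with all inputs restricted to the n sampled units so that the \(I_i\) factors are implicit, and with \({\hat{p}}_{ij}\) plugged in for \(p_{ij}\):

```python
import numpy as np

def plugin_var_poisson(y_j, r_j, a_hat_j, p_hat_j, pi, theta_hat_j, N):
    """Plug-in variance estimate V_hat_{j,poi} under Poisson sampling."""
    obs = r_j == 1                      # only observed entries contribute
    y, a, p, w = y_j[obs], a_hat_j[obs], p_hat_j[obs], pi[obs]
    v1 = np.sum((1.0 - p) / (w**2 * p**2) * (y - a) ** 2) / N**2
    v2 = np.sum((1.0 - w) / (p * w**2) * (y - theta_hat_j) ** 2) / N**2
    return v1 + v2
```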

E Balanced repeated replication method

Consider a stratified multi-stage sampling design with two clusters selected per stratum in the first stage. Denote by \(w_{hik}\) the survey weight associated with \(y_{hik}\), the kth sample element in the ith cluster of the hth stratum. The basic idea of the modified balanced repeated replication method is to apply the same estimation procedure to the reconstructed survey weights, that is,

$$\begin{aligned} w_{hik}^{(r)} = w_{hik} (1+\epsilon \delta _{rh}), \end{aligned}$$

where \(\epsilon \in (0,1)\) is a predefined constant, and \(\delta _{rh}=1\) or \(\delta _{rh} = -1\) for the rth repetition. A set of R repetitions is said to be balanced if \(\sum _{r=1}^R\delta _{rh}\delta _{rh'}=0\) for \(h\ne h'\). The \(R\times H\) matrix \((\delta _{rh})_{R\times H}\) can be obtained from a Hadamard matrix; see Rao and Shao (1999) for details about the modified balanced repeated replication method. In the simulation study and real data application, we choose \(\epsilon = 1/2\) and \(R = H\). A sketch of the weight construction is given below.
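The following is our own illustrative sketch, not the authors' implementation. Since scipy's Hadamard routine requires a power-of-two order, the sketch relaxes the \(R = H\) choice above to the smallest admissible order and skips the constant first column, which preserves the balance condition \(\sum _{r}\delta _{rh}\delta _{rh'}=0\) for \(h\ne h'\).

```python
import numpy as np
from scipy.linalg import hadamard

def brr_replicate_weights(w, stratum, eps=0.5):
    """Replicate weights w^{(r)}_{hik} = w_{hik} (1 + eps * delta_{rh}).

    w       : base survey weights, one entry per sample element
    stratum : integer stratum labels h in {0, ..., H-1}, one per element
    """
    H = int(stratum.max()) + 1
    R = 1 << H.bit_length()             # smallest power of two with R >= H + 1
    delta = hadamard(R)[:, 1:H + 1]     # R x H matrix of balanced +/-1 signs
    # One row of replicate weights per repetition r = 1, ..., R.
    return w[None, :] * (1.0 + eps * delta[:, stratum])

# Re-running the point estimator on each of the R rows and taking the
# spread of the replicate estimates yields the BRR variance estimate.
```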

About this article


Cite this article

Mao, X., Wang, Z. & Yang, S. Matrix completion under complex survey sampling. Ann Inst Stat Math 75, 463–492 (2023). https://doi.org/10.1007/s10463-022-00851-5
