Abstract
Multivariate nonresponse is often encountered in complex survey sampling, and simply ignoring it leads to erroneous inference. In this paper, we propose a new matrix completion method for complex survey sampling. Unlike existing methods that conduct either row-wise or column-wise imputation, the proposed method treats the data matrix as a whole, which allows row and column patterns to be exploited simultaneously. A column-space-decomposition model is adopted, incorporating a low-rank structured matrix for the finite population with easy-to-obtain demographic information as covariates. In addition, we propose a computationally efficient projection strategy to identify the model parameters under complex survey sampling. An augmented inverse probability weighting estimator is then used to estimate the parameter of interest, and the corresponding asymptotic upper bound of the estimation error is derived. Simulation studies show that the proposed estimator has a smaller mean squared error than its competitors, and the corresponding variance estimator performs well. The proposed method is applied to assess the health status of the U.S. population.
References
Alaya, M. Z., Klopp, O. (2019). Collective matrix completion. Journal of Machine Learning Research, 20(148), 1–43.
Andridge, R. R., Little, R. J. (2010). A review of hot deck imputation for survey non-response. International Statistical Review, 78(1), 40–64.
Athreya, K. B., Lahiri, S. N. (2006). Measure theory and probability theory. New York: Springer.
Bi, X., Qu, A., Wang, J., Shen, X. (2017). A group-specific recommender system. Journal of the American Statistical Association, 112(519), 1344–1353.
Cai, T. T., Zhou, W.-X. (2016). Matrix completion via max-norm constrained optimization. Electronic Journal of Statistics, 10(1), 1493–1525.
Candès, E. J., Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6), 717–772.
Carpentier, A., Kim, A. K. (2018). An iterative hard thresholding estimator for low rank matrix recovery with explicit limiting distribution. Statistica Sinica, 28, 1371–1393.
Chang, T., Kott, P. S. (2008). Using calibration weighting to adjust for nonresponse under a plausible model. Biometrika, 95, 555–571.
Chen, J., Shao, J. (2000). Nearest neighbor imputation for survey data. Journal of Official Statistics, 16, 113–131.
Chen, T. C., Clark, J., Riddles, M. K., Mohadjer, L. K., Fakhouri, T. H. I. (2020). National health and nutrition examination survey, 2015–2018: Sample design and estimation procedures. National Center for Health Statistics. Vital Health Stat, 2(184), 1–26.
Chen, Y., Fan, J., Ma, C., Yan, Y. (2019). Inference and uncertainty quantification for noisy matrix completion. Proceedings of the National Academy of Sciences, 116(46), 22931–22937.
Chen, Y., Li, P., Wu, C. (2020). Doubly robust inference with nonprobability survey samples. Journal of the American Statistical Association, 115(532), 2011–2021.
Clogg, C. C., Rubin, D. B., Schenker, N., Schultz, B., Weidman, L. (1991). Multiple imputation of industry and occupation codes in census public-use samples using Bayesian logistic regression. Journal of the American Statistical Association, 86, 68–78.
Davenport, M. A., Romberg, J. (2016). An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4), 608–622.
Davenport, M. A., Plan, Y., van den Berg, E., Wootters, M. (2014). 1-bit matrix completion. Information and Inference, 3(3), 189–223.
Elliott, M. R., Valliant, R. (2017). Inference for nonprobability samples. Statistical Science, 32(2), 249–264.
Fan, J., Gong, W., Zhu, Z. (2019). Generalized high-dimensional trace regression via nuclear norm regularization. Journal of Econometrics, 212(1), 177–202.
Fay, R. E. (1992). When are inferences from multiple imputation valid? Proceedings of the survey research methods section of the American Statistical Association, 227–232. American Statistical Association.
Fletcher Mercaldo, S., Blume, J. D. (2018). Missing data and prediction: The pattern submodel. Biostatistics, 21(2), 236–252.
Foucart, S., Needell, D., Plan, Y., Wootters, M. (2017). De-biasing low-rank projection for matrix completion. Wavelets and sparsity XVII, Vol. 10394, p. 1039417. International Society for Optics and Photonics.
Fuller, W. A. (2009). Sampling statistics. Hoboken, NJ: Wiley.
Fuller, W. A., Kim, J. K. (2005). Hot deck imputation for the response model. Survey Methodology, 31, 139.
Harchaoui, Z., Douze, M., Paulin, M., Dudik, M., Malick, J. (2012). Large-scale image classification with trace-norm regularization. 2012 IEEE conference on computer vision and pattern recognition, 3386–3393. IEEE.
Horvitz, D. G., Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260), 663–685.
Isaki, C. T., Fuller, W. A. (1982). Survey design under the regression superpopulation model. Journal of the American Statistical Association, 77, 89–96.
Keiding, N., Louis, T. A. (2016). Perils and potentials of self-selected entry to epidemiological studies and surveys. Journal of the Royal Statistical Society: Series A (Statistics in Society), 179, 319–376.
Kim, E., Lee, M., Oh, S. (2015). Elastic-net regularization of singular values for robust subspace learning. Proceedings of the IEEE conference on computer vision and pattern recognition, 915–923.
Kim, J. K., Fuller, W. (2004). Fractional hot deck imputation. Biometrika, 91, 559–578.
Kim, J. K., Yu, C. L. (2011). A semiparametric estimation of mean functionals with nonignorable missing data. Journal of the American Statistical Association, 106, 157–165.
Kim, J. K., Brick, J., Fuller, W. A., Kalton, G. (2006). On the bias of the multiple-imputation variance estimator in survey sampling. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68, 509–521.
Koltchinskii, V., Lounici, K., Tsybakov, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Annals of Statistics, 39(5), 2302–2329.
Li, H., Chen, N., Li, L. (2012). Error analysis for matrix elastic-net regularization algorithms. IEEE Transactions on Neural Networks and Learning Systems, 23(5), 737–748.
Liu, W., Mao, X., Wong, R. K. W. (2020). Median matrix completion: From embarrassment to optimality. Proceedings of the 37th International Conference on Machine Learning, Vol. 119, 6294–6304.
Mao, X., Chen, S. X., Wong, R. K. (2019). Matrix completion with covariate information. Journal of the American Statistical Association, 114(525), 198–210.
Mao, X., Wong, R. K., Chen, S. X. (2021). Matrix completion under low-rank missing mechanism. Statistica Sinica, 31(4), 2005–2030.
Mazumder, R., Hastie, T., Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11, 2287–2322.
Meng, X.-L. (1994). Multiple-imputation inferences with uncongenial sources of input. Statistical Science, 9, 538–558.
Molenberghs, G., Michiels, B., Kenward, M. G., Diggle, P. J. (1998). Monotone missing data and pattern-mixture models. Statistica Neerlandica, 52(2), 153–161.
Negahban, S., Wainwright, M. J. (2012). Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13(1), 1665–1697.
Nielsen, S. F. (2003). Proper and improper multiple imputation. International Statistical Review, 71, 593–607.
Pfeffermann, D. (1993). The role of sampling weights when modeling survey data. International Statistical Review, 61, 317–337.
Qin, J., Zhang, B., Leung, D. H. (2017). Efficient augmented inverse probability weighted estimation in missing data problems. Journal of Business & Economic Statistics, 35(1), 86–97.
Rao, J. N. K., Shao, J. (1999). Modified balanced repeated replication for complex survey data. Biometrika, 86(2), 403–415.
Robin, G., Klopp, O., Josse, J., Moulines, É., Tibshirani, R. (2020). Main effects and interactions in mixed and incomplete data frames. Journal of the American Statistical Association, 115(531), 1292–1303.
Robins, J. M., Rotnitzky, A., Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846–866.
Robins, J. M., Rotnitzky, A., Zhao, L. P. (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90, 106–121.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.
Rubin, D. B. (1978). Multiple imputations in sample surveys—a phenomenological Bayesian approach to nonresponse. Proceedings of the survey research methods section of the American Statistical Association, Vol. 1, 20–34. American Statistical Association.
Sengupta, N., Srebro, N., Evans, J. (2021). Simple surveys: Response retrieval inspired by recommendation systems. Social Science Computer Review, 39(1), 105–129.
Sun, T., Zhang, C.-H. (2012). Calibrated elastic regularization in matrix completion. Advances in Neural Information Processing Systems, 25, 863–871.
Sweeting, T. (1980). Uniform asymptotic normality of the maximum likelihood estimator. Annals of Statistics, 8(6), 1375–1381.
Tan, Z. (2013). Simple design-efficient calibration estimators for rejective and high-entropy sampling. Biometrika, 100(2), 399–415.
Tang, G., Little, R. J., Raghunathan, T. E. (2003). Analysis of multivariate missing data with nonignorable nonresponse. Biometrika, 90, 747–764.
van Buuren, S., Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67.
van der Linden, W. J., Hambleton, R. K. (2013). Handbook of modern item response theory. New York, NY: Springer.
Wang, N., Robins, J. M. (1998). Large-sample theory for parametric multiple imputation procedures. Biometrika, 85, 935–948.
Wang, S., Shao, J., Kim, J. K. (2014). An instrument variable approach for identification and estimation with nonignorable nonresponse. Statistica Sinica, 24(3), 1097–1116.
Wang, Z., Peng, L., Kim, J. K. (2022). Bootstrap inference for the finite population mean under complex sampling designs. Journal of the Royal Statistical Society: Series B (Statistical Methodology), Accepted.
Wu, C. (2003). Optimal calibration estimators in survey sampling. Biometrika, 90(4), 937–951.
Yang, S., Kim, J. K. (2016). A note on multiple imputation for method of moments estimation. Biometrika, 103(1), 244–251.
Yang, S., Wang, L., Ding, P. (2019). Causal inference with confounders missing not at random. Biometrika, 106(4), 875–888.
Zhang, C., Taylor, S. J., Cobb, C., Sekhon, J. (2020). Active matrix factorization for surveys. Annals of Applied Statistics, 14(3), 1182–1206.
Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
Acknowledgements
We are grateful to two referees and the Associate Editor for their constructive comments which have greatly improved the paper. Mao is partially supported by NSFC (No.: 12001109, 92046021) and the Science and Technology Commission of Shanghai Municipality grant 20dz1200600. Wang is partially supported by NSFC (No.: 11901487, 72033002) and the Fundamental Scientific Center of National Natural Science Foundation of China Grant No. 71988101. Yang is partly supported by the NSF DMS 1811245, NIH 1R01AG066883 and 1R01ES031651.
Appendices
A Technical conditions
The technical conditions needed for our analysis are given as follows.
C1. (a) The random errors \(\{\epsilon _{ij}:i=1,\ldots ,N;j=1,\ldots ,L\}\) in (2) are independently distributed random variables such that \(E(\epsilon _{ij})=0\) and \(E(\epsilon ^2_{ij})=\sigma _{ij}^2<\infty\) for all i, j. (b) For some finite positive constants \(c_{\sigma }\) and \(\eta\), \({\max }_{i,j}E|\epsilon _{ij}|^{l}\le \frac{1}{2}l!c_{\sigma }^{2}\eta ^{l-2}\) for any positive integer \(l\ge 2\).
C2. The inclusion probability satisfies \(\pi _i\asymp nN^{-1}\) for \(i=1,\ldots ,N\).
C3. The population design matrix \({\varvec{X}}_N\) is of size \(N\times d\) with \(N>d\). Moreover, there exists a positive constant \(a_{x}\) such that \(\Vert {\varvec{X}}_N\Vert _{\infty }\le a_{x}\) and \({\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N\) is invertible, where \({\varvec{D}}_N\) is a diagonal matrix with \(\pi _i\) as its (i, i)th entry. Furthermore, there exists a symmetric matrix \({\varvec{S}}_{{\varvec{X}}}\) with \(\sigma _{\min }({\varvec{S}}_{{\varvec{X}}})\asymp 1\asymp \Vert {\varvec{S}}_{{\varvec{X}}}\Vert\) such that \(n_0^{-1}{\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N\rightarrow {\varvec{S}}_{{\varvec{X}}}\) as \(N\rightarrow \infty\), where \(n_0=\sum _{i=1}^N\pi _i\) is the expected sample size.
C4. There exists a positive constant a such that \(\max \{\Vert {\varvec{X}}_N\varvec{\beta }^{*}\Vert _{\infty },\Vert {\varvec{A}}_N\Vert _{\infty }\}\le a\).
C5. The indicators of observed entries \(\{r_{ij}:i=1,\ldots ,N;j=1,\ldots ,L\}\) are mutually independent, \(r_{ij}\sim \text {Bern}(p_{ij})\) for \(p_{ij}\in (0,1)\), and are independent of \(\{\epsilon _{ij}\}_{i,j=1}^{N,L}\) given \({\varvec{X}}_N\). Furthermore, for \(i=1,\dots ,N\) and \(j=1,\dots ,L\), \(\Pr (r_{ij}=1 | {\varvec{x}}_{i}, y_{ij}) = \Pr (r_{ij}=1 | {\varvec{x}}_{i})\) follows the logistic regression model (7).
C6. There exists a lower bound \(p_{\min }\in (0,1)\) such that \({\min }_{i,j}\{p_{ij}\}\ge p_{\min }>0\), where \(p_{\min }\) is allowed to depend on n and L. The number of questions \(L\le n\).
C7. The sampling design satisfies \(N^{-1}\sum _{i=1}^NI_i\pi _i^{-1}y_i - N^{-1}\sum _{i=1}^Ny_i= O_p(n^{-1/2})\) if \(N^{-1}\sum _{i=1}^Ny_i^{2}\) is asymptotically bounded, where \(I_i\) is the sampling indicator of the ith element.
Condition C1(a) is a common regularity condition on the measurement errors in \(\varvec{\epsilon }_{N}\), and C1(b) is the Bernstein condition (Koltchinskii et al., 2011). Condition C2 is widely used in survey sampling and regulates the inclusion probabilities of a sampling design (Fuller, 2009). In Condition C3, the requirement \(N>d\) is easily met, as the number of questions in a survey is usually fixed and the population size is typically much larger. Since the dimension of \(n_0^{-1}{\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N\) is fixed at \(d\times d\), it is mild to assume that \({\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N\) is invertible and that a symmetric matrix \({\varvec{S}}_{{\varvec{X}}}\) exists as the limit of \(n_0^{-1}{\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N\). Note that we do not assume randomness in generating \({\varvec{X}}_N\); this is a common assumption in the design-based framework. Furthermore, the sample size is often larger than the number of questions, that is, \(n>d\), and together with Condition C2 it is not hard to show that the probability limit of \(n^{-1}{\varvec{X}}_n^{\mathrm{T}}{\varvec{X}}_n\) is also \({\varvec{S}}_{{\varvec{X}}}\) under regularity conditions. That \(\sigma _{\min }({\varvec{S}}_{{\varvec{X}}})\) and \(\Vert {\varvec{S}}_{{\varvec{X}}}\Vert\) are of order 1 follows from \(\Vert {\varvec{X}}_N\Vert _{\infty }<\infty\). Condition C4 is also standard in the matrix completion literature (Cai and Zhou, 2016; Koltchinskii et al., 2011; Negahban and Wainwright, 2012); in particular, it is reasonable to assume that all responses are bounded in survey sampling. Condition C5 describes the independent Bernoulli model for the response indicator of observing \(y_{ij}\), where the probability of observation \(p_{ij}\) follows the logistic model (7). In Condition C6, the lower bound \(p_{\min }\) is allowed to tend to 0 as n and L grow; this is more general than we need for a typical survey, where \(p_{\min }\asymp 1\) suffices. Typically, the number of questions L grows more slowly than the number of participants n in survey sampling, so the assumption \(L\le n\) is quite mild. Condition C7 is a mild restriction on the estimator of the population mean and is satisfied under general sampling designs. To keep the results general, we make no assumption on the asymptotic relationship between the population size N and the sample size n; see Theorem 1 for details. Further assumptions on the sample sizes can guarantee certain convergence properties; see the discussion of Theorem 3.
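As an illustration of Condition C7, the following is a minimal simulation sketch (not from the paper; the toy population, seed, and all variable names are illustrative) showing that under Poisson sampling with \(\pi_i\asymp nN^{-1}\) the Horvitz–Thompson estimator of a population mean has error of order \(n^{-1/2}\):

```python
# Hedged sketch: Horvitz-Thompson estimation under Poisson sampling,
# illustrating the O_p(n^{-1/2}) error rate assumed in Condition C7.
import numpy as np

rng = np.random.default_rng(0)

def ht_mean(y, pi, included):
    """Horvitz-Thompson estimator of the population mean N^{-1} sum_i y_i."""
    N = len(y)
    return np.sum(included * y / pi) / N

N, n_expected = 100_000, 1_000
y = rng.normal(loc=2.0, scale=1.0, size=N)   # outcomes with bounded second moment
pi = np.full(N, n_expected / N)              # pi_i of order n/N (Condition C2)

errors = []
for _ in range(200):
    included = rng.random(N) < pi            # Poisson sampling indicators I_i
    errors.append(ht_mean(y, pi, included) - y.mean())
errors = np.asarray(errors)

# The standard deviation of the error is on the order of n^{-1/2}.
print(errors.std(), n_expected ** -0.5)
```

The estimator is design-unbiased, and the simulated error spread matches the \(n^{-1/2}\) scale.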
B Lemmas
Under the logistic model (7), together with the results in Mao et al. (2019) and Sweeting (1980), it can be shown that for all \(t>d+3\), there exist some positive constants \(C_g\), C and \(C_{d}\) such that \(\Pr \{\sum _{ij}(1/\widehat{p}_{ij}-1/p_{ij})^{2}\ge C_g p_{\min }^{-3}t\}\le C_{d}t\exp \{-t/2\}+L\max _{j}\sup _{t}|\Pr \{\sum _{i}(1/\widehat{p}_{ij}-1/p_{ij})^{2}\ge t\}-\Pr (\chi ^{2}_{d+1}\ge C_g p_{\min }^{-3}t)|\). Then, \(\max _{j}\sup _{t}|\Pr \{\sum _{i}(1/\widehat{p}_{ij}-1/p_{ij})^{2}\ge t\}-\Pr (\chi ^{2}_{d+1}\ge C_g p_{\min }^{-3}t)|\le L^{-2}\) and \(C_{d}t\exp \{-t/2\}\) is a function independent of n and L such that \(\lim _{t\rightarrow \infty }t\exp \{-t/2\}= 0\).
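The bound above concerns the plug-in probabilities \(\widehat{p}_{ij}\) obtained by fitting the logistic response model (7), one fit per question j. A minimal numpy sketch of such a fit (a plain Newton–Raphson on the logistic likelihood; the toy data and all names are illustrative, not the paper's implementation) is:

```python
# Hedged sketch: estimating observation probabilities p_ij by fitting, for each
# question j, a logistic regression of the response indicators r_ij on x_i.
import numpy as np

def fit_logistic(X, r, n_iter=25):
    """Return the MLE of beta in P(r=1|x) = 1/(1+exp(-x'beta)) via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        # Newton step: beta <- beta + (X'WX)^{-1} X'(r - p)
        beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (r - p))
    return beta

rng = np.random.default_rng(1)
n, d, L = 2_000, 3, 5
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=(n, d - 1))])
beta_true = np.array([1.0, -0.5, 0.5])
p_true = 1.0 / (1.0 + np.exp(-X @ beta_true))

p_hat = np.empty((n, L))
for j in range(L):                            # one logistic fit per question j
    r_j = (rng.random(n) < p_true).astype(float)
    beta_hat = fit_logistic(X, r_j)
    p_hat[:, j] = 1.0 / (1.0 + np.exp(-X @ beta_hat))
```

The inverse probabilities \(1/\widehat{p}_{ij}\) then enter the AIPW estimator; the \(\chi^2_{d+1}\) approximation above controls how far they can stray from \(1/p_{ij}\).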
Write \({\varvec{J}}_{ij}={\varvec{e}}_{i}(n_1){\varvec{e}}^\intercal _{j}(n_2)\), where \({\varvec{e}}_{i}(n)\in {\mathbb {R}}^n\) is the standard basis vector with the i-th element being 1 and the rest being 0. Now we present several lemmas.
Lemma 1
Under Conditions C2 and C3 and Poisson sampling, we have \(n^{-1}{\varvec{X}}_n^{\mathrm{T}}{\varvec{X}}_n\rightarrow {\varvec{S}}_{{\varvec{X}}}\) in probability as \(N\rightarrow \infty\).
Proof of Lemma 1
Let \({\varvec{e}}_i\) be a column vector of length d with ith element being 1 and others being 0. Recall that \(n_0 = \sum _{k=1}^N\pi _k\) is the expected sample size. For \(i=1,\ldots ,d\) and \(j=1,\ldots ,d\), consider
where the expectation is taken with respect to the sampling design, and \({\varvec{x}}_i\) is the ith row of \({\varvec{X}}_N\).
Under Poisson sampling, we have
By a similar argument in Wang et al. (2019), we can show that \(n/n_0\rightarrow 1\) in probability as \(N\rightarrow \infty\) under Condition C2. By Condition C3, (16) and (17), we have proved Lemma 1. \(\square\)
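A quick numerical check of the convergence discussed after Condition C3 and established here (an illustrative sketch with a toy design matrix, not from the paper): under Poisson sampling with \(\pi_i\asymp nN^{-1}\), the sample Gram matrix \(n^{-1}{\varvec{X}}_n^{\mathrm{T}}{\varvec{X}}_n\) is close to \(n_0^{-1}{\varvec{X}}_N^{\mathrm{T}}{\varvec{D}}_N{\varvec{X}}_N\).

```python
# Hedged sketch: under Poisson sampling, n^{-1} X_n' X_n approximates
# n_0^{-1} X_N' D_N X_N, the quantity appearing in Condition C3.
import numpy as np

rng = np.random.default_rng(3)
N, d, n_expected = 200_000, 4, 5_000
X_N = rng.uniform(-1, 1, size=(N, d))        # bounded covariates, as in C3
pi = np.full(N, n_expected / N)              # inclusion probabilities (C2)

target = (X_N.T * pi) @ X_N / pi.sum()       # n_0^{-1} X_N' D_N X_N
included = rng.random(N) < pi                # Poisson sampling indicators
X_n = X_N[included]
sample_gram = X_n.T @ X_n / included.sum()   # n^{-1} X_n' X_n

print(np.max(np.abs(sample_gram - target)))  # small for large n
```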
Lemma 2
Let \(\Psi ^{(1)}=\sum _{ij}r_{ij}\epsilon _{ij}{\varvec{J}}_{ij}/(nL\widehat{p}_{ij}\pi _{ij}^{1/2})\). Under Conditions C1, C5 and C6 and Poisson sampling, for some positive constants \(C_1\), \(c_{\sigma }\), \(\eta\), \(\delta _{\sigma }\) and all \(t>d+3\), we have
holds with probability at least \(1-1/(n+L)-C_{d}t\exp \{-t/2\}-1/L-12c_{\sigma }^{2}\eta ^{2}\log ^{-\delta _{\sigma }}(n)\).
Lemma 3
Let \(\Psi ^{(2)}=\sum _{ij}a_{ij}(r_{ij}/p_{ij}-1){\varvec{J}}_{ij}/(nL\pi _{ij}^{1/2})\). Under Conditions C4–C6 and Poisson sampling, for some positive constant \(C_2\), we have
holds with probability at least \(1-1/(n+L)\).
Lemma 4
Let \(\Psi ^{(3)}=\sum _{ij}a_{ij}(r_{ij}/{\widehat{p}}_{ij}-r_{ij}/p_{ij}){\varvec{J}}_{ij}/(nL\pi _{ij}^{1/2})\). Under Conditions C4 and C6 and Poisson sampling, for some positive constants \(C_3\), \(\delta _{\sigma }\) and all \(t>d+3\), we have
holds with probability at least \(1-C_{d}t\exp \{-t/2\}-1/L\).
Lemmas 2–4 follow easily from the proofs of Lemmas S4.1–S4.3 in the supplementary material of Mao et al. (2019).
C Proofs
1.1 C.1 Proof of Theorem 1
With the definition of \(\Delta (\delta _{\sigma },t)\) in (12), under Conditions C1–C6 and Poisson sampling, together with Lemmas 2–4, we have, for a positive constant \(C_0\),
with probability at least \(1-2/n-2C_{d}t\exp \{-t/2\}-2/L-12c_{\sigma }^{2}\eta ^{2}\log ^{-\delta _{\sigma }}(n)\).
Let \({\varvec{X}}_n^{\prime }={\varvec{D}}_n^{-1/2}{\varvec{X}}_n\) and \(\eta _{n,L}(\delta _{\sigma },t) = 4/(n+L)+ 4C_{d}t\exp \{-t/2\}+4/L+C\log ^{-\delta _{\sigma }}(n)\) for a positive constant C. Choosing t as in (13), \(\tau _1\asymp N^{-1}nL^{-1}\log ^{-1/2}(n)\Delta (\delta _{\sigma })\) and \(\tau _2\asymp \eta _{g}^{-1/2}N^{-1}n^{1/4}L^{-1/4}\log ^{1/2}(L)\log ^{\delta _{\sigma }/3}(n)\), where \(\Delta (\delta _{\sigma })=N^{1/2}n^{-1}L^{-1}\log ^{1/2}(n)p_{\min }^{-1/2}\) and \(1-\alpha \asymp (nL)^{-1}\) in (8) for any \(\delta _{\sigma }>0\), together with Condition C2 and Poisson sampling, the same argument as in the proof of Corollary 1 of Mao et al. (2019) shows that, for some constants \(C_1\) and \(C_2\), with probability at least \(1-\eta _{n,L}(\delta _{\sigma },t)\),
Thus it is easy to obtain that
under Condition C3. \(\square\)
1.2 C.2 Proof of Theorem 2
By the observations that
together with Theorem 1, it is easy to obtain the result under Condition C2 and Poisson sampling.\(\square\)
1.3 C.3 Proof of Theorem 3
Denote
for \(j=1,\ldots ,L\). The difference between \(\widehat{\theta }_{j,{\textit{AIPW}}}\) in (10) and \({\tilde{\theta }}_{j,{\textit{AIPW}}}\) in (18) is that we use estimators \(\widehat{N}\), \(\widehat{p}_{ij}\) and \(\widehat{a}_{ij}\) for \(\widehat{\theta }_{j,{\textit{AIPW}}}\) but use true values N, \(p_{ij}\) and \(a_{ij}\) for \({\tilde{\theta }}_{j,{\textit{AIPW}}}\). The difference between \(\widehat{\theta }_{j,{\textit{AIPW}}}\) and \({\theta }^\dag _{j,{\textit{AIPW}}}\) in (19) is that we use \(\widehat{N}\) for \(\widehat{\theta }_{j,{\textit{AIPW}}}\), but use N for \({\theta }^\dag _{j,{\textit{AIPW}}}\).
First, we prove
Consider
where the first equality holds due to \(E(r_{ij}) =p_{ij}\). Next, we derive the variance of \({\tilde{\theta }}_{j,{\textit{AIPW}}}\). Specifically, we have
where \(S=\{I_i:i=1,\ldots ,N\}\).
Because \(E(r_{ij}) = p_{ij}\), we have
Thus, we have
where the last equality holds by Conditions C1, C2 and the strong law of large numbers (Athreya and Lahiri, 2006). Notice that
By the models (1)–(2) and Condition C4, we can show that \(N^{-1}\sum _{i=1}^Ny_i^2\) is asymptotically bounded. Thus, by Condition C7, we have
By (21)–(24), we have shown (20).
Next, we show that
Consider
Consider
where the asymptotic order in (28) holds due to Condition C7, and \(N^{-1}\sum _{i=1}^N(y_{ij}-a_{ij})^2\) is asymptotically bounded in probability since \(\{\epsilon _{ij}:i=1,\ldots ,N\}\) are independent and their variances are uniformly bounded. By (27) and (28), we have
Because the response model (7) for \(p_{ij}\) is assumed to be correctly specified, and \(p_{ij}\) is bounded away from 0 by Condition C6, we have \(\widehat{p}_{ij}^{-1}(p_{ij}-\widehat{p}_{ij}) = O_p(1)\) uniformly for \(i=1,\ldots ,N\). Thus, by (29), we have
Since \(\widehat{p}_{ij}=p_{ij}+o_p(1)\), we have
Since \(p_{ij}\ge p_{\mathrm {min}}>0\) by Condition C6, we have
uniformly for \(i=1,\ldots ,N\).
Thus, by Condition C2, (31) and (33), we have
where \(\pi _i\ge C_6^{-1}nN^{-1}\) for \(C_6>0\) by Condition C2, and we have assumed that the first n subjects are sampled. By (20), (34) and Theorem 2 and the fact that \(L\le n\), we have proved (25).
By Condition C4 and the fact that \(E(\epsilon ^2_{ij})<\sigma _0^2\) uniformly, \(\theta _j\) is uniformly bounded for \(j=1,\ldots ,L\) in probability. Thus, by (20) and (25), we conclude that
By Condition C7, we conclude that \(\widehat{N}N^{-1}=1+O_p(n^{-1/2})\). Consider
where the first equality holds since \(\widehat{N}N^{-1}=1+O_p(n^{-1/2})\) uniformly for \(j=1,\ldots ,L\), and the second equality holds by (35). Thus, by (34) and (36), we have proved Theorem 3. \(\square\)
D Plug-in variance estimators
When deriving the plug-in variance estimator, we ignore the variability for estimating \(\widehat{a}_{ij}\). First, we consider the plug-in variance estimator under Poisson sampling. For \(j=1,\ldots ,L\), let
be the estimating function for the AIPW estimator with \(\widehat{a}_{ij}\) replaced by \({a}_{ij}\). Let \({\tilde{\theta }}_{j,{\textit{AIPW}}}\) solve \(g_j(\theta )=0\); we use a variance estimator of \({\tilde{\theta }}_{j,{\textit{AIPW}}}\) to approximate that of \(\widehat{\theta }_{j,{\textit{AIPW}}}\).
It can be shown that \({\tilde{\theta }}_{j,{\textit{AIPW}}}-\theta _j=O_p(n^{-1/2})\), so we have
where \(g_j'(\theta ) = -N^{-1}\sum _{i=1}^NI_i\pi _i^{-1}\) is the derivative of \(g_j(\theta )\). By a similar argument for (27)–(28), we can show that \(g_j'(\theta _j)\rightarrow -1\) in probability. Besides, by (37), we have
Thus, the variance of \({\tilde{\theta }}_{j,{\textit{AIPW}}}\) can be estimated by that of \(g_j(\theta _j)\).
Consider
Since \(E(r_{ij}) = p_{ij}\), we have
Thus, we have
and it can be estimated by
Notice that
Under Poisson sampling,
Thus,
and it can be estimated by
By (38)–(42) and plugging in \(\widehat{a}_{ij}\) and \(\widehat{\theta }_{j,{\textit{AIPW}}}\) for \(a_{ij}\) and \(\theta _j\), respectively, the plug-in variance estimator for \({\widehat{\theta }}_{j,{\textit{AIPW}}}\) is
under Poisson sampling.
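The point estimator underlying this variance estimator can be sketched numerically. The form below, Horvitz–Thompson weighting of \(a_{ij}\) plus an inverse-probability-weighted residual, normalised by \(\widehat{N}=\sum_{i\in S}\pi_i^{-1}\), is an illustrative reconstruction implied by the discussion around (10), (18) and (19), not the paper's exact display; the toy data are likewise assumptions.

```python
# Hedged sketch of the AIPW estimator of theta_j = N^{-1} sum_i y_ij
# from a Poisson sample with missing responses.
import numpy as np

def aipw_mean(y, a, r, p, pi, included):
    """AIPW estimate: N_hat^{-1} sum_{i in S} pi_i^{-1} {a_ij + r_ij (y_ij - a_ij)/p_ij}."""
    w = included / pi                            # design weights I_i / pi_i
    N_hat = w.sum()                              # estimated population size
    return np.sum(w * (a + r * (y - a) / p)) / N_hat

rng = np.random.default_rng(2)
N, n_expected = 50_000, 2_000
a = rng.normal(size=N)                           # working values a_ij
y = a + rng.normal(scale=0.5, size=N)            # y_ij = a_ij + eps_ij
p = np.clip(1.0 / (1.0 + np.exp(-a)), 0.2, 0.9)  # response probabilities p_ij (MAR)
pi = np.full(N, n_expected / N)                  # Poisson inclusion probabilities

included = rng.random(N) < pi                    # sampling indicators I_i
r = (rng.random(N) < p).astype(float)            # response indicators r_ij
theta_hat = aipw_mean(y, a, r, p, pi, included)
print(theta_hat, y.mean())
```

In practice \(p_{ij}\) and \(a_{ij}\) are replaced by \(\widehat{p}_{ij}\) and \(\widehat{a}_{ij}\), as in the plug-in variance estimator above.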
Using a similar argument, we can show that the plug-in variance estimator is
under simple random sampling, while the corresponding estimator is
under probability-proportional-to-size sampling, where S is the index set of the sample, and \(q_i\) is the selection probability of the ith element.
E Balanced repeated replication method
Consider a stratified multi-stage sampling design with two clusters selected per stratum in the first stage. Denote by \(w_{hik}\) the survey weight associated with \(y_{hik}\), the kth sample element in the ith cluster of the hth stratum. The basic idea of the modified balanced repeated replication is to apply the same estimation method with "reconstructed" survey weights, that is,
where \(\epsilon \in (0,1)\) is a predefined constant, and \(\delta _{rh}=1\) or \(\delta _{rh} = -1\) for the rth repetition. A set of R repetitions is said to be balanced if \(\sum _{r=1}^R\delta _{rh}\delta _{rh'}=0\) for \(h\ne h'\). The \(R\times H\) matrix \((\delta _{rh})_{R\times H}\) can be obtained from a Hadamard matrix; see Rao and Shao (1999) for details on the modified balanced repeated replication method. In the simulation study and real data application, we choose \(\epsilon = 1/2\) and \(R =H\).
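The replicate-weight construction above can be sketched as follows. The scaling \(w_{hik}(1+\epsilon \delta_{rh})\) for the first cluster and \(w_{hik}(1-\epsilon \delta_{rh})\) for the second is the usual Rao–Shao form and is an assumption here, since the paper's display is not reproduced above; the toy sample layout is likewise illustrative.

```python
# Hedged sketch: modified BRR replicate weights built from a Hadamard matrix.
import numpy as np

def sylvester_hadamard(k):
    """Hadamard matrix of order 2**k via the Sylvester construction."""
    H = np.array([[1]])
    for _ in range(k):
        H = np.block([[H, H], [H, -H]])
    return H

def brr_weights(w, stratum, cluster, delta, eps=0.5):
    """Return an (R, n) array of replicate weights w^{(r)}_{hik}."""
    sign = np.where(cluster == 1, 1.0, -1.0)     # +1 for cluster 1, -1 for cluster 2
    # w^{(r)}_{hik} = w_{hik} * (1 + eps * delta_{rh} * sign_i)
    return w * (1.0 + eps * delta[:, stratum] * sign)

H_strata, eps = 8, 0.5
delta = sylvester_hadamard(3)                    # R = H = 8 repetitions
# balance: sum_r delta_rh delta_rh' = 0 for h != h'
assert (delta.T @ delta == 8 * np.eye(8)).all()

# toy sample: one element per cluster, two clusters per stratum
stratum = np.repeat(np.arange(H_strata), 2)
cluster = np.tile([1, 2], H_strata)
w = np.full(stratum.size, 100.0)                 # base survey weights
w_rep = brr_weights(w, stratum, cluster, delta, eps)
print(w_rep.shape)
```

Each row of `w_rep` is one replicate's weight vector; re-running the point estimator with each row and combining the replicate estimates yields the BRR variance estimate.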
Mao, X., Wang, Z. & Yang, S. Matrix completion under complex survey sampling. Ann Inst Stat Math 75, 463–492 (2023). https://doi.org/10.1007/s10463-022-00851-5