Two seemingly paradoxical results in linear models: the variance inflation factor and the analysis of covariance

Peng Ding

doi:10.1515/jci-2019-0023

Open Access Published by De Gruyter March 23, 2021

From the journal Journal of Causal Inference

https://doi.org/10.1515/jci-2019-0023

Abstract

A result from a standard linear model course is that the variance of the ordinary least squares (OLS) coefficient of a variable will never decrease when including additional covariates into the regression. The variance inflation factor (VIF) measures the increase of the variance. Another result from a standard linear model or experimental design course is that including additional covariates in a linear model of the outcome on the treatment indicator will never increase the variance of the OLS coefficient of the treatment at least asymptotically. This technique is called the analysis of covariance (ANCOVA), which is often used to improve the efficiency of treatment effect estimation. So we have two paradoxical results: adding covariates never decreases the variance in the first result but never increases the variance in the second result. In fact, these two results are derived under different assumptions. More precisely, the VIF result conditions on the treatment indicators but the ANCOVA result averages over them. Comparing the estimators with and without adjusting for additional covariates in a completely randomized experiment, I show that the former has smaller variance averaging over the treatment indicators, and the latter has smaller variance at the cost of a larger bias conditioning on the treatment indicators. Therefore, there is no real paradox.

Keywords: Causal inference; Conditioning; Design-based inference; Potential outcomes; Randomization; Rerandomization

MSC 2010: 62-01; 62A01; 62J10

1 Variance inflation factor

Consider the following linear regression:

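$$y_i = \alpha + \tau z_i + \beta' x_i + \varepsilon_i, \quad (i = 1, \ldots, n) \tag{1}$$

where $z_i$ is a scalar and $x_i$ is a scalar or vector. Without loss of generality, we center $x$ so that $\bar{x} = n^{-1} \sum_{i=1}^n x_i = 0$. In a leading example, $z_i$ is the treatment variable and $x_i$ contains all the pre-treatment covariates. Using a standard result in linear models, we can write the OLS estimator for $\tau$ as

$$\hat{\tau}_a = \frac{\sum_{i=1}^n \check{z}_i y_i}{\sum_{i=1}^n \check{z}_i^2},$$

where $\check{z}_i$ is the residual from the OLS fit of $z_i$ on $(1, x_i)$. This result is also called the Frisch–Waugh–Lovell theorem in econometrics; see [1] for a recent review. If the regressors $(z_i, x_i)$'s are all fixed and the $\varepsilon_i$'s are independent and identically distributed (IID) with mean 0 and variance $\sigma^2$ as in the classic linear model, then the variance of $\hat{\tau}_a$ equals

$$\mathrm{var}(\hat{\tau}_a) = \frac{\sum_{i=1}^n \check{z}_i^2 \, \mathrm{var}(y_i)}{\left( \sum_{i=1}^n \check{z}_i^2 \right)^2} = \frac{\sigma^2}{\sum_{i=1}^n \check{z}_i^2} = \frac{\sigma^2}{\sum_{i=1}^n (z_i - \bar{z})^2} \times \frac{\sum_{i=1}^n (z_i - \bar{z})^2}{\sum_{i=1}^n \check{z}_i^2}. \tag{2}$$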

The first term of (2) equals the variance of

$$\hat{\tau} = \frac{\sum_{i=1}^n (z_i - \bar{z}) y_i}{\sum_{i=1}^n (z_i - \bar{z})^2}, \tag{3}$$

i.e., the coefficient of $z_i$ in the OLS fit of $y_i$ on $(1, z_i)$ without adjusting for $x_i$. The second term of (2) is the VIF, which is no smaller than 1 because it is the total sum of squares divided by the residual sum of squares in the OLS fit of $z_i$ on $(1, x_i)$. The VIF can be equivalently written as $(1 - R^2_{z|x})^{-1}$, where $R^2_{z|x}$ is the sample $R^2$ between $z_i$ and $x_i$. So the VIF result can also be written as

$$\mathrm{var}(\hat{\tau}_a) = \mathrm{var}(\hat{\tau}) \times (1 - R^2_{z|x})^{-1}.$$

It highlights the bias-variance tradeoff: with more covariates included, the model is closer to the truth and thus leads to a smaller bias in estimating $\tau$, but at the same time, it results in a larger variance of $\hat{\tau}_a$. See [2], [3], [4] and [5] for textbook discussions.

Thus, from (2), the variance $\mathrm{var}(\hat{\tau}_a)$ will never decrease with more covariates in (1), because the residual sum of squares $\sum_{i=1}^n \check{z}_i^2$ cannot increase while the total sum of squares $\sum_{i=1}^n (z_i - \bar{z})^2$ is constant. An immediate result is

$$\mathrm{var}(\hat{\tau}_a) \geq \mathrm{var}(\hat{\tau}),$$

and the equality holds when $X'Z = 0$, where $Z = (z_1, \ldots, z_n)'$ is the vector formed by the regressors $z_i$'s and $X = (x_1', \ldots, x_n')'$ is the matrix formed by the regressors $x_i$'s. The orthogonality of regressors (i.e., $X'Z = 0$) ensures that $R^2_{z|x} = 0$ and $\hat{\tau} = \hat{\tau}_a$.
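The equivalence between the two forms of the VIF can be checked numerically. The following is a minimal Python sketch under simplifying assumptions that are not in the original: the covariate is scalar, the data are simulated, and `vif_scalar` is a hypothetical helper name.

```python
import random

def vif_scalar(z, x):
    """Return the VIF of z given a scalar x in two ways: TSS/RSS and 1/(1 - R^2)."""
    n = len(z)
    z_bar = sum(z) / n
    x_bar = sum(x) / n
    szz = sum((zi - z_bar) ** 2 for zi in z)   # total sum of squares of z
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    szx = sum((zi - z_bar) * (xi - x_bar) for zi, xi in zip(z, x))
    slope = szx / sxx
    # Residuals from the OLS fit of z on (1, x).
    resid = [(zi - z_bar) - slope * (xi - x_bar) for zi, xi in zip(z, x)]
    rss = sum(r ** 2 for r in resid)           # residual sum of squares
    r2 = szx ** 2 / (szz * sxx)                # sample R^2 between z and x
    return szz / rss, 1.0 / (1.0 - r2)

# Simulated, correlated regressors (illustrative values only).
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(50)]
z = [0.6 * xi + rng.gauss(0, 1) for xi in x]

vif_ratio, vif_r2 = vif_scalar(z, x)
print(vif_ratio, vif_r2)
```

Both returned numbers should agree up to floating-point error and be no smaller than 1, which is the content of the VIF result for a single added covariate.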

2 Analysis of covariance

Now we consider a special case of (1): the $x_i$'s are pre-treatment covariates, the $z_i$'s are the binary treatment indicators (1 for the treatment and 0 for the control), and the $y_i$'s are the outcomes of interest. Then (1) is the standard ANCOVA model [6], and the parameter of interest $\tau$ is the treatment effect. Let $n_1 = \sum_{i=1}^n z_i$ and $n_0 = \sum_{i=1}^n (1 - z_i)$ be the numbers of units under the treatment and control, respectively. As in Section 1, we assume that the $\varepsilon_i$'s are IID with mean 0 and variance $\sigma^2$. In a completely randomized experiment, we further assume that the $z_i$'s come from a random permutation of $n_1$ 1's and $n_0$ 0's.

Because z_i is binary, we can define

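$$\hat{\delta}_x = n_1^{-1} \sum_{i=1}^n z_i x_i - n_0^{-1} \sum_{i=1}^n (1 - z_i) x_i, \quad \hat{\delta}_\varepsilon = n_1^{-1} \sum_{i=1}^n z_i \varepsilon_i - n_0^{-1} \sum_{i=1}^n (1 - z_i) \varepsilon_i$$

as the differences in means of $x$ and $\varepsilon$, respectively. Further let $\hat{\beta}$ be the OLS estimator for $\beta$ in (1). The estimator $\hat{\tau}$ in (3) without adjusting for covariates simplifies to the difference in means of the outcome:

$$\hat{\tau} = n_1^{-1} \sum_{i=1}^n z_i y_i - n_0^{-1} \sum_{i=1}^n (1 - z_i) y_i,$$

which further simplifies to

$$\hat{\tau} = n_1^{-1} \sum_{i=1}^n z_i (\alpha + \tau z_i + \beta' x_i + \varepsilon_i) - n_0^{-1} \sum_{i=1}^n (1 - z_i)(\alpha + \tau z_i + \beta' x_i + \varepsilon_i) = \tau + \beta' \hat{\delta}_x + \hat{\delta}_\varepsilon, \tag{4}$$

under (1). Based on OLS of (1), the estimator $\hat{\tau}_a$ adjusting for the covariates simplifies to

$$\hat{\tau}_a = n_1^{-1} \sum_{i=1}^n z_i (y_i - \hat{\beta}' x_i) - n_0^{-1} \sum_{i=1}^n (1 - z_i)(y_i - \hat{\beta}' x_i) = \hat{\tau} - \hat{\beta}' \hat{\delta}_x = \tau + (\beta - \hat{\beta})' \hat{\delta}_x + \hat{\delta}_\varepsilon.$$

With large samples, we can ignore the term $(\beta - \hat{\beta})' \hat{\delta}_x$ above to obtain

$$\hat{\tau}_a \approx \tau + \hat{\delta}_\varepsilon,$$

because $(\beta - \hat{\beta})' \hat{\delta}_x = O_P(n^{-1})$ is of higher order due to $\hat{\beta} - \beta = O_P(n^{-1/2})$ and $\hat{\delta}_x = O_P(n^{-1/2})$, both justified by central limit theorems under certain moment conditions. See [7] for technical details.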

Under complete randomization, we can show that

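$$E(\hat{\delta}_\varepsilon) = 0, \quad E(\hat{\delta}_x) = 0$$

based on a standard result for the differences in means [8],

$$\mathrm{var}(\hat{\delta}_\varepsilon) = \frac{n}{n_1 n_0} \sigma^2, \quad \mathrm{var}(\hat{\delta}_x) = \frac{n}{n_1 n_0} S_x^2, \tag{5}$$

where $S_x^2 = (n - 1)^{-1} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})'$ is the finite population covariance of the $x_i$'s [9], and moreover, the uncorrelatedness of the two differences in means [7]

$$\mathrm{cov}(\hat{\delta}_\varepsilon, \hat{\delta}_x) = 0.$$

Then $E(\hat{\tau}) = \tau$ and $E(\hat{\tau}_a) \approx \tau$, i.e., $\hat{\tau}$ is unbiased and $\hat{\tau}_a$ is consistent for $\tau$. Their variances satisfy

$$\mathrm{var}(\hat{\tau}) - \mathrm{var}(\hat{\tau}_a) \approx \mathrm{var}(\beta' \hat{\delta}_x) = \frac{n}{n_1 n_0} \beta' S_x^2 \beta \geq 0.$$

Thus, if $\beta \neq 0$ then ANCOVA improves estimation efficiency, at least asymptotically. See [10], [11] and [12] for textbook discussions.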
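The efficiency gain from ANCOVA can be illustrated by simulation. The sketch below is a minimal Monte Carlo in Python, not part of the original analysis: it assumes a scalar covariate, arbitrary illustrative parameter values, and hypothetical helper names (`diff_in_means`, `adjusted_estimate`), and it redraws both the errors and the treatment indicators in each replication.

```python
import random
from statistics import mean, pvariance

def diff_in_means(z, v):
    """Difference in means of v between treated (z = 1) and control (z = 0) units."""
    n1 = sum(z)
    n0 = len(z) - n1
    treated = sum(vi for zi, vi in zip(z, v) if zi == 1)
    control = sum(vi for zi, vi in zip(z, v) if zi == 0)
    return treated / n1 - control / n0

def adjusted_estimate(z, x, y):
    """Coefficient of z in the OLS fit of y on (1, z, x); x is scalar and centered."""
    z_bar = sum(z) / len(z)
    slope = sum((zi - z_bar) * xi for zi, xi in zip(z, x)) / sum(xi * xi for xi in x)
    # Residual of z on (1, x), then the Frisch-Waugh-Lovell formula.
    z_res = [(zi - z_bar) - slope * xi for zi, xi in zip(z, x)]
    return sum(r * yi for r, yi in zip(z_res, y)) / sum(r * r for r in z_res)

rng = random.Random(2021)
n, n1 = 40, 20
alpha, tau, beta, sigma = 0.0, 1.0, 2.0, 1.0   # illustrative values only
x = [rng.gauss(0, 1) for _ in range(n)]
x_bar = sum(x) / n
x = [xi - x_bar for xi in x]                   # fixed, centered covariates

unadjusted, adjusted = [], []
z = [1] * n1 + [0] * (n - n1)
for _ in range(4000):
    rng.shuffle(z)                             # complete randomization
    y = [alpha + tau * zi + beta * xi + rng.gauss(0, sigma) for zi, xi in zip(z, x)]
    unadjusted.append(diff_in_means(z, y))
    adjusted.append(adjusted_estimate(z, x, y))

var_unadjusted = pvariance(unadjusted)
var_adjusted = pvariance(adjusted)
print(mean(unadjusted), mean(adjusted), var_unadjusted, var_adjusted)
```

With a predictive covariate, both Monte Carlo means should be close to the true effect, while the simulated variance of the adjusted estimator should be clearly smaller than that of the difference in means.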

3 From conflict to unification

3.1 A unified data generating process

From the VIF result, we see that adding more covariates never decreases the variance of an OLS coefficient. In contrast, from the ANCOVA result, we see that adding more covariates never increases the variance of an OLS coefficient at least asymptotically. These two results are both standard in textbooks of linear models or experimental design. However, they seem to give opposite conclusions. Both results are derived under the linear model (1), and therefore, these two conflicting results seem paradoxical.

If we go back to check the derivations above carefully, we will find that Section 1 assumes that the z_i's and x_i's are both fixed, but Section 2 assumes that the z_i's are random and the x_i's are fixed. Therefore, the VIF and the ANCOVA results hold under different assumptions on the treatment indicators. This vaguely explains the paradox.

Technically, the settings for VIF and ANCOVA are slightly different. For example, z_i can be general and may have an arbitrary correlation structure with x_i in the VIF result, but it is binary and arises from complete randomization in the ANCOVA result. The data generating process below comes from the intersection of the settings for the two results, which allows for a more unified discussion of them.

Consider the following data generating process: for i = 1, . . . , n,

(a) fix the $x_i$'s and center them at $\bar{x} = n^{-1} \sum_{i=1}^n x_i = 0$;

(b) generate the potential outcomes under control as
$$y_i(0) = \alpha + \beta' x_i + \varepsilon_i,$$
where $\mathcal{E} = (\varepsilon_1, \ldots, \varepsilon_n)$ are IID with mean 0 and variance $\sigma^2$;

(c) generate the potential outcomes under treatment as
$$y_i(1) = y_i(0) + \tau,$$
i.e., the individual treatment effect $y_i(1) - y_i(0)$ is constant;

(d) generate $Z = (z_1, \ldots, z_n)$ from a random permutation of $n_1$ 1's and $n_0$ 0's;

(e) obtain the observed outcome as
$$y_i = z_i y_i(1) + (1 - z_i) y_i(0) = \tau z_i + y_i(0) = \alpha + \tau z_i + \beta' x_i + \varepsilon_i. \tag{6}$$

In (b) and (c), I use the potential outcomes notation due to [8]. Readers who are uncomfortable with the notation y_i(1) and y_i(0) can ignore steps (b) and (c) and view (6) as the data generating process with random ɛ_i's and z_i's.

In the above data generating process, τ represents the individual and thus the average treatment effect. It is the parameter of interest.
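The data generating process in Section 3.1 can be written as a short simulation routine. The Python sketch below is one possible rendering with a scalar covariate and Gaussian errors; the function name `generate_data`, the Gaussian choice, and the parameter values are assumptions made for illustration, since the text only requires IID errors with mean 0 and variance $\sigma^2$.

```python
import random

def generate_data(x, n1, alpha, tau, beta, sigma, rng):
    """One draw from the data generating process with a scalar covariate."""
    n = len(x)
    x_bar = sum(x) / n
    xc = [xi - x_bar for xi in x]                                   # fix and center x
    y0 = [alpha + beta * xi + rng.gauss(0, sigma) for xi in xc]     # control outcomes
    y1 = [v + tau for v in y0]                                      # constant effect
    z = [1] * n1 + [0] * (n - n1)
    rng.shuffle(z)                                                  # complete randomization
    y = [zi * a + (1 - zi) * b for zi, a, b in zip(z, y1, y0)]      # observed outcomes
    return xc, z, y0, y1, y

rng = random.Random(7)
x_raw = [rng.uniform(-1, 1) for _ in range(10)]
xc, z, y0, y1, y = generate_data(x_raw, n1=5, alpha=0.5, tau=1.0, beta=2.0,
                                 sigma=1.0, rng=rng)
print(z)
print([round(v, 3) for v in y])
```

Holding `z` fixed and redrawing the errors corresponds to conditioning on $Z$, whereas holding `y0` and `y1` fixed and reshuffling `z` corresponds to conditioning on the error terms.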

3.2 Comparing the variances

Conditional on Z, (6) is a linear model with fixed (z_i, x_i)'s and homoskedastic errors ɛ_i's. The discussion in Section 1 applies in this case. Then from the VIF result, we know that

$$\mathrm{var}(\hat{\tau}_a \mid Z) \geq \mathrm{var}(\hat{\tau} \mid Z), \tag{7}$$

i.e., the estimator adjusting for covariates $x_i$'s has larger variance. However, $\hat{\tau}_a$ is unbiased but $\hat{\tau}$ is biased. From the classic OLS theory,

$$E(\hat{\tau}_a \mid Z) = \tau,$$

and from (4), the bias of $\hat{\tau}$ is

$$E(\hat{\tau} \mid Z) - \tau = \beta' \hat{\delta}_x.$$

Therefore, the smaller conditional variance of $\hat{\tau}$ comes at the cost of having a larger conditional bias. The conditional bias of $\hat{\tau}$ vanishes only in the special cases with $\beta = 0$ or $\hat{\delta}_x = 0$, that is, the covariates are unrelated to the outcome, or the covariates are perfectly balanced in means across the treatment and control groups.
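The conditional statements can be checked by fixing one allocation $Z$ and redrawing only the errors. The Python sketch below assumes a scalar covariate, Gaussian errors, and illustrative parameter values, none of which come from the original; it uses $\sum_{i=1}^n (z_i - \bar{z})^2 = n_1 n_0 / n$ for binary $z_i$ to write the two conditional variances in closed form.

```python
import random
from statistics import mean

rng = random.Random(1)
n, n1 = 30, 15
n0 = n - n1
alpha, tau, beta, sigma = 0.0, 1.0, 2.0, 1.0   # illustrative values only

x = [rng.gauss(0, 1) for _ in range(n)]
x_bar = sum(x) / n
x = [xi - x_bar for xi in x]                   # fixed, centered covariates
z = [1] * n1 + [0] * n0
rng.shuffle(z)                                 # one realized allocation, held fixed

# Realized covariate imbalance for this allocation.
delta_x = (sum(xi for zi, xi in zip(z, x) if zi == 1) / n1
           - sum(xi for zi, xi in zip(z, x) if zi == 0) / n0)

# Residual of z on (1, x).
z_bar = n1 / n
slope = sum((zi - z_bar) * xi for zi, xi in zip(z, x)) / sum(xi * xi for xi in x)
z_res = [(zi - z_bar) - slope * xi for zi, xi in zip(z, x)]
ss_res = sum(r * r for r in z_res)

# Exact conditional variances from the classic linear model.
var_unadj_given_z = sigma ** 2 * n / (n1 * n0)
var_adj_given_z = sigma ** 2 / ss_res

# Monte Carlo over the errors only, with Z fixed.
unadj, adj = [], []
for _ in range(5000):
    y = [alpha + tau * zi + beta * xi + rng.gauss(0, sigma) for zi, xi in zip(z, x)]
    unadj.append(sum((zi - z_bar) * yi for zi, yi in zip(z, y)) / (n1 * n0 / n))
    adj.append(sum(r * yi for r, yi in zip(z_res, y)) / ss_res)

print("conditional bias of unadjusted:", mean(unadj) - tau, "vs", beta * delta_x)
print("conditional bias of adjusted:  ", mean(adj) - tau)
print("conditional variances:", var_unadj_given_z, var_adj_given_z)
```

For a given allocation, the simulated bias of the unadjusted estimator should track $\beta' \hat{\delta}_x$, while the adjusted estimator pays for its unbiasedness with a conditional variance that is at least as large.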

Averaging over Z, we have random potential outcomes and random treatment indicators. The discussion in Section 2 applies in this case. We have shown that

$$E(\hat{\tau}) = \tau, \quad E(\hat{\tau}_a) \approx \tau,$$

and moreover, asymptotically (ignoring higher order terms),

$$\mathrm{var}(\hat{\tau}) \geq \mathrm{var}(\hat{\tau}_a). \tag{8}$$

Mathematically, the efficiency reversal results in (7) and (8) do not lead to contradiction given the explicitly specified conditioning sets. Statistically, however, they form a paradox that is similar to the classic Simpson's paradox of effect reversal due to different conditioning sets. I give a simple explanation of this paradox using the following decompositions based on the law of total variance:

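$$\mathrm{var}(\hat{\tau}_a) = E\{\mathrm{var}(\hat{\tau}_a \mid Z)\} + \mathrm{var}\{E(\hat{\tau}_a \mid Z)\} = E\{\mathrm{var}(\hat{\tau}_a \mid Z)\} + \mathrm{var}(\tau) = E\{\mathrm{var}(\hat{\tau}_a \mid Z)\}$$

and

$$\mathrm{var}(\hat{\tau}) = E\{\mathrm{var}(\hat{\tau} \mid Z)\} + \mathrm{var}\{E(\hat{\tau} \mid Z)\} = E\{\mathrm{var}(\hat{\tau} \mid Z)\} + \mathrm{var}(\tau + \beta' \hat{\delta}_x) = E\{\mathrm{var}(\hat{\tau} \mid Z)\} + \mathrm{var}(\beta' \hat{\delta}_x).$$

Based on the VIF result, $E\{\mathrm{var}(\hat{\tau}_a \mid Z)\} \geq E\{\mathrm{var}(\hat{\tau} \mid Z)\}$, but their difference is small because the $R^2_{z|x}$ between $z_i$ and $x_i$ is close to zero under complete randomization. We can ignore their difference in the asymptotic analysis. More importantly, the unadjusted estimator has an additional variance term due to the conditional bias, which reverses the ordering of $\mathrm{var}(\hat{\tau}_a)$ and $\mathrm{var}(\hat{\tau})$.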
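The claim that $R^2_{z|x}$ is close to zero under complete randomization can be made concrete with a short calculation. It is added here as a sketch rather than taken from the cited references; it assumes that $S_x^2$ is invertible and writes $K = \dim(x)$, a symbol not used elsewhere in the text.

```latex
\begin{align*}
\sum_{i=1}^n (z_i - \bar{z}) x_i
  &= \sum_{i=1}^n z_i x_i = \frac{n_1 n_0}{n} \hat{\delta}_x,
  \qquad
  \sum_{i=1}^n (z_i - \bar{z})^2 = \frac{n_1 n_0}{n},
  \qquad
  \sum_{i=1}^n x_i x_i' = (n - 1) S_x^2, \\
R^2_{z|x}
  &= \frac{\left\{ \sum_{i=1}^n (z_i - \bar{z}) x_i \right\}'
           \left( \sum_{i=1}^n x_i x_i' \right)^{-1}
           \left\{ \sum_{i=1}^n (z_i - \bar{z}) x_i \right\}}
          {\sum_{i=1}^n (z_i - \bar{z})^2}
   = \frac{1}{n - 1} \,
     \hat{\delta}_x' \left( \frac{n}{n_1 n_0} S_x^2 \right)^{-1} \hat{\delta}_x, \\
E\left( R^2_{z|x} \right)
  &= \frac{1}{n - 1} \,
     \mathrm{tr}\left\{ \mathrm{cov}(\hat{\delta}_x)^{-1}
                        E\left( \hat{\delta}_x \hat{\delta}_x' \right) \right\}
   = \frac{K}{n - 1}.
\end{align*}
```

The first line uses the centering $\bar{x} = 0$ and the binary nature of $z_i$; the last line uses $E(\hat{\delta}_x) = 0$ and the covariance formula in (5). Under these assumptions the expected $R^2_{z|x}$ is of order $1/n$, so the VIF is $1 + O_P(n^{-1})$, consistent with ignoring it in the asymptotic analysis.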

3.3 Comparing the estimated variances

Sections 1 and 2 compare the variances of $\hat{\tau}_a$ and $\hat{\tau}$, which are theoretical quantities depending on the unknown true data generating process. In practice, standard statistical software packages report the estimated variances based on OLS:

$$\widehat{\mathrm{var}}(\hat{\tau}_a) = \frac{\hat{\sigma}^2_{y|z,x}}{\sum_{i=1}^n (z_i - \bar{z})^2} \times \frac{1}{1 - R^2_{z|x}}$$

and

$$\widehat{\mathrm{var}}(\hat{\tau}) = \frac{\hat{\sigma}^2_{y|z}}{\sum_{i=1}^n (z_i - \bar{z})^2},$$

where $\hat{\sigma}^2_{y|z,x}$ equals the residual sum of squares divided by $n - 2 - \dim(x)$ in the OLS fit of $y_i$ on $(1, z_i, x_i)$, and $\hat{\sigma}^2_{y|z}$ equals the residual sum of squares divided by $n - 2$ in the OLS fit of $y_i$ on $(1, z_i)$. The ratio of these two variances depends on $\hat{\sigma}^2_{y|z,x} / \hat{\sigma}^2_{y|z}$ and $R^2_{z|x}$, and it can be larger or smaller than 1. Importantly, this is a numeric result regardless of whether or not we condition on $Z$.

In fact, under the data generating process in Section 3.1, $\widehat{\mathrm{var}}(\hat{\tau}_a)$ is often smaller than $\widehat{\mathrm{var}}(\hat{\tau})$ as long as the covariates are predictive of the outcome. This is true due to two basic facts: first, $R^2_{z|x}$ is close to 0 under complete randomization so the VIF can be ignored asymptotically; second, $\hat{\sigma}^2_{y|z,x}$ is often smaller than $\hat{\sigma}^2_{y|z}$ because the residual sum of squares decreases with an additional predictive covariate. This argument ignores the opposite impact of the degrees of freedom correction, which is reasonable when the sample size is large and the dimension of covariates is small. See [13] for the discussion of $R^2$ with high dimensional covariates.

The above heuristic comparison of $\widehat{\mathrm{var}}(\hat{\tau}_a)$ and $\widehat{\mathrm{var}}(\hat{\tau})$ is not in contradiction with the VIF result, which concerns the true variances conditional on $Z$. The estimated variances can be different from the true variances, especially when the linear model of $y_i$ on $(1, z_i)$ is misspecified.
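The two estimated variances can be computed directly for one simulated data set. The Python sketch below is illustrative only: it assumes a scalar covariate (so the degrees of freedom are $n - 3$ and $n - 2$), Gaussian errors, arbitrary parameter values, and a hypothetical helper `estimated_variances`.

```python
import random

def estimated_variances(z, x, y):
    """OLS variance estimates for the coefficient of z, without and with a scalar x."""
    n = len(y)
    zb, xb, yb = sum(z) / n, sum(x) / n, sum(y) / n
    szz = sum((a - zb) ** 2 for a in z)
    sxx = sum((b - xb) ** 2 for b in x)
    syy = sum((c - yb) ** 2 for c in y)
    szx = sum((a - zb) * (b - xb) for a, b in zip(z, x))
    szy = sum((a - zb) * (c - yb) for a, c in zip(z, y))
    sxy = sum((b - xb) * (c - yb) for b, c in zip(x, y))

    # Short regression: y on (1, z).
    rss_short = syy - szy ** 2 / szz
    var_hat_unadj = (rss_short / (n - 2)) / szz

    # Long regression: y on (1, z, x), solved from the 2x2 normal equations.
    det = szz * sxx - szx ** 2
    tau_a = (sxx * szy - szx * sxy) / det
    beta_hat = (szz * sxy - szx * szy) / det
    rss_long = sum(((c - yb) - tau_a * (a - zb) - beta_hat * (b - xb)) ** 2
                   for a, b, c in zip(z, x, y))
    r2_zx = szx ** 2 / (szz * sxx)             # sample R^2 between z and x
    var_hat_adj = (rss_long / (n - 3)) / szz / (1 - r2_zx)
    return var_hat_unadj, var_hat_adj, r2_zx

rng = random.Random(5)
n, n1 = 100, 50
x = [rng.gauss(0, 1) for _ in range(n)]
z = [1] * n1 + [0] * (n - n1)
rng.shuffle(z)
y = [1.0 * zi + 2.0 * xi + rng.gauss(0, 1) for zi, xi in zip(z, x)]

var_hat_unadj, var_hat_adj, r2_zx = estimated_variances(z, x, y)
print(var_hat_unadj, var_hat_adj, r2_zx)
```

With a covariate that predicts the outcome well, the drop in the residual variance estimate should dominate the small VIF, so the reported variance of the adjusted estimator is expected to be the smaller of the two.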

4 Connection with randomization inference

4.1 Variances conditional on the error terms

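Another conditioning scheme leads to the following discussion beyond Sections 1 and 2. Conditional on the error terms $\mathcal{E}$, we have fixed potential outcomes and completely randomized $Z$. Inference under this regime is called randomization inference, or design-based inference. The classic results from randomization inference are $E(\hat{\tau} \mid \mathcal{E}) = \tau$ [8], $E(\hat{\tau}_a \mid \mathcal{E}) \approx \tau$ [14, 15], and $\mathrm{var}(\hat{\tau} \mid \mathcal{E}) \geq \mathrm{var}(\hat{\tau}_a \mid \mathcal{E})$ asymptotically [14, 15]. This relative efficiency is coherent with $\mathrm{var}(\hat{\tau}) \geq \mathrm{var}(\hat{\tau}_a)$ asymptotically. Again we can explain the coherence using the following decompositions based on the law of total variance:

$$\mathrm{var}(\hat{\tau}_a) = E\{\mathrm{var}(\hat{\tau}_a \mid \mathcal{E})\} + \mathrm{var}\{E(\hat{\tau}_a \mid \mathcal{E})\} \approx E\{\mathrm{var}(\hat{\tau}_a \mid \mathcal{E})\} + \mathrm{var}(\tau) = E\{\mathrm{var}(\hat{\tau}_a \mid \mathcal{E})\}$$

and

$$\mathrm{var}(\hat{\tau}) = E\{\mathrm{var}(\hat{\tau} \mid \mathcal{E})\} + \mathrm{var}\{E(\hat{\tau} \mid \mathcal{E})\} = E\{\mathrm{var}(\hat{\tau} \mid \mathcal{E})\} + \mathrm{var}(\tau) = E\{\mathrm{var}(\hat{\tau} \mid \mathcal{E})\},$$

where both $\hat{\tau}_a$ and $\hat{\tau}$ are unbiased for $\tau$ at least asymptotically. These results are all in favor of ANCOVA, which improves the estimation efficiency. I summarize the results in Table 1.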
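The randomization-inference regime can be mimicked by drawing the errors once and then reshuffling only the treatment indicators. The Python sketch below does this with a scalar covariate; the parameter values, the Gaussian errors, and the number of replications are assumptions chosen for illustration.

```python
import random
from statistics import mean, pvariance

rng = random.Random(11)
n, n1 = 40, 20
n0 = n - n1
alpha, tau, beta, sigma = 0.0, 1.0, 2.0, 1.0   # illustrative values only

x = [rng.gauss(0, 1) for _ in range(n)]
x_bar = sum(x) / n
x = [xi - x_bar for xi in x]
sxx = sum(xi * xi for xi in x)

# Draw the errors once: the potential outcomes are fixed from here on.
y0 = [alpha + beta * xi + rng.gauss(0, sigma) for xi in x]
y1 = [v + tau for v in y0]

z = [1] * n1 + [0] * n0
z_bar = n1 / n
unadj, adj = [], []
for _ in range(4000):
    rng.shuffle(z)                             # only Z is random
    y = [a if zi == 1 else b for zi, a, b in zip(z, y1, y0)]
    # Difference in means, written as the OLS coefficient in (3).
    unadj.append(sum((zi - z_bar) * yi for zi, yi in zip(z, y)) / (n1 * n0 / n))
    # ANCOVA coefficient via the residual of z on (1, x).
    slope = sum((zi - z_bar) * xi for zi, xi in zip(z, x)) / sxx
    z_res = [(zi - z_bar) - slope * xi for zi, xi in zip(z, x)]
    adj.append(sum(r * yi for r, yi in zip(z_res, y)) / sum(r * r for r in z_res))

var_unadj_given_e = pvariance(unadj)
var_adj_given_e = pvariance(adj)
print(mean(unadj), mean(adj), var_unadj_given_e, var_adj_given_e)
```

Here the only source of randomness is the allocation, and the adjusted estimator is again expected to show the smaller simulated variance, in line with the unconditional comparison.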

Table 1

Comparison under the data generating process in (a)–(e)

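| | mean of $\hat{\tau}_a$ | mean of $\hat{\tau}$ | variance comparison |
| --- | --- | --- | --- |
| unconditional | $E(\hat{\tau}_a) = \tau$ | $E(\hat{\tau}) = \tau$ | $\mathrm{var}(\hat{\tau}_a) \leq \mathrm{var}(\hat{\tau})$ asymptotically |
| conditional on $Z$ | $E(\hat{\tau}_a \mid Z) = \tau$ | $E(\hat{\tau} \mid Z) = \tau + \beta' \hat{\delta}_x$ | $\mathrm{var}(\hat{\tau}_a \mid Z) \geq \mathrm{var}(\hat{\tau} \mid Z)$ |
| conditional on $\mathcal{E}$ | $E(\hat{\tau}_a \mid \mathcal{E}) \approx \tau$ | $E(\hat{\tau} \mid \mathcal{E}) = \tau$ | $\mathrm{var}(\hat{\tau}_a \mid \mathcal{E}) \leq \mathrm{var}(\hat{\tau} \mid \mathcal{E})$ asymptotically |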

4.2 More general potential outcomes model

The data generating process in (a)–(e) assumes constant treatment effect and homoskedastic errors. It yields the standard ANCOVA model (1) or (6). The literature of randomization-based causal inference often deals with more general potential outcomes, without requiring these linear model assumptions [7, 8, 9, 14, 15]. In general cases with possibly misspecified linear models, [14] criticized ANCOVA by showing that it might increase or decrease the efficiency compared to $\hat{\tau}$. As a response to this critique, [15] proposed to use a modified ANCOVA estimator that also includes the interaction term $z_i \times x_i$ in OLS, and showed that this estimator is at least as efficient as $\hat{\tau}$ asymptotically. See [16] and [17] for related discussions.

4.3 A design issue

From the above discussion, $\hat{\delta}_x$ is a key quantity that causes the paradox. If it is zero, then $\hat{\tau}_a = \hat{\tau}$ and the paradox disappears. Complete randomization ensures $E(\hat{\delta}_x) = 0$, but in a particular allocation, $\hat{\delta}_x$ can differ from zero. From the experimental design perspective, $\hat{\delta}_x$ measures the covariate balance across the treatment and control groups. Complete randomization ensures covariate balance on average, but a particular allocation may have covariate imbalance. [18] and [19] proposed to use rerandomization to improve the data generating process (d) by forcing the treatment indicators $Z$ to satisfy

$$\hat{\delta}_x' \, \mathrm{cov}(\hat{\delta}_x)^{-1} \hat{\delta}_x = \hat{\delta}_x' \left( \frac{n}{n_1 n_0} S_x^2 \right)^{-1} \hat{\delta}_x \leq c_0,$$

where $c_0 > 0$ is a predetermined threshold. Under the randomization inference framework, [20] showed that this new experimental design improves the efficiency of $\hat{\tau}$, which is, in fact, close to the efficiency of $\hat{\tau}_a$ for small $c_0 \approx 0$. From our discussion before, rerandomization can also reduce the conditional bias of $\hat{\tau}$ given $Z$ because it forces $\hat{\delta}_x$ to be small for any realized value of $Z$. Therefore, rerandomization can mitigate the paradox through experimental design. See [21] for more unified discussions.
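The rerandomization criterion can be sketched as a simple accept/reject loop. The Python code below is a minimal illustration for a scalar covariate, in which case the quadratic form reduces to $\hat{\delta}_x^2 / \mathrm{var}(\hat{\delta}_x)$; the function name `rerandomize`, the cap `max_draws`, and the threshold value are hypothetical choices, not taken from [18], [19] or [20].

```python
import random

def rerandomize(x, n1, c0, rng, max_draws=100000):
    """Draw complete randomizations until the balance criterion is at most c0."""
    n = len(x)
    n0 = n - n1
    x_bar = sum(x) / n
    xc = [xi - x_bar for xi in x]                    # centered covariate
    s2_x = sum(v * v for v in xc) / (n - 1)          # finite population variance
    var_delta = n / (n1 * n0) * s2_x                 # variance of delta_x under (5)
    z = [1] * n1 + [0] * n0
    for _ in range(max_draws):
        rng.shuffle(z)
        delta_x = (sum(v for zi, v in zip(z, xc) if zi == 1) / n1
                   - sum(v for zi, v in zip(z, xc) if zi == 0) / n0)
        if delta_x ** 2 / var_delta <= c0:           # accept balanced allocations only
            return list(z), delta_x
    raise RuntimeError("no acceptable allocation found; consider a larger c0")

rng = random.Random(3)
x = [rng.gauss(0, 1) for _ in range(30)]
z, delta_x = rerandomize(x, n1=15, c0=0.1, rng=rng)
print(z, delta_x)
```

A smaller `c0` forces tighter balance at the price of more rejected draws, which is the trade-off behind choosing the threshold in practice.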

5 Final remarks

I have shown that the seemingly paradoxical results of VIF and ANCOVA are due to different statistical assumptions. The key difference is whether or not statistical inference is conditional on the treatment indicators $Z$. Conditioning on $Z$, the unadjusted estimator has a smaller variance but a larger bias. Averaging over $Z$, both unadjusted and adjusted estimators are consistent for $\tau$, but the variance of the adjusted estimator is asymptotically no larger than that of the unadjusted estimator. In randomized experiments, I recommend using ANCOVA under a constant treatment effect model or its modified version for general settings [15, 21].

I end this note with two minor technical issues. First, I assume that the $x_i$'s are fixed throughout the paper. With random covariates, we can condition on them to obtain the same results. The key in the discussion is whether or not to condition on $Z$. Second, if the $z_i$'s are IID Bernoulli random variables as in a Bernoulli experiment, we can condition on $(n_1, n_0)$ to reduce the discussion to complete randomization.

Acknowledgement

The author thanks Ugur Yildirim for raising this question in his class of “Stat 230A Linear Models” at the University of California Berkeley, Luke Miratrix for the inspiring discussion of estimated variances in Section 3.3, and Anqi Zhao, Xinran Li, Liyun Chen, Zhichao Jiang, Jason Wu, and Yuting Ye for helpful suggestions. A reviewer made many constructive comments which helped to improve the paper significantly. This research was partially supported by the U. S. National Science Foundation (grant # 1945136).

Conflict of Interests: Prof. Peng Ding is a member of the Editorial Board of the Journal of Causal Inference but had no involvement in the final decision.

References

[1] P. Ding. The Frisch–Waugh–Lovell theorem for standard errors. Statistics and Probability Letters, 168:108945, 2021. doi:10.1016/j.spl.2020.108945

[2] G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning. New York: Springer, 2013. doi:10.1007/978-1-4614-7138-7

[3] J. J. Faraway. Linear Models with R. Boca Raton: Chapman and Hall/CRC, 2016. doi:10.1201/b17144

[4] J. Fox. Applied Regression Analysis and Generalized Linear Models. Newbury Park, CA: Sage Publications, 2015.

[5] A. Agresti. Foundations of Linear and Generalized Linear Models. New York: John Wiley & Sons, 2015.

[6] R. A. Fisher. Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd, 1st edition, 1925.

[7] X. Li and P. Ding. General forms of finite population central limit theorems with applications to causal inference. Journal of the American Statistical Association, 112:1759–1769, 2017. doi:10.1080/01621459.2017.1295865

[8] J. Neyman. On the application of probability theory to agricultural experiments: Essay on principles, Section 9. Masters Thesis. Portions translated into English by D. Dabrowska and T. Speed (1990). Statistical Science, 5:465–472, 1923. doi:10.1214/ss/1177012031

[9] G. W. Imbens and D. B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. New York: Cambridge University Press, 2015. doi:10.1017/CBO9781139025751

[10] O. Kempthorne. The Design and Analysis of Experiments. New York: Wiley, 1952. doi:10.1097/00010694-195205000-00012

[11] K. Hinkelmann and O. Kempthorne. Design and Analysis of Experiments, Volume 1, Introduction to Experimental Design, 2nd Edition. New York: John Wiley & Sons, 2007. doi:10.1002/9780470191750

[12] D. R. Cox and N. Reid. The Theory of the Design of Experiments. Boca Raton: Chapman and Hall/CRC, 2000. doi:10.1201/9781420035834

[13] D. A. Freedman. A note on screening regression equations. The American Statistician, 37:152–155, 1983. doi:10.1080/00031305.1983.10482729

[14] D. A. Freedman. On regression adjustments to experimental data. Advances in Applied Mathematics, 40:180–193, 2008. doi:10.1016/j.aam.2006.12.003

[15] W. Lin. Agnostic notes on regression adjustments to experimental data: Reexamining Freedman's critique. Annals of Applied Statistics, 7:295–318, 2013. doi:10.1214/12-AOAS583

[16] A. A. Tsiatis, M. Davidian, M. Zhang, and X. Lu. Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: A principled yet flexible approach. Statistics in Medicine, 27:4658–4677, 2008. doi:10.1002/sim.3113

[17] A. Negi and J. M. Wooldridge. Revisiting regression adjustment in experiments with heterogeneous treatment effects. Econometric Reviews, in press, 2020. doi:10.1080/07474938.2020.1824732

[18] D. R. Cox. Randomization and concomitant variables in the design of experiments. In P. R. Krishnaiah G. Kallianpur and J. K. Ghosh, editors, Statistics and Probability: Essays in Honor of C. R. Rao, pages 197–202. North-Holland, Amsterdam, 1982.

[19] K. L. Morgan and D. B. Rubin. Rerandomization to improve covariate balance in experiments. The Annals of Statistics, 40:1263–1282, 2012. doi:10.1214/12-AOS1008

[20] X. Li, P. Ding, and D. B. Rubin. Asymptotic theory of rerandomization in treatment-control experiments. Proceedings of the National Academy of Sciences of the United States of America, 115:9157–9162, 2018. doi:10.1073/pnas.1808191115

[21] X. Li and P. Ding. Rerandomization and regression adjustment. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82:241–268, 2020. doi:10.1111/rssb.12353

Received: 2019-08-18

Accepted: 2021-02-05

Published Online: 2021-03-23

This work is licensed under the Creative Commons Attribution 4.0 International License.
