
Replica analysis of overfitting in regression models for time to event data: the impact of censoring


Published 11 March 2024 © 2024 The Author(s). Published by IOP Publishing Ltd
Bayesian Statistics for Complex Systems. Citation: E Massa et al 2024 J. Phys. A: Math. Theor. 57 125003. DOI: 10.1088/1751-8121/ad2e40


Abstract

We use statistical mechanics techniques, viz. the replica method, to model the effect of censoring on overfitting in Cox's proportional hazards model, the dominant regression method for time-to-event data. In the overfitting regime, Maximum Likelihood (ML) parameter estimators are known to be biased already for small values of the ratio of the number of covariates over the number of samples. The inclusion of censoring was avoided in previous overfitting analyses for mathematical convenience, but is vital to make any theory applicable to real-world medical data, where censoring is ubiquitous. Upon constructing efficient algorithms for solving the new (and more complex) Replica Symmetric (RS) equations and comparing the solutions with numerical simulation data, we find excellent agreement, even for large censoring rates. We then address the practical problem of using the theory to correct the biased ML estimators without knowledge of the data-generating distribution. This is achieved via a novel numerical algorithm that self-consistently approximates all relevant parameters of the data generating distribution while simultaneously solving the RS equations. We investigate numerically the statistics of the corrected estimators, and show that the proposed new algorithm indeed succeeds in removing the bias of the ML estimators, for both the association parameters and for the cumulative hazard.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Inference relies on the foundations provided by classical statistical theory, which was developed for use in settings where the number p of covariates is small compared to the number n of observations [1, 2]. The standard Maximum Likelihood (ML) method for estimating model parameters indeed fails in the high-dimensional regime where $p = \mathcal{O}(n)$ [3–8], due to overfitting. Overfitting is the phenomenon that data noise is misinterpreted as signal, leading to biased parameter estimators with large sample-to-sample fluctuations. An overfitted model will predict outcomes for training examples well, but will fail in predicting outcomes for new data. Early strategies to mitigate overfitting include leaving out covariates (with the risk of overlooking relevant predictive information), penalization (equivalent to adding a parameter prior in Bayesian language), and shrinking parameter estimates after model fitting. The penalization weight or shrinking factor is typically estimated via bootstrapping or cross-validation, which forces one to sacrifice some of the training data; this makes the overfitting even worse. Moreover, penalization and shrinking may at best repair the overfitting-induced bias in the length of the parameter vector, but not the bias in its direction (which will appear as soon as the covariates are correlated [9]).

In order to use existing regression models also in the overfitting regime, it is vital to correctly quantify and undo the effects of overfitting, both for inference and for prediction purposes. As a consequence, in recent years we have seen increased research efforts aimed at extending the classical inference theory to the so-called proportional asymptotic regime, characterized by taking the limit $n\to\infty$ while keeping the ratio $\zeta = p/n$ fixed. Several mathematical methods from the domains of the statistical physics of disordered systems [5, 8–10], computer science [7, 11] and statistics [3, 4, 6] have by now been applied successfully in this latter regime to model the statistics of ML and Maximum A Posteriori Probability (MAP) estimators.

The condition $p\ll n$ for ML/MAP inference to be used safely is especially problematic in post-genome medicine: we can now routinely measure very many variables per patient and want to use these to develop personalized therapies, but overfitting prevents us from doing so. For time-to-event data, the most common type in medicine, the main regression tools are variations on the model of Cox [12]. Overfitting in this model was analysed in [5, 13] as a statistical mechanics problem, via the replica method [14]. A later study [11] used the Convex Gaussian Min-max theorem [7], with the logarithm of Cox's partial likelihood as utility function, to study overfitting in the association parameters. The latter route avoids the use of replicas, but unlike [5, 13], cannot model overfitting in the base hazard rate.

Unlike some applied statistical studies on the subject, see e.g. [15, 16], all theoretical studies of overfitting based on the replica or cavity approach have so far assumed for simplicity that the data were not subject to censoring. Censoring means that samples are lost to observation prior to events occurring. It is ubiquitous in medicine, since medical studies are always of finite duration and also since patients often fail to return to hospital appointments for unknown reasons. Before the modern overfitting analysis methods for the proportional asymptotic regime, and their corresponding overfitting bias decontamination formulae, can be applied in the real world, it is hence vital that the theory of [5, 13] is extended to include censoring. That is the first aim of this paper.

We generalise the theory of [5, 13] by allowing for arbitrary types of non-informative censoring, carry out a replica analysis in the regime $p,n\to \infty$ with fixed $\zeta = p/n$, and derive the extended RS equations. Methodologically we also improve upon the approach in [5, 13] by: (i) using a more compact form of the replica approach, (ii) avoiding the variational approximation for the base hazard rate, and (iii) constructing novel and more powerful numerical algorithms for solving and inverting the RS equations, including in cases where one does not know anything about the data generating distribution, resulting in precise algorithms for decontaminating estimators of association parameters and the integrated base hazard for overfitting-induced bias.

This paper is organized as follows. In section 2 we define our notation and describe briefly the Cox model. We explain the results obtained via the replica method (whose derivation is relegated to an appendix) and the physical interpretation of the RS order parameters in section 3, and show in section 4 how this interpretation inspires an efficient numerical method for solving the RS equations. In section 5 we test the predictions of the theory against numerical simulations, and find perfect agreement. We construct in section 6 a new algorithm for simultaneously inferring relevant characteristics of the data generating distribution and solving the RS equations, leading to a realistic and practical tool for correcting the biased ML estimates of association parameters and the nuisance parameters (i.e. the integrated base hazard) in the overfitting regime, even in the presence of censoring. We conclude in section 7 with a discussion of our results and future research.

2. Definitions

In time to event data, each observation ideally reports the time T at which the subject experiences the event under investigation and the covariate vector $\boldsymbol{z} = (z_{1},\dots,z_{p})\in \mathbb{R}^p$, i.e. the list of characteristics of the subject measured at time zero. All covariate vectors are assumed to have been drawn independently from some distribution $p(\boldsymbol{z})$. Since any non-zero average will drop out of our formulae, we can always take $p(\boldsymbol{z})$ to have average zero. However, in virtually all real-world time-to-event data sets observations are censored, i.e. for some subjects we have a missing or partial observation of their event times. For instance, we might only know that an event occurred after or before a certain time point, or within a specific interval (called right, left and interval censoring, respectively). In what follows we focus on right censoring, the most common form of censoring in medical applications. In practice this amounts to assuming that when collecting data we actually observe

Equation (1)

The random variable C models the censoring time, drawn randomly from an a priori unknown distribution $p_c(x)$, and we are given a binary variable that indicates whether the subject has at time t experienced the event ($\Delta = 1$) or has been censored ($\Delta = 0$):

Equation (2)

Note that we can always write $p_{c}(x) = \lambda_c(x)\mathrm{e}^{-\Lambda_c(x)}$, where $\lambda_c(x)$ is the censoring rate and $\Lambda_c(x) = \int_0^x\!\mathrm{d} x^{\prime}~\lambda_c(x^{\prime})$. The Cox semi-parametric model [12] assumes that the time at which a subject experiences the event is distributed according to

Equation (3)

where $\boldsymbol{\beta}_0 \in \mathbb{R}^p$, $\lambda_0(x)$ is the true (non-negative) hazard rate, and $\Lambda_0(x) = \int_0^x\!\mathrm{d} x^{\prime}~\lambda_0(x^{\prime})$ is the true cumulative hazard. The censoring times C and the event times T are assumed to be statistically independent, i.e. censoring time is said to be non-informative, and the joint distribution of observed times $t\unicode{x2A7E} 0$ and event type indicators $\Delta\in\{0,1\}$, given covariates z , can then be written as

Equation (4)

The parameters one seeks to infer are the true vector $\boldsymbol{\beta}_0$ and the true function $\lambda_0(t)$. It follows from (4) upon computing the log-likelihood density for a data-set of i.i.d. observations $\mathcal{D} = \{(t_i,\Delta_i,\boldsymbol{z}_i)\}_{i = 1}^n$, that ML inference will give the estimators

Equation (5)

Equation (6)

The problem (5) can be reduced by first maximizing over the hazard rate. Setting the functional derivative with respect to λ to zero gives an equation that can be solved for $\lambda(t)$, leading to the so-called Breslow estimator [12]:

Equation (7)

where $\theta(x)$ denotes the step-function ($\theta(x\gt0) = 1$ and $\theta(x\lt0) = 0$). Substituting (7) into (6) and disregarding terms that are independent of β we obtain

Equation (8)

Equation (9)

The right-hand side of (9) is the logarithm of the famous Cox partial likelihood [12].
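
For readers who prefer code to formulae, the following is a minimal numpy sketch (not the authors' implementation) of the two objects just introduced: the negative log partial likelihood of (8) and (9), whose minimizer is $\hat{\boldsymbol{\beta}}_\textrm{ML}$, and the Breslow estimator (7) of the cumulative hazard. The function names are ours, and ties and numerical stability are handled naively.

```python
import numpy as np

def neg_log_partial_likelihood(beta, t, delta, Z):
    """Minus the log of Cox's partial likelihood (9), up to beta-independent terms.

    t: (n,) observed times, delta: (n,) event indicators, Z: (n, p) covariates.
    Ties are handled naively; this is an illustrative sketch only.
    """
    eta = Z @ beta                                   # linear predictors beta . z_i
    at_risk = (t[None, :] >= t[:, None])             # at_risk[i, j] = 1 if subject j is still at risk at t_i
    log_risk = np.log(at_risk.astype(float) @ np.exp(eta))
    return -np.sum(delta * (eta - log_risk))

def breslow_cumulative_hazard(beta, t, delta, Z, times):
    """Breslow estimator (7) of the cumulative hazard, evaluated at the given times."""
    eta = Z @ beta
    at_risk = (t[None, :] >= t[:, None])
    denom = at_risk.astype(float) @ np.exp(eta)      # sum_j theta(t_j - t_i) exp(beta . z_j)
    increments = delta / denom                       # jump of the 'staircase' at each event time t_i
    return np.array([np.sum(increments[t <= s]) for s in times])
```

In practice the first function would be handed to any smooth convex optimizer to obtain $\hat{\boldsymbol{\beta}}_\textrm{ML}$; the 'staircase' character of the Breslow estimator, referred to in section 5, is visible in the cumulative sum over event times.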

3. Typical behaviour of the Maximum (Partial) Likelihood estimator

Computing the estimator $\hat{\boldsymbol{\beta}}_\textrm{ML}$ in (8) is equivalent to finding the ground state of a fictitious physical system with Hamiltonian $\mathcal{H}_n(\boldsymbol{\beta}|\mathcal{D}) = - L_n(\boldsymbol{\beta}|\mathcal{D})$, in which the association parameters β act as the degrees of freedom, and the data-set $\mathcal{D} $ plays the role of quenched disorder. Statistical physics thus allows us to compute the average over the disorder (i.e. the data $\mathcal{D}$) of the log-partial likelihood, giving the typical behaviour of the ML estimator, in the proportional asymptotic regime as

Equation (10)

In fact, for reasons that will become clear, we insert a constant regularization factor (whose impact will be zero due to the limit $\gamma\to\infty$), and replace (10) by

Equation (11)

The standard procedure to carry out the disorder average in parametric regression models via the replica method (which goes back to [17]) builds on the assumption that the log-likelihood is a sum of n independent terms, each depending on a single observation $(t_i,\Delta_i,\boldsymbol{z}_i)$. For the present Hamiltonian $\mathcal{H}_n(\boldsymbol{\beta}|\mathcal{D})$ this is not the case, so we need an intermediate step. We introduce the empirical distribution

Equation (12)

and observe that we can write the Hamiltonian (i.e. minus the log-partial likelihood) as the following functional:

Equation (13)

Equation (14)

We can now write (11) in a form where the disorder average effectively is computed over the density of states associated with (12). Upon introducing suitable delta-functionals one then achieves factorisation of the disorder average over the samples i in the data set $\mathcal{D}$. The full replica derivation is given in appendix A for completeness, and gives upon making the so-called replica-symmetric (RS) ansatz:

Equation (15)

with

Equation (16)

Equation (17)

Equation (18)

and

Equation (19)

Here we used the standard short-hand $\textrm{D}y = (2\pi)^{-\frac{1}{2}}\mathrm{e}^{-\frac{1}{2}y^2}\mathrm{d} y$, and W(x) is the Lambert W-function [18], defined by the equation $W(x)\text{exp}[W(x)] = x$ for all x. The only condition on $p(\boldsymbol{z})$ needed in the derivation of the above RS equations is that for $p\to\infty$ the true and inferred linear predictors $\boldsymbol{\beta}\!\cdot\!\boldsymbol{z}$ acquire Gaussian statistics (see footnote 4). The extremum $(u_\star,v_\star,w_\star)$ in (15) is found to satisfy the so-called RS equations:

Equation (20)

Equation (21)

Equation (22)

These can be written in alternative forms, for instance by using identities such as $W(u^2\Lambda(t)\mathrm{e}^{\Delta u^2+vz+wy}) = u^2\Lambda(t)\mathrm{e}^{\xi(t,\Delta,y,z)} = \Delta u^2\!+\!vz\!+\!wy-\xi(t,\Delta,y,z)$ and $W^{\prime}(x) = W(x)/\{x[1\!+\!W(x)]\}$, and via integration by parts over z in (20), giving

Equation (23)

Equation (24)

Equation (25)

For the distribution ${\mathcal P}(t,\Delta,h)$ and the functions $\Lambda(t)$ and ${\mathcal S}(t)$ we find in RS ansatz the following expressions (see appendix A):

Equation (26)

Equation (27)

Equation (28)

We note that an alternative way to derive equations (20)–(22) (and a corresponding equation for $\Lambda(t)$) would have been to make appropriate choices such as $y\to (t,\Delta)$ and $\theta\to\{\lambda_0(t),\lambda_C(t)\}$ in the RS formulae of [9]. However, that alternative route would not have generated the powerful expression (26) found with the present approach. By retracing its derivation and making simple adaptations, (26) can be generalized to

Equation (29)

where

Equation (30)

(with the standard convention that $\zeta = p/n$ is kept fixed in the limits $n,p\to\infty$). This result shows very transparently the impact of overfitting on the inferred risk factors $\boldsymbol{\beta}\cdot\boldsymbol{z}_i$, relative to their true values $\boldsymbol{\beta}_0\cdot\boldsymbol{z}_i$.

At this point it is helpful to write $w = S\kappa$. The interpretation of the values $(\kappa_{\star},v_{\star})$ as solved from the RS equations above can be inferred directly from the results in [9]:

Equation (31)

Equation (32)

with $p/n = \zeta$ fixed as $p,n\to\infty$. The importance of $(\kappa_\star,v_\star)$ derives from the facts that the asymptotic distribution of $\hat{\boldsymbol{\beta}}_n$ depends only on these two quantities [10], and that the overfitting-induced inference bias and noise can be expressed in terms of $\kappa_\star$ and $v_\star$, respectively. The function $\Lambda(t)$ has the following interpretation:

Equation (33)

with $p/n = \zeta$ fixed, and with the Breslow estimator as given in (7).

4. Numerical solution of RS equations

For fully parametric generalized linear models (GLMs) in the overfitting regime the RS equations can always be solved numerically in a relatively straightforward iterative manner by evaluating the relevant integrals via Gaussian quadratures. Here, for the Cox model with censoring, we have the added complication of the functional order parameter $\Lambda(t)$, which suggests we take a different approach inspired by population dynamics algorithms and our interpretation (26)–(28). For any given values of the scalar order parameters $(u,v,w)$, and a given value of $S = |\boldsymbol{A}^{1/2}\boldsymbol{\beta}_0|$, we can generate $m\gg 1$ samples from the distribution

Equation (34)

resulting in $\{(t_\ell,\Delta_\ell,y_\ell,z_\ell)\}_{\ell = 1}^m$. We then define for each $(u,v,w,\Lambda)$ the quantities $\xi_\ell(u,v,w,\Lambda) = \xi(t_\ell,\Delta_\ell,y_\ell,z_\ell|u,v,w,\Lambda)$, computed via (17), and the estimator $\tilde{\Lambda}_m(t)$:

Equation (35)

For sufficiently large population size m, the iterative algorithm defined by the mapping $\Lambda(.)\to \Lambda^{\prime}(.) = \tilde{\Lambda}_{m}(.|u,v,w,\Lambda)$ will now by construction have as its fixed-point the solution of (26)–(28). Furthermore, it is relatively easy to invert numerically equation (24) and write u for fixed $(v,w,\Lambda)$ as a function of ζ, for instance via Newton's method, so that ζ can always be chosen as the independent parameter of our RS equations (as it is always known). We can then solve the following surrogate set of RS equations:

Equation (36)

Equation (37)

Equation (38)

Equation (39)

where now

Equation (40)

(e.g. via damped fixed point iteration). When simulating the data, $\Lambda_0(.)$ is known and, empirically, we find that replacing equation (38) with its equivalent version, obtained by Gaussian integration by parts of (25),

Equation (41)

leads to much faster convergence with smaller population sizes, hence the latter is adopted instead of (38).
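
As an illustration of the two ingredients of this scheme that are fully fixed by the formulae quoted above, the following is a schematic sketch (not the authors' code) of how the quantities $\xi_\ell$ can be evaluated with an off-the-shelf Lambert-W routine, via the identity $W(u^2\Lambda(t)\mathrm{e}^{\Delta u^2+vz+wy}) = \Delta u^2\!+\!vz\!+\!wy-\xi(t,\Delta,y,z)$ of section 3, together with a generic damped fixed-point driver. The sampling step (34) and the population estimator (35) are not reproduced here, so they would have to be supplied from the corresponding equations; all function names below are ours.

```python
import numpy as np
from scipy.special import lambertw

def xi_values(t, Delta, y, z, u, v, w, Lambda):
    """xi(t, Delta, y, z) of (17), via the Lambert-W identity quoted in section 3:
    xi = Delta*u^2 + v*z + w*y - W(u^2 * Lambda(t) * exp(Delta*u^2 + v*z + w*y))."""
    a = Delta * u**2 + v * z + w * y
    return a - lambertw(u**2 * Lambda(t) * np.exp(a)).real

def damped_fixed_point(update, x0, damping=0.5, tol=1e-8, max_iter=1000):
    """Generic damped iteration x <- (1-damping)*x + damping*update(x),
    usable both for the map Lambda -> Lambda' and for the scalar order-parameter equations."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = (1.0 - damping) * x + damping * np.asarray(update(x), dtype=float)
        if np.max(np.abs(x_new - x)) < tol:
            break
        x = x_new
    return x_new
```

With the population $\{(t_\ell,\Delta_\ell,y_\ell,z_\ell)\}$ represented as numpy arrays, and $\Lambda$ represented either as a callable or by its values on the population's time points, the map $\Lambda\to\tilde{\Lambda}_m$ and the surrogate equations (36)–(39) can then be iterated with such a damped driver until convergence.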

5. Simulations tests of RS theory

When some observations are censored, it is known that in the proportional asymptotic regime the ML estimator $\hat{\boldsymbol{\beta}}_\textrm{ML}$ does not always exist [20], and a recent paper [11] established that, asymptotically and under the assumption of Gaussian covariates, $\hat{\boldsymbol{\beta}}_{n}$ undergoes a sharp phase transition at some critical value $\zeta_c$. Here we focus on the region $\zeta\lt\zeta_c$ where the ML estimator does exist. Since one obviously cannot solve the RS equations for all possible choices of the hazard function $\lambda_0(.)$, the censoring rate $\lambda_c(.)$ and the true association amplitude $S = |\boldsymbol{A}^{1/2}\boldsymbol{\beta}_0|$, we have in our simulations chosen a censoring distribution $p_c(t)$ that is uniform between 0 and $t_\textrm{max}$, reflecting the realistic scenario of a clinical trial of duration $t_\textrm{max}$ where patients are recruited at a constant rate for the trial duration. Different choices of S, $t_\textrm{max}$ and true hazard rate $\lambda_0(.)$ will lead to different expected fractions $\langle \Delta\rangle$ of true (non-censored) events:

Equation (42)

To test the predictions of our RS equations we generated 500 survival data-sets $\mathcal{D}$ of size $n = |\mathcal{D}| = 400$, with uniformly distributed censoring on the interval $[0,t_\textrm{max}]$, using cumulative hazards of the Log-logistic $\Lambda_0(t) = \text{log}(1+t^2)$ and Weibull $\Lambda_0(t) = \frac{1}{2}t^2$ form. We used uncorrelated unit-variance Gaussian covariates, and true association vector $\boldsymbol{\beta}_0 = \hat{\mathbf{e}}_1$ (i.e. the first unit vector), hence S = 1. We then applied Cox's ML regression protocol to infer the associations and the cumulative hazard (via Breslow's formula), giving the estimators $\hat{\boldsymbol{\beta}}_n$ and $\hat{\Lambda}_n(.)$. From $\hat{\boldsymbol{\beta}}_n$ we subsequently computed the overfitting markers $(\hat{\kappa}_n,\hat{v}_n)$ in (31) and (32).
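
A minimal sketch (not the authors' code) of one such data-generating run is given below; the only ingredient not stated explicitly in the text is the standard inverse-transform sampler for proportional-hazards event times, i.e. solving $\Lambda_0(T)\mathrm{e}^{\boldsymbol{\beta}_0\cdot\boldsymbol{z}} = E$ with $E$ a unit-mean exponential random variable. The function name is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_cox_data(n, p, t_max, inv_Lambda0, rng):
    """Generate one right-censored data set D = {(t_i, Delta_i, z_i)} as in section 5.

    Covariates are i.i.d. standard Gaussian, beta_0 is the first unit vector (so S = 1),
    event times follow the Cox model via inverse-transform sampling of the survival
    function exp(-Lambda_0(T) e^{beta_0 . z}), and censoring is uniform on [0, t_max].
    """
    Z = rng.standard_normal((n, p))
    beta0 = np.zeros(p); beta0[0] = 1.0
    lp = Z @ beta0                                   # true linear predictors beta_0 . z_i
    E = rng.exponential(size=n)                      # -log(U), U ~ Uniform(0,1)
    T = inv_Lambda0(E * np.exp(-lp))                 # solve Lambda_0(T) e^{lp} = E
    C = rng.uniform(0.0, t_max, size=n)              # uniform censoring times
    t = np.minimum(T, C)
    Delta = (T <= C).astype(int)
    return t, Delta, Z

# Log-logistic cumulative hazard Lambda_0(t) = log(1 + t^2), inverse sqrt(e^x - 1)
t, Delta, Z = simulate_cox_data(n=400, p=100, t_max=4.0,
                                inv_Lambda0=lambda x: np.sqrt(np.expm1(x)), rng=rng)
print("fraction of non-censored events:", Delta.mean())   # close to <Delta> ~ 0.6 for t_max = 4
```

The Weibull case $\Lambda_0(t) = \frac{1}{2}t^2$ is obtained by passing inv_Lambda0=lambda x: np.sqrt(2.0 * x) instead.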

Below we show results obtained when $t_\textrm{max} = 4$, corresponding to $\langle \Delta\rangle\approx 0.6$, implying that around 40% of the events are censored, for both choices of $\Lambda_0(.)$. This relatively large censored fraction was deliberately chosen to confirm that the new theory is capable of predicting the behaviour of the various estimators, despite the complication posed by censoring. Further evidence in the form of additional simulations for different values of $t_\textrm{max}$ and hence of $\langle\Delta\rangle$ is presented in appendix C. There we considered $t_\textrm{max} = 2$ and $t_\textrm{max} = 6$, corresponding to $\langle \Delta\rangle\approx 40\%$ and 70%, respectively. In real medical data sets the fraction of censored events depends heavily on the kind of risk under study and on the duration of the clinical study. For example, a fast acting risk will lead to small numbers of censored individuals, since most will experience the primary event before the end of the study.

In figure 1 we plot the inferred cumulative hazards $\hat{\Lambda}_n(.)$ versus the true cumulative hazards $\Lambda_0(.)$, which can be done unambiguously since both are always non-decreasing functions of time. By the nature of the Breslow estimator (7), these curves take the form of 'staircase' functions. We also show in the same panels the RS prediction of the relation between the two quantities, as yellow curves. We conclude from figure 1 that in all cases the solution of the RS equations correctly predicts the typical values of $\hat{\Lambda}_n(.)$. In figure 2 we compare the replica predictions $(\kappa_\star,v_\star)$ (solid lines) with the corresponding measurements $(\hat{\kappa}_n,\hat{v}_n)$ (markers with error bars), for different values of $\zeta = p/n\in[0,0.5]$. Markers give the averages over the 500 data set realizations; error bars indicate the standard deviations. Again we observe excellent agreement between the RS theory and the simulations, in spite of the relatively small sample size n = 400. The latter feature is important for applications of the theory, since real data sets in survival analysis indeed often have sizes of that order.


Figure 1. Comparison of theoretical predictions and simulation data, for Cox's survival analysis model with uniform censoring on the interval $[0,4]$, giving around 40% censoring events, and data set size n = 400. Top row: p = 100 (so ζ = 0.25); bottom row: p = 200 (so ζ = 0.5). Panels (a), (c): cumulative hazard $\Lambda_0(t) = \text{log}(1\!+\!t^2)$. Panels (b), (d): cumulative hazard $\Lambda_0(t) = \frac{1}{2}t^2$. In all panels we plot with black lines the Breslow estimator $\hat{\Lambda}_n(.)$ versus the true cumulative hazard $\Lambda_0(.)$, for 500 independent simulations with distinct data realizations. Yellow curves show the predictions of the RS theory (solved with populations of size $m = 10^5$), and red curves indicate the diagonal (that would have been found for perfect regression). Further details are given in the main text.


Figure 2. Comparison of theoretical predictions and simulation data, for Cox's survival analysis model with uniform censoring on the interval $[0,4]$, giving around 40% censoring events, and data set size n = 400. Top row: overfitting-induced bias factor $\hat{\kappa}_n$ versus ζ; bottom row: overfitting-induced noise amplitude $\hat{v}_n$ versus ζ. Panels (a), (c): data for cumulative hazard $\Lambda_0(t) = \text{log}(1+t^2)$. Panels (b), (d): data for cumulative hazard $\Lambda_0(t) = \frac{1}{2}t^2$. In all panels we plot with markers and error bars the simulation results, averaged over 500 independent simulations with distinct data realizations, and with solid lines the predictions $\kappa_\star$ and $v_\star$ of the RS theory (solved with populations of size $m = 10^5$).


6. De-biasing protocols for overfitted ML estimators

Given the agreement between theory and simulations, we now explore the potential of using our theory as a systematic tool with which to decontaminate ML estimators and build asymptotically unbiased estimators $\tilde{\boldsymbol{\beta}}_n$ and $\tilde{\Lambda}_n(.)$. Two obstacles appear initially to hinder direct application of our RS equations for bias decontamination. First, our equations involve the amplitude S (which is not directly accessible). Second, bias decontamination requires inverting the complex relation between the inferred and true integrated hazards. We have already constructed an algorithm for computing $\hat{\Lambda}_n(.)$ in terms of $\Lambda_0(.)$, but now we need to compute $\Lambda_0(.)$ given $\hat{\Lambda}_n(.)$. In this section we remove both obstacles, and show that the RS theory can indeed be used for creating accurate unbiased estimators in the overfitting regime.

6.1. Derivation of a practical algorithm for de-biasing

Given the assumptions of our theory, we may write the marginal outcome probability $p(t,\Delta) = \int\!\mathrm{d}\boldsymbol{z}~p(\boldsymbol{z})p(t,\Delta|\boldsymbol{\beta}_0\!\cdot\!\boldsymbol{z})$ as

Equation (43)

Hence the log-marginal likelihood density for n i.i.d. observations $\{(t_i,\Delta_i)\}$ reads

Equation (44)

where we introduced the function

Equation (45)

We next compute the ML estimators for $(\lambda,\lambda_c)$ by maximization of (44). Taking the functional derivative with respect to $\lambda(.)$ gives, after standard manipulations, the following estimator for the hazard rate:

Equation (46)

Upon integrating both sides over t, and replacing in the right-hand side the (as yet unknown) value of $\Lambda(t_i)$ by the estimator $\tilde{\Lambda}_n(t|S) = \int_0^t\!\mathrm{d} t^{\prime}~ \tilde{\lambda}_n(t^{\prime}|S)$ one then finds the following simple fixed-point equation for $\tilde{\Lambda}_n(.|S)$, that does not involve any parameter that could be susceptible to overfitting:

Equation (47)

The result of solving this fixed-point equation by (damped) iteration is remarkably accurate, as will be shown below. Repeating the same steps for the censoring rate $\lambda_c(.)$ gives the unbiased ML estimator

Equation (48)

One could in principle attempt to determine S in the same way: extremize (44) with respect to S, and solve the resulting equation simultaneously with (47) for $\tilde{\Lambda}(.|S)$ and S. Unfortunately, the resulting equations admit the trivial solution S = 0, which describes the degenerate situation where the outcomes $(t,\Delta)$ are independent of the covariates. Thus we need an independent extra equation to determine the value of S, which must be added to and solved simultaneously with the RS order parameter equations and (47). We first note that (31) and (32) imply

Equation (49)

with $p/n = \zeta$ fixed as $p,n\to\infty$. If the covariate correlation matrix A is known, the above could serve as the desired extra equation. If the covariate correlation matrix is not known, as would generally be the case, we can use the following result, which exploits the fact that in regression the n quantities $\hat{\boldsymbol{\beta}}_\textrm{ML}\!\cdot\!\boldsymbol{z}_i$ are always observable:

Equation (50)

This identity is derived in appendix B. It relies explicitly on the alternative form of the replica calculation followed in the present paper, from which equation (26) followed (a result that could have been, but was not, derived in earlier papers). In contrast to (49), equation (50) can always be used as an additional equation with which to determine S.

The final result is the following algorithm, which seeks to combine optimally the RS theory with the available data $\mathcal{D}$ and the biased ML estimators. After measuring the left-hand side of (50), one solves numerically the four scalar equations (23)–(25) and (50) simultaneously for $(u,v,w,S)$, e.g. via (damped) fixed-point iteration, with the short-hands (18) for the function $\xi(t,\Delta,y,z)$ and (17) for $p(t,\Delta|Sy)$. The censoring rate $\lambda_c(t)$ in (17) is estimated by the time derivative of (48), and, for each value of S, the base hazard $\lambda_0(t)$ by the time derivative of the result of iterating (47) to a fixed-point. The various Gaussian integrals can be done via Gauss-Hermite quadrature. The resulting new and unbiased estimator for the association parameters is, according to (31), then given by $\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}_\textrm{ML}/\kappa_\star$, where $\kappa_\star = w_\star/S$.
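
A schematic sketch of how these steps fit together is given below. The helper gauss_hermite_Dy evaluates Gaussian integrals of the form $\int\!\textrm{D}y\,f(y)$ by Gauss–Hermite quadrature, as mentioned above. The function solve_rs_equations is a hypothetical placeholder for the damped fixed-point solver of the four scalar equations (23)–(25) and (50), whose detailed update rules are not reproduced here; likewise, the identification of the measurable left-hand side of (50) with the empirical variance of the inferred risk factors $\hat{\boldsymbol{\beta}}_\textrm{ML}\!\cdot\!\boldsymbol{z}_i$ is our reading of appendix B and is stated here as an assumption.

```python
import numpy as np

def gauss_hermite_Dy(f, order=64):
    """Evaluate int Dy f(y), with Dy = (2 pi)^(-1/2) exp(-y^2/2) dy, by Gauss-Hermite quadrature.
    Example: gauss_hermite_Dy(lambda y: y**2) returns 1.0 up to quadrature error."""
    x, w = np.polynomial.hermite.hermgauss(order)
    return np.sum(w * f(np.sqrt(2.0) * x)) / np.sqrt(np.pi)

def debias_associations(beta_ml, Z, solve_rs_equations):
    """Skeleton of the de-biasing step of section 6.1 (a sketch, not the authors' implementation).

    beta_ml            : biased ML estimator of the association parameters;
    Z                  : (n, p) matrix of covariate vectors z_i;
    solve_rs_equations : assumed to return (u, v, w, S) solving (23)-(25) and (50),
                         e.g. by damped fixed-point iteration, given the measured
                         left-hand side of (50).
    """
    lhs_50 = np.mean((Z @ beta_ml) ** 2)   # observable risk-factor variance (assumed form of the LHS of (50))
    u, v, w, S = solve_rs_equations(lhs_50)
    kappa = w / S                          # kappa_star = w_star / S, the overfitting-induced bias factor
    return beta_ml / kappa, S              # beta_tilde = beta_ML / kappa_star, plus the inferred signal strength
```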

6.2. Simulation tests of the proposed new estimators

In figure 3 we plot both the uncorrected and the de-biased estimator for the cumulative hazard, for simulations of survival data with n = 400 and around 40% censored events. These data show convincingly that the ML estimator $\hat{\Lambda}_{n}(.)$ (shown in black) is quite biased, with average values as predicted by the RS theory (dashed yellow line), but that the corrected estimator $\tilde{\Lambda}(.)$ (shown in blue) when plotted against the true cumulative hazard is centered around the red dashed line (the diagonal, corresponding to unbiased estimation). This is remarkable, given that our de-biasing algorithm does not require any knowledge of the data generating process, apart from assuming the form of a semi-parametric proportional hazard model (4). As expected, the corrected estimator exhibits a larger variance than the ML estimator (reflecting the general and well-known bias-variance trade-off in inference). The increasing variance of both estimators at large times is simply a finite size effect: already at t = 2.5 the survival probability is below 0.1, so only a few subjects are still alive and able to generate events that may contribute to hazard rate inference. Similarly we show in figures 4–6 the differences between the ML estimators and the new de-biased estimators for the amplitude S (the 'signal strength' of the true associations), and for the nonzero and zero components of the true association vector, respectively. Additional simulations for different values of the censoring rate have been carried out, and the corresponding results can be found in appendix C.


Figure 3. Comparison of uncorrected and corrected integrated hazard rate estimators derived from simulation data, for Cox's survival analysis model with uniform censoring on the interval $[0,4]$, giving around 40% censoring events, and data set size n = 400. In all cases $\boldsymbol{\beta}_0 = \hat{\mathbf{e}}_1$, so S = 1. Top row: p = 120 (so ζ = 0.3); bottom row: p = 160 (so ζ = 0.4). Panels (a), (c): cumulative hazard $\Lambda_0(t) = \text{log}(1\!+\!t^2)$. Panels (b), (d): cumulative hazard $\Lambda_0(t) = \frac{1}{2}t^2$. In all panels we plot with black lines the Breslow estimator $\hat{\Lambda}_{n}(.)$ versus the true cumulative hazard $\Lambda_0(.)$, for 500 independent simulations with distinct data realizations. Yellow curves show the predictions of the RS theory (solved with populations of size $m = 10^5$), and red curves indicate the diagonal (that would have been found for perfect regression). We also show in blue the de-biased estimator $\tilde{\Lambda}(.)$ for the cumulative hazard versus the true cumulative hazard $\Lambda_0(.)$, for the same 500 simulations.


Figure 4. Distributions of the values of the effective signal strength $S = |\boldsymbol{A}^{1/2}\boldsymbol{\beta}_0|$ inferred by our de-biasing algorithm, for the same data as those used in figure 3. In all panels the maximum of the histogram corresponds to the true value S = 1, in spite of the relatively modest data set size (around 240 non-censored events).


Figure 5. Distributions of the values of the non-zero component $\hat{\mathbf{e}}_1\!\cdot\!\boldsymbol{\beta}$ of the association vector, both for the standard ML Cox regression estimator (dark grey) and for the new estimator inferred by our de-biasing algorithm (light grey), for the same data as those used in figure 3. We also show as a vertical dashed line the location of the true value $\hat{\mathbf{e}}_1\!\cdot\!\boldsymbol{\beta}_0 = 1$, and as a solid curve the predicted Gaussian asymptotic distribution of the new estimator, with average 1 and variance $v_{\star}^2/ \kappa^2_{\star}p$. We observe that the new estimator indeed removed successfully the overfitting-induced bias of the ML one, while its distribution exhibits finite size effects.


Figure 6. Distributions of the values of the zero component $\hat{\mathbf{e}}_2\!\cdot\boldsymbol{\beta}$ of the association vector, both for the standard ML Cox regression estimator (dark grey) and for the new estimator inferred by our de-biasing algorithm (light grey), for the same data as those used in figure 3. We also show as a vertical dashed line the location of the true value $\hat{\mathbf{e}}_2\!\cdot\!\boldsymbol{\beta}_0 = 0$, and as a solid curve the predicted Gaussian asymptotic distribution of the new estimator, here with zero average and variance $v_{\star}^2/ \kappa^2_{\star}p$. Here, as expected, already the ML estimator was unbiased, since the theory predicts the overfitting-induced bias to take the form of a multiplicative factor $\kappa_\star$ (which would here just multiply the number zero).


Our simulations support the conclusion that the new decontamination algorithm proposed above indeed leads to virtually unbiased estimators for $\Lambda_0(.)$, S, and $\boldsymbol{\beta}_0$, even for relatively small data sets and in the presence of significant levels of censoring. This is a nontrivial result, since we have only required that there is no model mismatch, i.e. that the data were generated from a Cox model, and that the distribution of the risk factors $\boldsymbol{\beta}_0\cdot\boldsymbol{z}_i$ is asymptotically Gaussian. In particular, the algorithm does not require a priori knowledge of any of the parameters of the data generating process.

7. Conclusion

In this paper we have extended previous analytical studies on overfitting in high dimensional Cox regression for time-to-event data [5, 9, 13], by including censoring. Censoring is a necessary ingredient for any survival analysis theory if its results are to be applied to real medical data (the main application area of survival analysis). While still using the replica method as our main tool, in addition to the inclusion of censoring we have here also chosen a different route to carry out the analysis compared to [5, 9, 13], which enabled us to derive an explicit formula for the inferred distribution of risk factors. This latter formula, in turn, paved the way for the creation of new algorithms for applications of the replica-symmetric (RS) theory, that previously required additional knowledge and/or additional approximations. All our new theoretical results and predictions are validated using numerical simulations.

We consider the main deliverables of the present paper to be the following three. First, we now have an accurate theory with which to understand quantitatively the impact of censoring on overfitting phenomena in the Cox model, so that replica-based overfitting analysis can now be applied to realistic real-world problems in medicine (i.e. to data from clinical studies of finite as opposed to infinite duration). Second, we are now able to solve directly and accurately the RS equations for the inferred cumulative hazard, instead of using the variational approximations proposed in [5, 9, 13]. Thirdly, we have been able to construct from the present theory a practical and accurate algorithm with which to decontaminate inferences for overfitting-induced bias, that does not require knowledge of any of the parameters of the true data generating distribution. The latter algorithm combines in a very effective way the RS equations with those quantities that are measurable in ML regression, and infers the effective signal strength $S = |\boldsymbol{A}^{1/2}\boldsymbol{\beta}_0|$ as a by-product.

The results presented here thus have not only a theoretical but also a very practical value. The availability of tools for de-biasing the important estimators of the Cox model implies that one can now achieve significantly improved outcome predictions in realistic clinical scenarios, compared to what would have been achievable with ML regression alone, either by using more covariates for prediction (i.e. increasing $\zeta = p/n$ by increasing p), or by using fewer patients to do so (i.e. increasing $\zeta = p/n$ by decreasing n), or by allowing for increased levels of censoring without detrimental consequences. To the best of our knowledge, the presently proposed bias decontamination algorithm, that corrects both association parameters and nuisance parameters (i.e. the cumulative hazard in the Cox model), is the first of its kind.

We envisage future work to involve application of the new estimator decontamination algorithm to real (partially censored) survival data, finding an expression or equation for the transition point $\zeta_c$ of the Cox model in the presence of censoring (via the statistical mechanics route), and working out the RS theory upon including ridge penalization in the Cox formulae for survival analysis with censoring. We also envisage generalizing the new decontamination algorithm to more general GLM regression models, where the efficient computation of the signal strength S, or other possible parameters of the true data generating model, and of the nuisance parameters, have in the past always required unwelcome approximations that we expect will now no longer be necessary.

Acknowledgment

A M would like to thank Mansoor Sheikh and Fabian Aguirre Lopez for constructive discussions. This work was partially supported by the Medical Research Council of the United Kingdom (Grant MR/L01257X/1).

Data availability statement

The data that support the findings of this study are available upon reasonable request from the authors.

Appendix A: Replica derivation

We start from formulae (12)–(14) for the Hamiltonian, being minus the partial log-likelihood of the Cox model. The relevant information on the typical estimators $(\hat{\boldsymbol{\beta}},\hat{\lambda})_\textrm{ML}$ follows from the asymptotic disorder-averaged free energy density in the $\gamma\to\infty$ limit, viz. from $f = \lim_{\gamma\to\infty}f(\gamma)$, where

Equation (A.1)

Equation (A.2)

To compute the average of the logarithm we use the replica method, giving

Equation (A.3)

A.1. The replicated free energy density

In order to compute $\big\langle [Z_n(\gamma,\mathcal{D})]^r\big\rangle_{\mathcal{D}}$ we first write $Z_n(\gamma,\mathcal{D})$ in a more convenient form. We introduce the functional delta distribution

Equation (A.4)

in which $\mathcal{D}\hat{\mathcal{P}}$ and $\mathcal{D}\mathcal{P}$ are normalized such that $\int\!\mathcal{D}{\mathcal{P}}~ \delta\big[\mathcal{P}(.) \big] = 1$. We can now write

Equation (A.5)

For integer r we can next do the average over the data $\mathcal{D}$:

Equation (A.6)

in which $\boldsymbol{x} = (x_0,\ldots,x_r)\in \mathbb{R}^{r+1}$, $p(\boldsymbol{z})$ is the distribution from which the n covariate vectors $\boldsymbol{z}_i$ were drawn, $p(t,\Delta|\boldsymbol{\beta}_0\!\cdot\!\boldsymbol{z})$ is defined in (4), and

Equation (A.7)

The integrand in the last line of (A.6) depends on the vectors $\boldsymbol{\beta}_{\alpha}$ only via the linear predictors $x_\alpha = \boldsymbol{\beta}_\alpha\!\cdot\!\boldsymbol{z}$. We assume that the distribution $W(\boldsymbol{x}|\{\boldsymbol{\beta}\})$ of these predictors is Gaussian, either because $p(\boldsymbol{z})$ is Gaussian, or because for $p\to\infty$ the central limit theorem will apply. Since $\int\!\mathrm{d}\boldsymbol{z}~p(\boldsymbol{z})\boldsymbol{z} = \textbf{0}$, we may now write, with $\boldsymbol{C} = \{C_{\alpha\rho}\}$:

Equation (A.8)

Equation (A.9)

To describe the generation or explanation of finite event times, the linear predictors will always have to remain finite. Hence for realistic data sets the matrix elements $C_{\alpha\rho}$ must remain of order O(1) even in the limit $p\to\infty$ (see footnote 5). If we now also insert the integral $1 = \int\!\mathrm{d}\boldsymbol{C}~\delta[\boldsymbol{C}-\boldsymbol{C}[\{\boldsymbol{\beta}\}]]$, and write the δ-function in integral form, we find:

Equation (A.10)

where, using $p = \zeta n$,

Equation (A.11)

with

Equation (A.12)

Equation (A.13)

It now follows, after exchanging the limits $n\to\infty$ and r → 0, that the asymptotic disorder-averaged free energy density (A.3) can be computed via steepest descent:

Equation (A.14)

A.2. Saddle point integration

We first derive the stationary conditions for (A.11) with respect to the functions $\hat{\mathcal{P}}_{\alpha}$ and ${\mathcal{P}}_{\alpha}$, via functional differentiation and using (14). This gives, respectively,

Equation (A.15)

Equation (A.16)

Combination of (7) with (12) enables us to recognize in (A.16) the replicated cumulative hazard function,

Equation (A.17)

Equation (A.18)

with which we may write (A.16) more compactly as

Equation (A.19)

Insertion into (A.15) then simplifies the latter to

Equation (A.20)

Upon using also a modest amount of foresight we transform $\mathrm{i}\hat{\boldsymbol{C}} = \frac{1}{2}\boldsymbol{D}$. We then find that our saddle point problem (A.14) can be written as follows:

Equation (A.21)

where

Equation (A.22)

with

Equation (A.23)

Equation (A.24)

We have not used (A.20) to derive (A.22), only (A.19). Hence in (A.21) the functions $\{{\mathcal P}_\alpha\}$ indeed still have the status of independent variational parameters, and functional differentiation of (A.22) with respect to ${\mathcal P}_\alpha$ must therefore reproduce (A.20).

A.3. The replica symmetry (RS) ansatz

We now make the so-called replica symmetric (RS) ansatz. Replica symmetry should affect only the nonzero replica labels $\alpha = 1\ldots r$, so for the present order parameters C and D the RS ansatz takes the following form:

Equation (A.25)

Equation (A.26)

Equation (A.27)

The RS form of C implies the same for its inverse, where we define $(\boldsymbol{C}^{-1})_{00} = \tilde{\mu}$ and

Equation (A.28)

Equation (A.29)

After some algebra one then finds that

Equation (A.30)

Equation (A.31)

Equation (A.32)

Equation (A.33)

With the RS ansatz we can simplify the functions (A.23) and (A.24). We start with $\tilde{\phi}[\boldsymbol{D}]$, using the short-hand $S^2 = C_{00} = \boldsymbol{\beta}_0\cdot\boldsymbol{A}\boldsymbol{\beta}_0$:

Equation (A.34)

Next we apply the RS ansatz to $\tilde{\varphi}[\boldsymbol{C}]$:

Equation (A.35)

Hence

Equation (A.36)

We then find, using identities such as $\textrm{Tr}(\boldsymbol{D}\boldsymbol{C}) = D_{00}S^2+r(2c_0\hat{m}\!+\!c\hat{q}\!+\!C\hat{\rho})+{\mathcal O}(r^2)$, $\text{log}\textrm{Det}\boldsymbol{C} = \text{log} S^2+r[\text{log}(C\!-\!c)+(c\!-\!c_0^2/S^2)/(C\!-\!c)]+{\mathcal O}(r^2)$ (obtained by diagonalizing the RS form of C ), as well as $\tilde{\mu}S^2 = 1+rc_0^2/S^2(C\!-\!c)+{\mathcal O}(r^2)$, $\tilde{m} = -c_0/S^2(C\!-\!c)+{\mathcal O}(r)$, $\tilde{q} = (c_0^2/S^2\!-\!c)/(C\!-\!c)^2+{\mathcal O}(r)$, and $\tilde{\rho}-\tilde{q} = (C\!-\!c)^{-1}+{\mathcal O}(r)$, that the RS ansatz implies the following for expression (A.22) (modulo irrelevant constants):

Equation (A.37)

We can now extremize $\Psi_\textrm{RS}[\ldots]$ over $(\hat{m},\hat{q},\hat{\rho})$, as demanded by (A.21), giving

Equation (A.38)

Our various formulae take more compact forms upon introducing the three short-hands $u = \sqrt{\gamma(C\!-\!c)}$, $v = \sqrt{c\!-\!c_0^2/S^2}$ and $w = c_0/S$, which we know from earlier studies to have finite $\gamma\to\infty$ limits. We may then write

Equation (A.39)

and

Equation (A.40)

We may hence now write $f_\textrm{RS} = \lim_{\gamma\to\infty} f_\textrm{RS}(\gamma)$ as follows:

Equation (A.41)

with $\mathcal{E}[\mathcal{P}(.)]$ as defined by (14), and

Equation (A.42)

Equation (A.43)

The maximization over x in (A.41) is achieved for $x = u^{-2}\xi(t,\Delta,y,z)$, where

Equation (A.44)

with Lambert's function W(z), the inverse of the function $g(x) = x\mathrm{e}^x$. Note that expression (A.44) can be interpreted in terms of a Moreau envelope, and is in other approaches to overfitting indeed often found via that route [8]. Hence we now arrive at

Equation (A.45)

which can be written in alternative ways, using identities such as $z\mathrm{e}^{-W(z)} = W(z)$.
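
As a quick numerical sanity check of the Lambert-W manipulations used in this appendix (a convenience sketch, using scipy's implementation of W on its principal branch):

```python
import numpy as np
from scipy.special import lambertw

# Check the defining relation W(z) exp(W(z)) = z and the identity z exp(-W(z)) = W(z)
# on a grid of positive arguments (principal branch, real-valued for z >= 0).
z = np.linspace(0.1, 50.0, 7)
W = lambertw(z).real
print(np.allclose(W * np.exp(W), z))    # True
print(np.allclose(z * np.exp(-W), W))   # True
```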

A.4. Replica symmetric order parameter equations

We next work out the RS form of equation (A.20) in the limit r → 0, using the same manipulations as in working out the previous function $\tilde{\varphi}_\textrm{RS}[\boldsymbol{C}]$, giving

Equation (A.46)

Upon taking the limit $\gamma\to\infty$ we then find the following remarkably simple result, in which the function $\xi(\ldots)$ is as given by equation (A.44):

Equation (A.47)

From this one then obtains the RS expressions of $\Lambda(t)$ and ${\mathcal S}(t)$, via (A.42) and (A.43).

Finally we need to work out the scalar order parameter equations for the trio $(u,v,w)$, by executing the extremization demanded in expression (A.45) for the free energy density. These equations are seen to take the form $\partial{\cal F}(u,v,w)/\partial u = \partial{\cal F}(u,v,w)/\partial v = \partial{\cal F}(u,v,w)/\partial w = 0$, with $\xi(\ldots)$ as given in (A.44), where

Equation (A.48)

By using identities such as $W(x)\text{exp}[W(x)] = x$, one finds that $u^2\Lambda(t)\text{exp}[\xi(t,\Delta,y,z)] = W(u^2\Lambda(t)\mathrm{e}^{\Delta u^2+vz+wy})$, which enables us to simplify ${\cal F}(u,v,w)$ to

Equation (A.49)

From this expression one then derives directly the final order parameter equations for $(u,v,w)$, in which only the equation for u involves some further manipulations:

Equation (A.50)

Equation (A.51)

Equation (A.52)

Finally, after some simple manipulations and usage of (A.52) one finds that at the saddle point $(u_\star,v_\star,w_\star)$ the value of ${\cal F}$ in expression (A.48) simplifies to

Equation (A.53)

Appendix B: Variance of inferred risk factors

In this appendix we derive equation (50) for the variance of the inferred risk factors $\hat{\boldsymbol{\beta}}_\textrm{ML}\!\cdot\!\boldsymbol{z}_i$, which is an important tool with which to remove the need for knowing the amplitude S in the construction of unbiased estimators. It exploits our result (26) for the asymptotic form of the distribution (12) (which is the type of quantity that within the replica theory is assumed to be self-averaging; see footnote 6):

Equation (B.1)

Upon using (18) we can thus write

Equation (B.2)

Upon using the order parameter equations (20)–(22) we can now simplify the right-hand side of the latter expression, and find the remarkably simple but powerful result

Equation (B.3)

Appendix C: Additional simulations

This appendix describes tests of the theory against simulations for further choices of the censoring rate, i.e. for different values of $\langle\Delta\rangle$. In C.1 we compare the measurements in additional simulations with the solutions of the RS equations, as in section 5. In C.2 we apply the algorithm of section 6 for different values of $\langle\Delta\rangle$, to investigate the robustness of the proposed algorithm. The data are generated as explained in the main text, i.e. the cumulative hazard is either of the Weibull form $\Lambda_0(t) = \frac{1}{2} t^2$ or of the Log-logistic form $\Lambda_0(t) = \text{log} (1+t^2)$, the true signal strength is $|\boldsymbol{\beta}_0| = 1$, and the censoring is uniformly distributed in the interval $[0,t_{\text{max}}]$. In order to vary the fraction of censored individuals we tune the end-of-trial time $t_{\text{max}}$. In particular, we consider the values $t_{\text{max}} = 2$ and $t_{\text{max}} = 6$.
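
A small Monte-Carlo sketch of how the expected non-censored fraction $\langle\Delta\rangle$ responds to $t_{\text{max}}$, under the same assumptions as the data-generating sketch of section 5 (the inverse-transform sampler for the event times is again an assumption not spelled out in the text, and the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def noncensored_fraction(t_max, inv_Lambda0, n=200_000, rng=rng):
    """Monte-Carlo estimate of <Delta> = P(T <= C) for standard Gaussian linear predictors
    beta_0 . z ~ N(0, 1) (so S = 1), Cox event times and uniform censoring on [0, t_max]."""
    lp = rng.standard_normal(n)                            # beta_0 . z, with S = 1
    T = inv_Lambda0(rng.exponential(size=n) * np.exp(-lp)) # solve Lambda_0(T) e^{lp} = Exp(1)
    C = rng.uniform(0.0, t_max, size=n)
    return np.mean(T <= C)

for t_max in (2.0, 6.0):
    print(t_max,
          noncensored_fraction(t_max, lambda x: np.sqrt(np.expm1(x))),   # log-logistic Lambda_0
          noncensored_fraction(t_max, lambda x: np.sqrt(2.0 * x)))       # Weibull Lambda_0
```

The resulting fractions can be compared directly with the approximate values of $\langle\Delta\rangle$ quoted above and in the figure captions below.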

In all cases it can be seen that the RS theory predicts the typical values of the estimators very accurately (even for n = 400).

C.1. Comparison of theory and simulations

In figures C1 and C2 we compare the predicted values of the overlaps $(\kappa_{\star},v_{\star})$, obtained by solving the RS equations, against the values $(\hat{\kappa}_n,\hat{v}_n)$ measured in simulations, for different values of ζ and for $t_{\text{max}} \in\{2,6\}$ (corresponding to $\langle\Delta\rangle\approx 40\%$ and $70\% $, respectively). We also show in figure C3 the comparison between the Breslow estimator $\hat{\Lambda}_n(.)$ (solid black lines) and the solution of the RS equations (yellow solid lines), when ζ = 0.3 at $t_{\text{max}} = 2.0$ (top row) and when ζ = 0.4 at $t_{\text{max}} = 6$ (bottom row). As a reference we plot the identity line (solid red), which corresponds to perfect recovery.


Figure C1. Comparison of theoretical predictions and simulation data, for Cox's survival analysis model with uniform censoring on the interval $[0,2]$ ($\langle\Delta\rangle\approx 40\% $) and data set size n = 400. Top row: overfitting-induced bias factor $\hat{\kappa}_n$ versus ζ; bottom row: overfitting-induced noise amplitude $\hat{v}_n$ versus ζ. Panels (a), (c): data for cumulative hazard $\Lambda_0(t) = \text{log}(1+t^2)$ . Panels (b), (d): data for cumulative hazard $\Lambda_0(t) = \frac{1}{2}t^2$. In all panels we plot with markers and error bars the simulation results, averaged over 500 independent simulations with distinct data realizations, and with solid lines the predictions $\kappa_\star$ and $v_\star$ of the RS theory (solved with populations of size $m = 10^5$).


Figure C2. Comparison of theoretical predictions and simulation data, for Cox's survival analysis model with uniform censoring on the interval $[0,6]$ ($\langle\Delta\rangle\approx 70\% $) and data set size n = 400. Top row: overfitting-induced bias factor $\hat{\kappa}_n$ versus ζ; bottom row: overfitting-induced noise amplitude $\hat{v}_n$ versus ζ. Panels (a), (c): data for cumulative hazard $\Lambda_0(t) = \text{log}(1+t^2)$ ($\langle\Delta\rangle \approx 50 \%$). Panels (b), (d): data for cumulative hazard $\Lambda_0(t) = \frac{1}{2}t^2$ ($\langle\Delta\rangle \approx 40 \%$). In all panels we plot with markers and error bars the simulation results, averaged over 500 independent simulations with distinct data realizations, and with solid lines the predictions $\kappa_\star$ and $v_\star$ of the RS theory (solved with populations of size $m = 10^5$).


Figure C3. Comparison of theoretical predictions and simulation data, for Cox's survival analysis model with uniform censoring on the interval $[0,t_{\text{max}}]$ and data set size n = 400. Top row: p = 120 (ζ = 0.3), $t_{\text{max}} = 2.0$ ($\langle\Delta\rangle\approx 40\%$); bottom row : p = 160 (ζ = 0.4), $t_{\text{max}} = 6.0$ ($\langle\Delta\rangle\approx 70\%$). Panels (a), (c): cumulative hazard $\Lambda_0(t) = \text{log}(1\!+\!t^2)$ . Panels (b), (d): cumulative hazard $\Lambda_0(t) = \frac{1}{2}t^2$. In all panels we plot with black lines the Breslow estimator $\hat{\Lambda}_n(.)$ versus the true cumulative hazard $\Lambda_0(.)$, for 500 independent simulations with distinct data realizations. Yellow curves show the predictions of the RS theory (solved with populations of size $m = 10^5$), and red curves indicate the diagonal (that would have been found for perfect regression).


C.2. Simulations test for the proposed decontamination algorithm

In figure C4 we compare the Breslow estimator $\hat{\Lambda}_n(.)$ (solid black lines) with its decontaminated version $\tilde{\Lambda}_n(.)$ (blue solid lines), when ζ = 0.3 at $t_{\text{max}} = 2$ (top row) and when ζ = 0.4 at $t_{\text{max}} = 6$ (bottom row). As a reference we also plot the diagonal (solid red), which corresponds to perfect estimation. We also show in figure C5 the histograms of the estimator $S_n$ for S, as obtained via the decontamination algorithm, together with $\tilde{\Lambda}_n(.),\tilde{\kappa}_n,$ and $ \tilde{v}_n$. The true value S = 1 is plotted as a dashed vertical line in black. Finally, in figures C6 and C7 we compare the histograms of the ML estimators $\mathbf{e}_1\cdot\hat{\boldsymbol{\beta}}_n$ and $\mathbf{e}_2\cdot\hat{\boldsymbol{\beta}}_n$ against their corrected values $\mathbf{e}_1\cdot\tilde{\boldsymbol{\beta}}_{n}$ and $ \mathbf{e}_2\cdot\tilde{\boldsymbol{\beta}}_n$, with $\tilde{\boldsymbol{\beta}}_n = \hat{\boldsymbol{\beta}}_n/\kappa_n$, and we superimpose the Gaussian asymptotic densities predicted by the RS theory (with means 1 and 0, respectively, and variance $v_{\star}^2/\kappa^2_{\star}p$). Since here $\boldsymbol{\beta}_0 = \mathbf{e}_1$, the true values are $\mathbf{e}_1\cdot\boldsymbol{\beta}_0 = 1 $ and $\mathbf{e}_2\cdot\boldsymbol{\beta}_0 = 0$.


Figure C4. Comparison of uncorrected and corrected integrated hazard rate estimators derived from simulation data, for Cox's survival analysis model with uniform censoring on the interval $[0,t_{\text{max}}]$and data set size n = 400. In all cases $\boldsymbol{\beta}_0 = \hat{\mathbf{e}}_1$, so S = 1. Top row: p = 120 (so ζ = 0.3), $t_{\text{max}} = 2.0$; bottom row: p = 160 (so ζ = 0.4), $t_{\text{max}} = 6.0$ . Panels (a), (c): cumulative hazard $\Lambda_0(t) = \text{log}(1\!+\!t^2)$. Panels (b), (d): cumulative hazard $\Lambda_0(t) = \frac{1}{2}t^2$. In all panels we plot with black lines the Breslow estimator $\hat{\Lambda}_{n}(.)$ versus the true cumulative hazard $\Lambda_0(.)$, for 500 independent simulations with distinct data realizations. Yellow curves show the predictions of the RS theory (solved with populations of size $m = 10^5$), and red curves indicate the diagonal (that would have been found for perfect regression). We also show in blue the de-biased estimator $\tilde{\Lambda}(.)$ for the cumulative hazard versus the true cumulative hazard $\Lambda_0(.)$, for the same 500 simulations.


Figure C5. Distributions of the values of the effective signal strength $S = |\boldsymbol{A}^{1/2}\boldsymbol{\beta}_0|$ inferred by our de-biasing algorithm, for the same data as those used in figures C4, C6 and C7. Top row: p = 120 (so ζ = 0.3), $t_{\text{max}} = 2.0$; bottom row: p = 160 (so ζ = 0.4), $t_{\text{max}} = 6.0$. In all panels the maximum of the histogram corresponds to the true value S = 1, in spite of the relatively modest data set size (around 240 non-censored events).


Figure C6. Distributions of the values of the non-zero component $\hat{\mathbf{e}}_1\!\cdot\!\boldsymbol{\beta}$ of the association vector, both for the standard ML Cox regression estimator (dark grey) and for the new estimator inferred by our de-biasing algorithm (light grey), for the same data as those used in figures C4, C5 and C7. Top row: p = 120 (so ζ = 0.3), $t_{\text{max}} = 2.0$; bottom row: p = 160 (so ζ = 0.4), $t_{\text{max}} = 6.0$. We also show as a vertical dashed line the location of the true value $\hat{\mathbf{e}}_1\!\cdot\!\boldsymbol{\beta}_0 = 1$, and as a solid curve the predicted Gaussian asymptotic distribution of the new estimator, with average 1 and variance $v_{\star}^2/ \kappa^2_{\star}p$. We observe that the new estimator indeed removed successfully the overfitting-induced bias of the ML one, while its distribution exhibits finite size effects.


Figure C7. Distributions of the values of the zero component $\hat{\mathbf{e}}_2\!\cdot\boldsymbol{\beta}$ of the association vector, both for the standard ML Cox regression estimator (dark grey) and for the new estimator inferred by our de-biasing algorithm (light grey), for the same data as those used in figures C4–C6. Top row: p = 120 (so ζ = 0.3), $t_{\text{max}} = 2.0$; bottom row: p = 160 (so ζ = 0.4), $t_{\text{max}} = 6.0$. We also show as a vertical dashed line the location of the true value $\hat{\mathbf{e}}_2\!\cdot\!\boldsymbol{\beta}_0 = 0$, and as a solid curve the predicted Gaussian asymptotic distribution of the new estimator, here with zero average and variance $v_{\star}^2/ \kappa^2_{\star}p$. Here, as expected, already the ML estimator was unbiased, since the theory predicts the overfitting-induced bias to take the form of a multiplicative factor $\kappa_\star$ (which would here just multiply the number zero).


In all cases we see that the proposed algorithm succeeds in correcting for the overfitting bias: in figure C4 the blue lines are distributed around the (red) diagonal, and in figures C5–C7 the modes of the histograms of $S_n$, $\mathbf{e}_1\cdot \tilde{\boldsymbol{\beta}}_n$ and $\mathbf{e}_2\cdot \tilde{\boldsymbol{\beta}}_n$ reproduce the true values $S = 1,~\mathbf{e}_1\cdot\boldsymbol{\beta}_0 = 1$ and $\mathbf{e}_2\cdot \boldsymbol{\beta}_0 = 0$.

In order for our algorithm to converge, it is necessary that the ML estimator actually exists. Hence we expect that the algorithm will converge when $\zeta = p/n$ is sufficiently far from the critical value that marks the censoring equivalent of the phase transition derived in [11]. In a finite size simulation that transition is not sharp, and there will therefore be data instances even below the phase transition line for which the ML estimator does not exist. The (future) inclusion of a ridge penalization should cure these pathologies, which are due to the use of ML regression, rather than to the algorithm itself. Nevertheless, estimating correctly the hazard rate of the event under study will always be more difficult when only a few primary events are observed in a sample. This implies that any algorithm must inevitably become less reliable as $\langle\Delta\rangle$ decreases toward zero.

Footnotes

4. This will be trivially true if the covariate distribution $p(\boldsymbol{z})$ is itself Gaussian, but also if the conditions for the central limit theorem to hold apply (i.e. if the components of z are not too strongly correlated, and the components of β are not too dissimilar in their scaling with n; see [19]).

5. In earlier papers [5, 9, 10, 13] the definition (A.7) involved $\boldsymbol{\beta}_\alpha\!\cdot\!\boldsymbol{z}/\sqrt{p}$ instead of $\boldsymbol{\beta}_\alpha\!\cdot\!\boldsymbol{z}$, so the components of $\boldsymbol{\beta}_\alpha$ would typically be of order ${\mathcal O}(1)$. The present choice links more directly to the standard ML algorithms, but is the reason why in (11) the regularizer constant $\frac{1}{2}p\text{log} p$ was required.

6. This assumption is validated by the accuracy of the predictions of the RS equations, as confirmed repeatedly in previous studies [5, 9, 10, 21], and also again in our present numerical simulations.
