Abstract
We use statistical mechanics techniques, viz. the replica method, to model the effect of censoring on overfitting in Cox's proportional hazards model, the dominant regression method for time-to-event data. In the overfitting regime, Maximum Likelihood (ML) parameter estimators are known to be biased already for small values of the ratio of the number of covariates over the number of samples. The inclusion of censoring was avoided in previous overfitting analyses for mathematical convenience, but is vital to make any theory applicable to real-world medical data, where censoring is ubiquitous. Upon constructing efficient algorithms for solving the new (and more complex) Replica Symmetric (RS) equations and comparing the solutions with numerical simulation data, we find excellent agreement, even for large censoring rates. We then address the practical problem of using the theory to correct the biased ML estimators without knowledge of the data-generating distribution. This is achieved via a novel numerical algorithm that self-consistently approximates all relevant parameters of the data generating distribution while simultaneously solving the RS equations. We investigate numerically the statistics of the corrected estimators, and show that the proposed new algorithm indeed succeeds in removing the bias of the ML estimators, for both the association parameters and for the cumulative hazard.
Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
Inference relies on the foundations provided by classical statistical theory, which was developed for use in settings where the number p of covariates is small compared to the number n of observations [1, 2]. The standard Maximum Likelihood (ML) method for estimating model parameters indeed fails in the high-dimensional regime where p is of the same order as n [3–8], due to overfitting. Overfitting is the phenomenon whereby data noise is misinterpreted as signal, leading to biased parameter estimators with large sample-to-sample fluctuations. An overfitted model will predict outcomes for training examples well, but will fail in predicting outcomes for new data. Early strategies to mitigate overfitting include leaving out covariates (with the risk of overlooking relevant predictive information), penalization (equivalent to adding a parameter prior in Bayesian language), and shrinking parameter estimates after model fitting. The penalization weight or shrinking factor is typically estimated via bootstrapping or cross-validation, which forces one to sacrifice some of the training data; this makes the overfitting even worse. Moreover, penalization and shrinking may at best repair the overfitting-induced bias in the length of the parameter vector, but not the bias in its direction (which will appear as soon as the covariates are correlated [9]).
In order to use existing regression models also in the overfitting regime, it is vital to correctly quantify and undo the effects of overfitting, both for inference and for prediction purposes. As a consequence, in recent years we have seen increased research efforts aimed at extending the classical inference theory to the so-called proportional asymptotic regime, characterized by taking the limit p, n → ∞ while keeping the ratio ζ = p/n fixed. Several mathematical methods from the domains of the statistical physics of disordered systems [5, 8–10], computer science [7, 11] and statistics [3, 4, 6] have by now been applied successfully in this latter regime to model the statistics of ML and Maximum A Posteriori Probability (MAP) estimators.
The condition that p be small compared to n for ML/MAP inference to be used safely is especially problematic in post-genome medicine: we can now routinely measure very many variables per patient and want to use these to develop personalized therapies, but overfitting prevents us from doing so. For time-to-event data, the most common type in medicine, the main regression tools are variations on the model of Cox [12]. Overfitting in this model was analysed in [5, 13] as a statistical mechanics problem, via the replica method [14]. A later study [11] used the Convex Gaussian Min-max Theorem [7], with the logarithm of Cox's partial likelihood as utility function, to study overfitting in the association parameters. The latter route avoids the use of replicas, but, unlike [5, 13], cannot model overfitting in the base hazard rate.
Unlike some applied statistical studies on the subject, see e.g. [15, 16], all theoretical studies of overfitting based on the replica or cavity approach have so far assumed for simplicity that the data were not subject to censoring. Censoring means that samples are lost to observation prior to events occurring. It is ubiquitous in medicine, since medical studies are always of finite duration and also since patients often fail to return to hospital appointments for unknown reasons. Before the modern overfitting analysis methods for the proportional asymptotic regime, and their corresponding overfitting bias decontamination formulae, can be applied in the real world, it is hence vital that the theory of [5, 13] is extended to include censoring. That is the first aim of this paper.
We generalise the theory of [5, 13] by allowing for arbitrary types of non-informative censoring, carry out a replica analysis in the regime p, n → ∞ with ζ = p/n fixed, and derive the extended RS equations. Methodologically we also improve upon the approach of [5, 13] by: (i) using a more compact form of the replica approach, (ii) avoiding the variational approximation for the base hazard rate, and (iii) constructing novel and more powerful numerical algorithms for solving and inverting the RS equations, including in cases where one knows nothing about the data-generating distribution, resulting in precise algorithms for decontaminating estimators of the association parameters and the integrated base hazard of their overfitting-induced bias.
This paper is organized as follows. In section 2 we define our notation and describe briefly the Cox model. We explain the results obtained via the replica method (whose derivation is relegated to an appendix) and the physical interpretation of the RS order parameters in section 3, and show in section 4 how this interpretation inspires an efficient numerical method for solving the RS equations. In section 5 we test the predictions of the theory against numerical simulations, and find perfect agreement. We construct in section 6 a new algorithm for simultaneously inferring relevant characteristics of the data generating distribution and solving the RS equations, leading to a realistic and practical tool for correcting the biased ML estimates of association parameters and the nuisance parameters (i.e. the integrated base hazard) in the overfitting regime, even in the presence of censoring. We conclude in section 7 with a discussion of our results and future research.
2. Definitions
In time-to-event data, each observation ideally reports the time T at which the subject experiences the event under investigation and the covariate vector z, i.e. the list of characteristics of the subject measured at time zero. All covariate vectors are assumed to have been drawn independently from some distribution. Since any non-zero average will drop out of our formulae, we can always take z to have average zero. However, in virtually all real-world time-to-event data sets observations are censored, i.e. for some subjects we have a missing or partial observation of their event times. For instance, we might only know that an event occurred after or before a certain time point, or within a specific interval (called right, left and interval censoring, respectively). In what follows we focus on right censoring, the most common form of censoring in medical applications. In practice this amounts to assuming that when collecting data we actually observe the time t = min(T, C).
The random variable C models the censoring time, drawn randomly from an a priori unknown distribution, and we are given a binary variable Δ that indicates whether the subject has at time t experienced the event (Δ = 1) or has been censored (Δ = 0); i.e. Δ = 1 if T ≤ C, and Δ = 0 otherwise.
Note that we can always write , where is the censoring rate and . The Cox semi-parametric model [12] assumes that the time at which a subject experiences the event is distributed according to
where λ0(t) is the true (non-negative) hazard rate, and Λ0(t) is the true cumulative hazard. The censoring times C and the event times T are assumed to be statistically independent, i.e. the censoring is said to be non-informative, and the joint distribution of observed times and event type indicators, given covariates z, can then be written as
The parameters one seeks to infer are the true vector and the true function . It follows from (4) upon computing the log-likelihood density for a data-set of i.i.d. observations , that ML inference will give the estimators
The problem (5) can be reduced by first maximizing over the hazard rate. Setting the functional derivative with respect to λ to zero gives an equation that can be solved for , leading to the so-called Breslow estimator [12]:
where θ denotes the step-function (θ(x) = 1 for x > 0 and θ(x) = 0 for x < 0). Substituting (7) into (6) and disregarding terms that are independent of β we obtain
The right-hand side of (9) is the logarithm of the famous Cox partial likelihood [12].
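As a concrete illustration of the two central objects above, the following minimal sketch evaluates the Cox log-partial likelihood (9) and the Breslow estimator (7) for a given β, assuming no tied event times; all function and variable names here are our own illustration, not notation from the paper.

```python
import numpy as np

def cox_partial_loglik(beta, t, delta, Z):
    """Logarithm of Cox's partial likelihood (assumes no tied event times)."""
    risk = Z @ beta                          # linear predictors beta . z_i
    ll = 0.0
    for i in np.where(delta == 1)[0]:        # sum over observed events only
        at_risk = t >= t[i]                  # risk set at event time t_i
        ll += risk[i] - np.log(np.exp(risk[at_risk]).sum())
    return ll

def breslow_cumhaz(beta, t, delta, Z, times):
    """Breslow estimator of the cumulative hazard, evaluated at `times`."""
    w = np.exp(Z @ beta)
    return np.array([
        sum(1.0 / w[t >= t[i]].sum()
            for i in np.where((delta == 1) & (t <= s))[0])
        for s in times
    ])
```

For β = 0 and three uncensored events, for instance, the partial log-likelihood reduces to −log(3·2·1) and the Breslow estimator to the sum of inverse risk-set sizes, which makes the 'staircase' character of the estimator evident.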
3. Typical behaviour of the Maximum (Partial) Likelihood estimator
Computing the estimator in (8) is equivalent to finding the ground state of a fictitious physical system with Hamiltonian , in which the association parameters β act as the degrees of freedom, and the data-set plays the role of quenched disorder. Statistical physics thus allows us to compute the average over the disorder (i.e. the data ) of the log-partial likelihood, giving the typical behaviour of the ML estimator, in the proportional asymptotic regime as
In fact, for reasons that will become clear, we insert a constant regularization factor (whose impact will be zero due to the limit ), and replace (10) by
The standard procedure to carry out the disorder average in parametric regression models via the replica method (which goes back to [17]) builds on the assumption that the log-likelihood is a sum of n independent terms, each depending on a single observation . For the present Hamiltonian this is not the case, so we need an intermediate step. We introduce the empirical distribution
and observe that we can write the Hamiltonian (i.e. minus the log-partial likelihood) as the following functional:
We can now write (11) in a form where the disorder average effectively is computed over the density of states associated with (12). Upon introducing suitable delta-functionals one then achieves factorisation of the disorder average over the samples i in the data set. The full replica derivation is given in appendix A, and yields
with
and
Here we used the standard Gaussian short-hand Dz = e^{−z²/2} dz/√(2π), and W(x) is the Lambert W-function [18], defined by the equation W(x) e^{W(x)} = x for all x. The only condition on the covariate statistics needed in the derivation of the above RS equations is that for p → ∞ the true and inferred linear predictors acquire Gaussian statistics. The extremum in (15) is found to satisfy the so-called RS equations:
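The Lambert W-function can be evaluated with, e.g., scipy.special.lambertw; for a dependency-free alternative on x ≥ 0 (the principal branch), Newton iteration on w e^w = x converges rapidly. A minimal sketch (our own helper, not code from the paper):

```python
import math

def lambert_w(x, tol=1e-12, max_iter=100):
    """Principal branch W(x) for x >= 0, solving w * exp(w) = x by Newton."""
    w = math.log1p(x)                 # reasonable starting point for x >= 0
    for _ in range(max_iter):
        e = math.exp(w)
        step = (w * e - x) / (e * (1.0 + w))   # Newton step for f(w) = w e^w - x
        w -= step
        if abs(step) < tol:
            return w
    return w
```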
These can be written in alternative forms, for instance by using identities such as , , and via integration by parts over z in (20), giving
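For reference, the Gaussian integration by parts invoked here is the standard Stein identity: for any differentiable f of suitable decay,

```latex
\int \! Dz \; z \, f(z) \;=\; \int \! Dz \; f'(z),
\qquad Dz \equiv \frac{\mathrm{e}^{-z^{2}/2}}{\sqrt{2\pi}}\,\mathrm{d}z ,
```

which follows upon a single integration by parts, since z e^{−z²/2} = −(d/dz) e^{−z²/2}.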
For the distribution and the functions we find in the RS ansatz the following expressions (see appendix A):
We note that an alternative way to derive equations (20)–(22) (and a corresponding equation for ) would have been to make appropriate choices such as and in the RS formulae of [9]. However, that alternative route would not have generated the powerful expression (26) found with the present approach. By retracing its derivation and making simple adaptations, (26) can be generalized to
where
(with the standard convention that the ratio ζ = p/n is kept fixed in the limits p, n → ∞). This result shows very transparently the impact of overfitting on the inferred risk factors, relative to their true values.
At this point it is helpful to write . The interpretation of the values as solved from the RS equations above can be inferred directly from the results in [9]:
with ζ = p/n fixed as p, n → ∞. The importance of these order parameters derives from the facts that (asymptotically) the distribution of the inferred associations depends only on these two quantities [10], and that the overfitting-induced inference bias and noise can each be expressed in terms of one of them. The function has the following interpretation:
with fixed, and with the Breslow estimator as given in (7).
4. Numerical solution of RS equations
For fully parametric GLM models in the overfitting regime the RS equations can always be solved numerically in a relatively straightforward iterative manner by evaluating the relevant integrals via Gaussian quadratures. Here, for the Cox model with censoring, we have the added complication of the functional order parameter , which suggests we take a different approach inspired by population dynamics algorithms and our interpretation (26)–(28). For any given values of the scalar order parameters , and given values of , we can generate samples from the distribution
resulting in . We then define for each the quantities , computed via (17), and the estimator :
For sufficiently large population size m, the iterative algorithm defined by the mapping will now by construction have as its fixed-point the solution of (26)–(28). Furthermore, it is relatively easy to invert numerically equation (24) and write u for fixed as a function of ζ, for instance via Newton's method, so that ζ can always be chosen as the independent parameter of our RS equations (as it is always known). We can then solve the following surrogate set of RS equations:
where now
(e.g. via damped fixed point iteration). When simulating the data, is known and, empirically, we find that substituting equation (38) with its equivalent version, obtained by Gaussian integration by parts of (25),
leads to much faster convergence with smaller population size and hence the latter is adopted instead of (38).
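The damped fixed-point iterations referred to in this section can be organized around a generic helper of the following form (a sketch with our own naming; in our setting F would wrap one population-dynamics sweep updating the scalar order parameters):

```python
import numpy as np

def damped_fixed_point(F, x0, damping=0.5, tol=1e-10, max_iter=10_000):
    """Solve x = F(x) by damped iteration x <- (1 - d) x + d F(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = (1.0 - damping) * x + damping * np.asarray(F(x), dtype=float)
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    raise RuntimeError("fixed-point iteration did not converge")
```

Damping (d < 1) trades speed for stability, which matters here because the population-dynamics estimates of the right-hand sides are noisy at finite population size m.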
5. Simulation tests of the RS theory
When some observations are censored, it is known that in the proportional asymptotic regime the ML estimator does not always exist [20], and a recent paper [11] established that, asymptotically and under the assumption of Gaussian covariates, the existence of the ML estimator undergoes a sharp phase transition at some critical value ζc. Here we focus on the region where the ML estimator does exist. Since one obviously cannot solve the RS equations for all possible choices of the hazard function, the censoring rate and the true association amplitude, we have in our simulations chosen a censoring distribution that is uniform between 0 and the end-of-trial time, reflecting the realistic scenario of a clinical trial where patients are recruited at a constant rate for the trial duration. Different choices of S, the censoring distribution and the true hazard rate will lead to different expected fractions of true (non-censored) events:
To test the predictions of our RS equations we generated 500 survival data-sets of size n = 400, with uniformly distributed censoring, using cumulative hazards of the Log-logistic and Weibull form. We used uncorrelated unit-variance Gaussian covariates, and a true association vector equal to the first unit vector, hence S = 1. We then applied Cox's ML regression protocol to infer the associations and the cumulative hazard (via Breslow's formula), giving the corresponding estimators. From these we subsequently computed the overfitting markers in (31) and (32).
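Synthetic data sets of this type can be generated by inverse-transform sampling: under the proportional hazards model, the conditional cumulative hazard of T evaluated at T is a unit exponential variable, so one draws E ~ Exp(1) and inverts. A sketch under illustrative assumptions (we take a Weibull cumulative hazard of the form t², which need not match the paper's exact parameter choices):

```python
import numpy as np

def simulate_cox(n, p, beta, tau, rng):
    """Right-censored Cox data with Weibull base cumulative hazard t^2
    (an illustrative choice) and uniform censoring on [0, tau]."""
    Z = rng.standard_normal((n, p))       # unit-variance Gaussian covariates
    E = rng.exponential(size=n)           # unit exponential variates
    T = np.sqrt(E * np.exp(-Z @ beta))    # inverse of Lambda(t) = t^2
    C = rng.uniform(0.0, tau, size=n)     # censoring times
    t = np.minimum(T, C)                  # observed times
    delta = (T <= C).astype(int)          # 1 = true event, 0 = censored
    return t, delta, Z

rng = np.random.default_rng(0)
beta = np.zeros(40); beta[0] = 1.0        # first unit vector, so S = 1
t, delta, Z = simulate_cox(400, 40, beta, 4.0, rng)
```

The observed event fraction delta.mean() (and hence the censored fraction 1 − delta.mean()) can then be tuned via the end-of-trial time tau.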
Below we show results obtained when , corresponding to , implying that around 40% of the events are censored, for both choices of . This relatively large censored fraction was deliberately chosen to confirm that the new theory is capable of predicting the behaviour of the various estimators, despite the complication posed by censoring. Further evidence in the form of additional simulations for different values of and hence of is presented in appendix C.
In figure 1 we plot the inferred cumulative hazards versus the true cumulative hazards, which can be done unambiguously since both are always non-decreasing functions of time. By the nature of the Breslow estimator (7), these curves take the form of 'staircase' functions. We also show in the same panels the RS prediction of the relation between the two quantities, as red dashed lines. We conclude from figure 1 that in all cases the solution of the RS equations correctly predicts the typical values of the inferred cumulative hazard. In figure 2 we compare the replica predictions (solid lines) with the corresponding measurements (markers with error bars), for different values of ζ. Markers give the averages over the 500 data set realizations; error bars indicate the standard deviations. Again we observe excellent agreement between the RS theory and the simulations, in spite of the relatively small sample size n = 400. The latter feature is important for applications of the theory, since real data sets in survival analysis indeed often have sizes of that order.
6. De-biasing protocols for overfitted ML estimators
Given the agreement between theory and simulations, we now explore the potential of using our theory as a systematic tool with which to decontaminate ML estimators and build asymptotically unbiased estimators. Two obstacles appear initially to hinder direct application of our RS equations for bias decontamination. First, our equations involve the amplitude S (which is not directly accessible). Second, bias decontamination requires inverting the complex relation between the inferred and true integrated hazards: we have already constructed an algorithm for computing the inferred integrated hazard in terms of the true one, but now we need to compute the true one given its inferred counterpart. In this section we remove both obstacles, and show that the RS theory can indeed be used for creating accurate unbiased estimators in the overfitting regime.
6.1. Derivation of a practical algorithm for de-biasing
Given the assumptions of our theory, we may write the marginal outcome probability as
Hence the log-marginal likelihood density for n i.i.d. observations reads
where we introduced the function
We next compute the ML estimators for by maximization of (44). Taking the functional derivative with respect to gives, after standard manipulations, the following estimator for the hazard rate:
Upon integrating both sides over t, and replacing in the right-hand side the (as yet unknown) value of by the estimator one then finds the following simple fixed-point equation for , that does not involve any parameter that could be susceptible to overfitting:
The result of solving this fixed-point equation by (damped) iteration is remarkably accurate, as will be shown below. Repeating the same steps for the censoring rate gives the unbiased ML estimator
One could in principle attempt to determine S in the same way: extremize (44) with respect to S, and solve the resulting equation simultaneously with (47) for and S. Unfortunately, the resulting equations admit the trivial solution S = 0, describing the trivial situation where the outcomes are independent of the covariates. Thus we need an independent extra equation to determine the value of S, which must be added to and solved simultaneously with the RS order parameter equations and (47). We first note that (31) and (32) imply
with fixed as . If the covariate correlation matrix A is known, the above could serve as the desired extra equation. If the covariate correlation matrix is not known, as would generally be the case, we can use the following result, which exploits the fact that in regression the n quantities are always observable:
This identity is derived in appendix B.
The final result is the following algorithm, which seeks to combine optimally the RS theory with the available data and the biased ML estimators. After measuring the left-hand side of (50), one solves numerically the four scalar equations (23)–(25) and (50) simultaneously, e.g. via (damped) fixed-point iteration, with the short-hands (18) and (17). The censoring rate in (17) is estimated by the time derivative of (48), and for each value of S the base hazard by the time derivative of the result of iterating (47) to a fixed-point. The various Gaussian integrals can be done via Gauss-Hermite quadrature. The resulting new and unbiased estimator for the association parameters then follows from (31) by rescaling the ML estimator accordingly.
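The Gaussian integrals mentioned in the last step reduce to expectations over a standard normal variable, for which Gauss-Hermite quadrature is well suited. A minimal sketch (our own helper name; note the √2 rescaling required by NumPy's physicists' convention):

```python
import numpy as np

def gauss_expectation(f, order=40):
    """E[f(xi)] for xi ~ N(0, 1), via Gauss-Hermite quadrature:
    E[f] = (1/sqrt(pi)) * sum_k w_k f(sqrt(2) * x_k)."""
    x, w = np.polynomial.hermite.hermgauss(order)
    return float(np.sum(w * f(np.sqrt(2.0) * x)) / np.sqrt(np.pi))
```

With order around 40, smooth integrands of the type appearing in the RS equations are evaluated essentially to machine precision.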
6.2. Simulation tests of the proposed new estimators
In figure 3 we plot both the uncorrected and the de-biased estimator for the cumulative hazard, for simulations of survival data with n = 400 and around 40% of events censored. These data show convincingly that the ML estimator (shown in black) is quite biased, with average values as predicted by the RS theory (dashed yellow line), but that the corrected estimator (shown in blue), when plotted against the true cumulative hazard, is centered around the red dashed line (the diagonal, corresponding to unbiased estimation). This is remarkable, given that our de-biasing algorithm does not require any knowledge of the data generating process, apart from assuming the form of a semi-parametric proportional hazards model (4). As expected, the corrected estimator exhibits a larger variance than the ML estimator (reflecting the general and well-known bias-variance trade-off in inference). The increasing variance of both estimators at large times is simply a finite size effect: already at t = 2.5 the survival probability is below 0.1, so only a few subjects are still alive and able to generate events that may contribute to hazard rate inference. Similarly we show in figures 4–6 the differences between the ML estimators and the new de-biased estimators for the amplitude S (the 'signal strength' of the true associations), and for the nonzero and zero components of the true association vector, respectively. Additional simulations for different values of the censoring rate have been carried out, and the corresponding results can be found in appendix C.
Our simulations support the conclusion that the new decontamination algorithm proposed above indeed leads to virtually unbiased estimators for the association parameters, the signal strength S, and the cumulative hazard, even for relatively small data sets and in the presence of significant levels of censoring. This is a nontrivial result, since we have only required that there is no model mismatch, i.e. that the data were generated from a Cox model, and that the distribution of the risk factors will asymptotically be Gaussian. In particular, the algorithm does not require a priori knowledge of any of the parameters of the data generating process.
7. Conclusion
In this paper we have extended previous analytical studies on overfitting in high dimensional Cox regression for time-to-event data [5, 9, 13], by including censoring. Censoring is a necessary ingredient for any survival analysis theory if its results are to be applied to real medical data (the main application area of survival analysis). While still using the replica method as our main tool, in addition to the inclusion of censoring we have here also chosen a different route to carry out the analysis compared to [5, 9, 13], which enabled us to derive an explicit formula for the inferred distribution of risk factors. This latter formula, in turn, paved the way for the creation of new algorithms for applications of the replica-symmetric (RS) theory, that previously required additional knowledge and/or additional approximations. All our new theoretical results and predictions are validated using numerical simulations.
We consider the main deliverables of the present paper to be the following three. First, we now have an accurate theory with which to understand quantitatively the impact of censoring on overfitting phenomena in the Cox model, so that replica-based overfitting analysis can now be applied to realistic real-world problems in medicine (i.e. to data from clinical studies of finite as opposed to infinite duration). Second, we are now able to solve directly and accurately the RS equations for the inferred cumulative hazard, instead of using the variational approximations proposed in [5, 9, 13]. Third, we have been able to construct from the present theory a practical and accurate algorithm with which to decontaminate inferences for overfitting-induced bias, that does not require knowledge of any of the parameters of the true data generating distribution. The latter algorithm combines in a very effective way the RS equations with those quantities that are measurable in ML regression, and infers the effective signal strength as a by-product.
The results presented here thus have not only a theoretical but also a very practical value. The availability of tools for de-biasing the important estimators of the Cox model implies that one can now achieve significantly improved outcome predictions in realistic clinical scenarios, compared to what would have been achievable with ML regression alone, either by using more covariates for prediction (i.e. increasing ζ = p/n by increasing p), or by using fewer patients to do so (i.e. increasing ζ by decreasing n), or by allowing for increased levels of censoring without detrimental consequences. To the best of our knowledge, the presently proposed bias decontamination algorithm, which corrects both association parameters and nuisance parameters (i.e. the cumulative hazard in the Cox model), is the first of its kind.
We envisage future work to involve application of the new estimator decontamination algorithm to real (partially censored) survival data, finding an expression or equation for the transition point ζc of the Cox model in the presence of censoring (via the statistical mechanics route), and working out the RS theory upon including ridge penalization in the Cox formulae for survival analysis with censoring. We also envisage generalizing the new decontamination algorithm to more general GLM regression models, where the efficient computation of the signal strength S, or other possible parameters of the true data generating model, and of the nuisance parameters, have in the past always required unwelcome approximations that we expect will now no longer be necessary.
Acknowledgment
A M would like to thank Mansoor Sheikh and Fabian Aguirre Lopez for constructive discussions. This work was partially supported by the Medical Research Council of the United Kingdom (Grant MR/L01257X/1).
Data availability statement
The data that support the findings of this study are available upon reasonable request from the authors.
Appendix A: Replica derivation
We start from formulae (12)–(14) for the Hamiltonian, being minus the partial log-likelihood of the Cox model. The relevant information on the typical estimators follows from the asymptotic disorder-averaged free energy density in the limit, viz. from , where
To compute the average of the logarithm we use the replica method, giving
A.1. The replicated free energy density
In order to compute we first write in a more convenient form. We introduce the functional delta distribution
in which and are normalized such that . We can now write
For integer r we can next do the average over the data :
in which , is the distribution from which the n covariate vectors z i were drawn, is defined in (4), and
The integrand in the last line of (A.6) depends on the vectors only via the linear predictors . We assume that the distribution of these predictors is Gaussian, either because is Gaussian, or because for the central limit theorem will apply. Since , we may now write, with :
To describe the generation or explanation of finite event times, the linear predictors will always have to remain finite. Hence for realistic data sets the matrix elements Cαρ must remain of order O(1) even in the limit p → ∞. If we now also insert the integral, and write the δ-function in integral form, we find:
where, using ,
with
It now follows, after exchanging the limits and r → 0, that the asymptotic disorder-average free energy density (A.3) can be computed via steepest descent:
A.2. Saddle point integration
We first derive the stationary conditions for (A.11) with respect to the functions and , via functional differentiation and using (14). This gives, respectively,
Combination of (7) with (12) enables us to recognize in (A.16) the replicated cumulative hazard function,
with which we may write (A.16) more compactly as
Insertion into (A.15) then simplifies the latter to
Upon using also a modest amount of foresight we transform . We then find that our saddle point problem (A.14) can be written as follows:
where
with
We have not used (A.20) to derive (A.22), only (A.19). Hence in (A.21) the functions indeed still have the status of independent variational parameters, and functional differentiation of (A.22) with respect to must therefore reproduce (A.20).
A.3. The replica symmetry (RS) ansatz
We now make the so-called replica symmetric (RS) ansatz. Replica symmetry should affect only the nonzero replica labels , so for the present order parameters C and D the RS ansatz takes the following form:
The RS form of C implies the same for its inverse, where we define and
After some algebra one then finds that
With the RS ansatz we can simplify the functions (A.23) and (A.24). We start with , using the short-hand :
Next we apply the RS ansatz to :
Hence
We then find, using identities such as , (obtained by diagonalizing the RS form of C ), as well as , , , and , that the RS ansatz implies the following for expression (A.22) (modulo irrelevant constants):
We can now extremize over , as demanded by (A.21), giving
Our various formulae take more compact forms upon introducing three short-hands, which we know from earlier studies to have finite limits. We may then write
and
We may hence now write as follows:
with as defined by (14), and
The maximization over x in (A.41) is achieved for , where
with Lambert's function W(z), the inverse of the function y ↦ y e^y. Note that expression (A.44) can be interpreted in terms of a Moreau envelope, and is in other approaches to overfitting indeed often found via that route [8]. Hence we now arrive at
which can be written in alternative ways, using identities such as .
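To illustrate how Lambert's function arises from a Moreau-envelope (proximal) problem, consider the generic one-dimensional minimization below, shown with illustrative parameters a and τ > 0 rather than the precise arguments appearing in (A.44):

```latex
\min_{x}\left\{ \frac{(x-a)^{2}}{2\tau} + \mathrm{e}^{x} \right\} :
\qquad
\frac{x^{\star}-a}{\tau} + \mathrm{e}^{x^{\star}} = 0
\;\Longrightarrow\;
(a-x^{\star})\,\mathrm{e}^{\,a-x^{\star}} = \tau\,\mathrm{e}^{a}
\;\Longrightarrow\;
x^{\star} = a - W\!\left(\tau\,\mathrm{e}^{a}\right).
```

Setting y = a − x turns the stationarity condition into y e^y = τ e^a, which is solved by the principal branch of W.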
A.4. Replica symmetric order parameter equations
We next work out the RS form of equation (A.20) in the limit r → 0, using the same manipulations as in working out the previous function , giving
Upon taking the limit we then find the following remarkably simple result, in which the function is as given by equation (A.44):
From this one then obtains the RS expressions of and , via (A.42) and (A.43).
Finally we need to work out the scalar order parameter equations for the trio , by executing the extremization demanded in expression (A.45) for the free energy density. These equations are seen to take the form , with as given in (A.44), where
By using identities such as , one finds that , which enables us to simplify to
From this expression one then derives directly the final order parameter equations for , in which only the equation for u involves some further manipulations:
Finally, after some simple manipulations and usage of (A.52) one finds that at the saddle point the value of in expression (A.48) simplifies to
Appendix B: Variance of inferred risk factors
In this appendix we derive equation (50) for the variance of the inferred risk factors, which is an important tool with which to remove the need for knowing the amplitude S in the construction of unbiased estimators. It exploits our result (26) for the asymptotic form of the distribution (12) (which is the type of quantity that within the replica theory is assumed to be self-averaging):
Upon using (18) we can thus write
Upon using the order parameter equations (20)–(22) we can now simplify the right-hand side of the latter expression, and find the remarkably simple but powerful result
Appendix C: Additional simulations
This appendix describes tests of the theory against simulations for further choices of the censoring rate, i.e. for different values of . In C.1 we compare the measurements in additional simulations with the solutions of the RS equations, as in section 5. In C.2 we apply the algorithm of section 6 for different values of , to investigate the robustness of the proposed algorithm. The data are generated as explained in the main text, i.e. the cumulative hazard is either of the Weibull form or of the Log-logistic form , the true signal strength is , and the censoring is uniformly distributed in the interval . In order to vary the fraction of censored individuals we tune the end-of-trial time . In particular, we consider the values and .
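The data-generating protocol just described can be sketched as follows. This is an illustrative Python implementation for the Weibull case only; the covariate distribution, its scaling, and the identification of the signal strength S with the norm of the association vector are assumptions made here for the sketch, not specifications from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_cox(n=400, p=20, S=1.0, lam=1.0, k=2.0, t_end=2.0):
    """Cox data with Weibull cumulative hazard L0(t) = (t/lam)**k and
    censoring times drawn uniformly from [0, t_end]; tuning t_end varies
    the fraction of censored individuals."""
    beta = rng.standard_normal(p)
    beta *= S / np.linalg.norm(beta)              # signal strength as |beta| = S (assumption)
    z = rng.standard_normal((n, p)) / np.sqrt(p)  # Gaussian covariates, O(1) risk scores (assumption)
    r = z @ beta
    # survival S(t|z) = exp(-L0(t)*exp(r)); invert for the event time
    T = lam * (-np.log(rng.random(n)) * np.exp(-r)) ** (1.0 / k)
    C = t_end * rng.random(n)                     # uniform censoring on [0, t_end]
    return np.minimum(T, C), T <= C               # observed times, event indicators

t_obs, event = simulate_cox()
print(f"fraction censored: {1.0 - event.mean():.2f}")
```

Decreasing t_end shifts more censoring times below the event times and thereby raises the censored fraction, which is how the different censoring levels in the figures below would be produced.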
In all cases it can be seen that the RS theory predicts the typical values of the estimators very accurately (even for n = 400).
C.1. Comparison of theory and simulations
In figures C1 and C2 we compare the predicted values of the overlaps , obtained by solving the RS equations, against the values in simulations, for different values of ζ and for (corresponding to and , respectively). We also show in figure C3 the comparison between the Breslow estimator (solid black lines) and the solution of the RS equations (yellow solid line) when ζ = 0.3 at (top row) and when ζ = 0.4 at (bottom row). As a reference we plot the identity line (solid red), which corresponds to perfect recovery.
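The Breslow estimator of the cumulative baseline hazard used in these comparisons takes its standard textbook form; a minimal sketch in Python (function and variable names chosen here for illustration, and exercised on made-up data) is:

```python
import numpy as np

def breslow(t_obs, event, z, beta_hat):
    """Breslow estimator of the cumulative baseline hazard: at each
    observed event time, add 1 / sum_{j at risk} exp(beta_hat . z_j)."""
    order = np.argsort(t_obs)
    risk = np.exp(z[order] @ beta_hat)
    denom = np.cumsum(risk[::-1])[::-1]          # risk-set sums, largest risk set first
    jumps = np.where(event[order], 1.0 / denom, 0.0)
    return t_obs[order], np.cumsum(jumps)

# toy usage on arbitrary made-up data; with beta_hat = 0 this reduces
# to the Nelson-Aalen estimator
rng = np.random.default_rng(1)
n, p = 50, 3
z = rng.standard_normal((n, p))
times = rng.exponential(size=n)
events = rng.random(n) < 0.7
ts, H0 = breslow(times, events, z, np.zeros(p))
```

In the figures, beta_hat would be the ML estimator of the association parameters, so the overfitting bias of beta_hat propagates into the estimated cumulative hazard.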
C.2. Simulation tests for the proposed decontamination algorithm
In figure C4 we compare the Breslow estimator (solid black lines) with its decontaminated version (blue solid lines), when ζ = 0.3 at (top row) and when ζ = 0.4 at (bottom row). As a reference we also plot the diagonal (solid red), which corresponds to perfect estimation. We also show in figure C5 the histograms of the estimator Sn for S, as obtained via the decontamination algorithm, together with and . The true value S = 1 is plotted as a dashed vertical line in black. Finally we compare the histograms of the ML estimators and against their corrected values and , with , and we superimpose the normal density with mean zero and variance (the prediction of the RS theory). Since here , the true values are and .
In all cases we see that the proposed algorithm succeeds in correcting for the overfitting bias: in figure C4 the blue lines are distributed around the (red) diagonal, and in figure C5 the modes of the histograms of and reproduce the true values and .
In order for our algorithm to converge, it is necessary that the ML estimator actually exists. Hence we expect the algorithm to converge when is sufficiently far from the critical value that marks the censoring equivalent of the phase transition derived in [11]. In a finite-size simulation that transition is not sharp, so there will be data instances even below the phase transition line for which the ML estimator does not exist. The (future) inclusion of a ridge penalization should cure these pathologies, which are due to the use of ML regression rather than to the algorithm itself. Nevertheless, correctly estimating the hazard rate of the event under study will always be more difficult when only a few primary events are observed in a sample. This implies that any algorithm must inevitably become less reliable as decreases toward zero.
Footnotes
- 4
This will be trivially true if the covariate distribution is itself Gaussian, but also if the conditions for the central limit theorem apply (i.e. if the components of z are not too strongly correlated, and the components of β are not too dissimilar in their scaling with n; see [19]).
- 5
- 6