Replica analysis of overfitting in regression models for time to event data: the impact of censoring
Journal of Physics A: Mathematical and Theoretical (IF 2.1), Pub Date: 2024-03-11, DOI: 10.1088/1751-8121/ad2e40
E Massa, A Mozeika, A C C Coolen

We use statistical mechanics techniques, viz. the replica method, to model the effect of censoring on overfitting in Cox’s proportional hazards model, the dominant regression method for time-to-event data. In the overfitting regime, Maximum Likelihood (ML) parameter estimators are known to be biased even at small values of the ratio of the number of covariates to the number of samples. Previous overfitting analyses avoided censoring for mathematical convenience, but including it is vital to make any theory applicable to real-world medical data, where censoring is ubiquitous. Upon constructing efficient algorithms for solving the new (and more complex) Replica Symmetric (RS) equations and comparing the solutions with numerical simulation data, we find excellent agreement, even for large censoring rates. We then address the practical problem of using the theory to correct the biased ML estimators without knowledge of the data-generating distribution. This is achieved via a novel numerical algorithm that self-consistently approximates all relevant parameters of the data-generating distribution while simultaneously solving the RS equations. We investigate numerically the statistics of the corrected estimators, and show that the proposed new algorithm indeed succeeds in removing the bias of the ML estimators, for both the association parameters and the cumulative hazard.
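
To make concrete how censoring enters Cox's model, the following is a minimal Python sketch, not the authors' code and not the replica calculation itself. It simulates right-censored data and evaluates the standard Cox negative log partial likelihood. The Gaussian covariates, the unit-rate exponential base hazard, the exponential censoring times, and the names `simulate_censored` and `cox_neg_log_partial_likelihood` are all illustrative assumptions that do not come from the paper.

```python
import numpy as np


def simulate_censored(n, p, beta, censor_scale, rng):
    """Simulate right-censored proportional-hazards data (illustrative assumptions).

    Assumed here, not taken from the paper: i.i.d. standard Gaussian
    covariates, a constant base hazard of 1, and exponential censoring times.
    """
    X = rng.standard_normal((n, p))
    # Hazard of sample i is exp(x_i . beta), so event times are exponential.
    t_event = rng.exponential(scale=1.0, size=n) / np.exp(X @ beta)
    t_cens = rng.exponential(scale=censor_scale, size=n)
    time = np.minimum(t_event, t_cens)   # what is actually observed
    event = t_event <= t_cens            # True = event seen, False = censored
    return X, time, event


def cox_neg_log_partial_likelihood(beta, X, time, event):
    """Negative log partial likelihood of Cox's model (no tied times assumed)."""
    order = np.argsort(-time)            # descending time
    eta = X[order] @ beta
    ev = event[order]
    # With descending times, the risk set of sample k is samples 0..k,
    # so a running log-sum-exp gives log sum_{j in risk set} exp(eta_j).
    log_risk = np.logaddexp.accumulate(eta)
    # Censored samples contribute only through the risk sets of others.
    return -np.sum((eta - log_risk)[ev])


rng = np.random.default_rng(0)
n, p = 400, 40                           # ratio p/n = 0.1
beta_true = np.full(p, 1.0 / np.sqrt(p))
X, time, event = simulate_censored(n, p, beta_true, censor_scale=1.0, rng=rng)
print("fraction censored:", 1.0 - event.mean())
print("loss at true beta:", cox_neg_log_partial_likelihood(beta_true, X, time, event))
```

Minimising this loss over `beta` gives the ML estimator whose bias at finite covariate-to-sample ratio the abstract describes; the sketch stops short of the fit and of any correction step.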

Updated: 2024-03-11