1 Introduction

Personalized medicine has become increasingly important in contemporary clinical practice, as treatments tailored to individual patients hold promise for improving the effectiveness of interventions (Hamburg and Collins 2010; Chin 2011). However, in diseases such as cancer, substantial heterogeneity exists among patients, affecting disease progression and responses to specific treatments. Consequently, identifying cancer subgroups and diagnostic disparities can be immensely valuable in tailoring optimized therapies for each patient, ultimately improving healthcare outcomes, including survival rates. To achieve this goal, the discovery of both prognostic markers (primary biomarkers) and predictive biomarkers (biomarker-treatment interactions) is crucial in cancer drug development and clinical practice. These biomarkers have the potential to personalize therapies for individual patients and enhance treatment effectiveness.

Recent advances in biotechnology have led to the generation of vast amounts of complex biological and molecular data. Modern high-throughput technologies can simultaneously measure the expression levels of thousands of genes. Databases such as the Gene Expression Omnibus (GEO) and ArrayExpress provide extensive resources for cancer genetic research (Barrett 2010). However, a common challenge in these datasets is that the number of genes (p) is often comparable to or even greater than the number of samples (n). The situation becomes more complex when researchers aim to identify both prognostic biomarkers (genes) and predictive biomarkers (gene-treatment interactions), which doubles the dimensionality of the problem. In such cases, it is crucial to assess the significance of each variable using p values and to correct for multiple testing, especially in clinical applications. Of note, controlling the false positive rate, specifically the family-wise error rate (FWER), is a paramount consideration. False positives can lead to erroneous conclusions, wasted resources, and the diversion of research efforts toward unproductive avenues. Hence, our commitment to controlling false positives is rooted in the need to uphold the integrity of scientific and medical research. The central focus of this paper is the identification of biomarker signatures, encompassing both prognostic and predictive biomarkers, through the assignment of valid p values within the survival framework, all while effectively controlling the FWER.

Over the past two decades, a multitude of regularization techniques have emerged to facilitate feature selection and yield sparse parameter estimates for high-dimensional data. One prominent method is the least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996), which furnishes sparse estimates through an \(L_1\) penalty with a suitably tuned shrinkage parameter in linear models. This work has led to the development of various extensions (Fan and Li 2001; Zou 2006; Ghosh 2007; Yuan and Lin 2006; Wang and Leng 2008). To deal with ultra-high-dimensional data, Fan and Lv (2008a) introduced a correlation screening technique termed Sure Independence Screening (SIS). This method effectively reduces the dimensionality from an extremely large scale to a more manageable one, making it easier to apply LASSO for variable selection; the combination is referred to as SIS-Lasso. These techniques, tailored for variable selection and screening, have been extended to survival outcomes within the context of Cox proportional hazards (PH) models (Tibshirani 1997; Fan and Li 2002; Zhang and Lu 2007; Simon 2011; He 2019; Fan 2010; Zhao and Li 2012).

While regularization and screening techniques are effective in producing sparse and interpretable estimates, they struggle to maintain control over type I error rates, primarily because p values obtained from penalized likelihoods are not valid. Recent advances have addressed these challenges in high-dimensional linear models. For example, Wasserman and Roeder (2009) introduced a “screen and clean” procedure, which divides the data into a training set for variable screening (using LASSO) and a testing set for significance testing. Several other methods exhibit screening properties, such as the adaptive LASSO (Zou 2006) and the smoothly clipped absolute deviation (SCAD; Fan and Li 2001). Later, Meinshausen (2009) enhanced the approach by repeating the split-and-fit procedure, computing p values for each split, and aggregating them into a collective p value for the purpose of controlling the FWER. Related works include Meinshausen and Yu (2009); Bühlmann (2013); Zhang and Zhang (2014); Dezeure (2015). More recently, Zuo et al. (2021) introduced a variable selection approach termed “penalized regression with second-generation p values” (ProSGPV), which combines an \(L_1\) penalization scheme with second-generation p values (SGPV) to identify variables suitable for inclusion in the model. While these methods have proven effective in generalized linear models for high-dimensional data, their application within the survival framework is an emerging area that requires further development and exploration.

In this paper, we present a novel method for detecting biomarker signatures that considers both main effects and biomarker-by-treatment interaction effects within the survival framework while effectively controlling the FWER. To achieve this, we extend the concept introduced by Meinshausen (2009) to the Cox survival model, employing a three-stage process that identifies prognostic and predictive biomarkers while assigning valid p values. Our contributions include: (1) application to high-dimensional datasets using a penalized technique for variable selection, facilitating the identification of biomarker signatures that encompass both prognostic and predictive biomarkers; (2) addressing the challenge of multiple testing by obtaining p values from randomized multi-split data, ensuring robust control of the FWER; (3) a user-friendly R implementation of our algorithm, available at https://github.com/aliviawu/Biomarker-Paper/tree/main. Additionally, we provide comprehensive theoretical properties in the Supplementary Materials. By integrating main effects and biomarker-by-treatment interactions within the survival framework and ensuring strict control of the FWER, our approach makes a valuable contribution to biomarker identification and statistical inference in high-dimensional data scenarios.

The remainder of this paper is organized as follows. In Sect. 2, we provide a detailed description of our proposed three-stage approach within the Cox PH model framework, which considers main effects and biomarker-by-treatment interaction effects while effectively controlling the FWER. Section 3 presents the results of our simulation studies, including comparisons with existing methods. In Sect. 4, we apply our proposed method to multiple real-world datasets. Finally, in Sect. 5, we summarize our findings and discuss potential directions for future research.

2 Materials and methods

To achieve FWER control in biomarker selection, we propose a three-stage approach that builds on the penalized-likelihood approach using adaptive gLASSO (Wang and Leng 2008). Our approach extends the idea of p-value adjustment developed by Meinshausen (2009) to the Cox proportional hazards framework. Additionally, we provide a description of several existing alternatives for comparison purposes.

2.1 Notation

Consider a study with n subjects and p potential biomarkers. Let \(T_i\) denote the event time for the \(i^{th}\) subject and \(C_i\) denote the censoring time. The follow-up time is defined as \(Y_i = \min (T_i, C_i)\), and the event indicator is \(\delta _i = I(T_i \le C_i)\), where \(I(\cdot )\) is an indicator function. We focus on right-censored data. The p-dimensional candidate biomarkers are denoted by \({\textbf {X}}_i = (X_{i1}, X_{i2}, \ldots , X_{ip})^{T}\). The treatment status for the \(i^{th}\) patient is denoted by \(H_i\). We assume that \(T_i\) and \(C_i\) are conditionally independent given \({\textbf {X}}_i\). The hazard function of the \(i^{th}\) patient under the Cox PH model can be expressed as:

$$\begin{aligned} h(t|{\textbf {X}}_i,H_i)=h_0(t)\exp (\alpha _0H_i+\alpha _1X_{i1}+\dots +\alpha _pX_{ip}+\gamma _1X_{i1}H_i+\dots +\gamma _p X_{ip}H_i), \end{aligned}$$
(1)

with the total number of regression parameters equal to \(2p+1\). We denote the total number of main (prognostic) biomarker effects and their interactions with treatment (predictive effects) by \(\widetilde{p}=2p\). Our objective is to identify prognostic biomarkers (\(\alpha _j \ne 0\), \(j=1,\ldots ,p\)) and predictive biomarkers (\(\gamma _j \ne 0\), \(j=1,\ldots ,p\)) in situations where \(n \ll \widetilde{p}\).
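For concreteness, the linear predictor in Eq. (1) can be written as \(Z_i^T\varvec{\theta }\) with \(Z_i=(H_i, X_{i1},\dots ,X_{ip}, X_{i1}H_i,\dots ,X_{ip}H_i)\). A minimal Python sketch of this design-matrix construction follows (the function name is ours, for illustration only; the paper's accompanying implementation is in R):

```python
import numpy as np

def build_design_matrix(X, H):
    """Assemble the (2p+1)-column design matrix implied by Eq. (1):
    treatment H, main effects X_1..X_p, and interactions X_j * H.
    X: (n, p) biomarker matrix; H: (n,) 0/1 treatment indicator."""
    H = np.asarray(H, dtype=float).reshape(-1, 1)
    X = np.asarray(X, dtype=float)
    # column order matches theta = (alpha_0, alpha_1..alpha_p, gamma_1..gamma_p)
    return np.hstack([H, X, X * H])
```

The column ordering (treatment, main effects, interactions) mirrors the parameter vector \((\alpha _0,\alpha _1,\dots ,\alpha _p,\gamma _1,\dots ,\gamma _p)\).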

2.2 Adaptive gLASSO for variable selection

Under the Cox PH framework, let D denote the indices of the subjects who experienced the event of interest, and for each \(r \in D\), the observed failure time is denoted by \(t_r\). The set \(R_r=\{i: Y_i \ge t_r\}\) includes the indices of the individuals who are at risk of experiencing the event at time \(t_r\). Let \(\varvec{{\theta }}=\{\alpha _0,\alpha _j,\gamma _j,j=1,\dots ,p\}\) be a vector of parameters with dimensionality \(\widetilde{p}+1\), and let the semi-parametric partial likelihood function for parameter estimation be denoted by

$$\begin{aligned} L(\varvec{\theta })=\prod _{r \in D}\frac{ h(t_r|{\textbf {X}}_r,H_r)}{\sum _{i \in R_r} h(t_r|{\textbf {X}}_i,H_i)}. \end{aligned}$$
(2)

Let \(\ell (\varvec{\theta })\) denote the logarithm of the partial likelihood function, \(\log L(\varvec{\theta })\). The estimates of the parameters \(\varvec{\theta }\) can be obtained by maximizing \(\ell (\varvec{\theta })\). However, when the number of parameters \(\widetilde{p}\) exceeds the sample size n (\(n\ll \widetilde{p}\)), maximizing the semi-parametric partial likelihood in Eq. (2) is no longer feasible, since the maximizer need not exist or be unique.
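In the absence of tied event times, the log partial likelihood reduces to \(\ell (\varvec{\theta })=\sum _{r\in D}\{\eta _r-\log \sum _{i\in R_r}\exp (\eta _i)\}\), where \(\eta _i\) is the linear predictor of subject i. An illustrative Python sketch (our own, assuming no ties; a Breslow or Efron correction would be needed otherwise):

```python
import numpy as np

def cox_log_partial_likelihood(eta, time, event):
    """Log partial likelihood of Eq. (2), assuming no tied event times.
    eta: (n,) linear predictors Z_i^T theta; time: follow-up times Y_i;
    event: (n,) 0/1 indicators delta_i."""
    eta = np.asarray(eta, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    ll = 0.0
    for r in np.where(event)[0]:
        risk = time >= time[r]   # risk set R_r = {i : Y_i >= t_r}
        ll += eta[r] - np.log(np.sum(np.exp(eta[risk])))
    return ll
```

Note that the baseline hazard \(h_0(t)\) cancels from each ratio, so only the linear predictors enter.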

To select prognostic and predictive biomarkers with the oracle property, a weighted adaptive gLASSO penalty can be employed. The penalized objective that the adaptive gLASSO maximizes is:

$$\begin{aligned} \ell ^{aGL}(\varvec{\theta })=\ell (\varvec{\theta })-\lambda _1\left( \sum _{g=1}^{p}\hat{\omega }_g\Vert \varvec{\theta }_{G_g}\Vert _1\right) , \end{aligned}$$
(3)

where \(\Vert \cdot \Vert _1\) denotes the \(L_1\) norm, \(\lambda _1\) is the shrinkage parameter, \(G_g\) is the index set of the g-th pair of prognostic and predictive effects (\(g=1,\dots ,p\)), and \(\varvec{\theta }_{G_g}=\{\alpha _g,\gamma _g\}\) is the corresponding pair of coefficients. Additionally, \(\hat{\omega}_g\) is an adaptive weight obtained by applying \(L_2\) regularization to each coefficient of \(\alpha\) and \(\gamma\), defined as follows:

$$\begin{aligned} \hat{\omega }_g=\frac{1}{\sqrt{(\hat{\alpha }_g^{ini})^2 + (\hat{\gamma }_g^{ini})^2}}, \end{aligned}$$
(4)

where \(\hat{\alpha }_g^{ini}\) and \(\hat{\gamma }_g^{ini}\) are initial estimators obtained from a ridge regression (Hoerl and Kennard 1970), \(g=1,\dots ,p\).
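The weights of Eq. (4) can be computed directly from the ridge initial estimates; a small Python sketch (function name ours):

```python
import numpy as np

def adaptive_group_weights(alpha_init, gamma_init):
    """Adaptive gLASSO weights of Eq. (4):
    w_g = 1 / sqrt(alpha_g_init^2 + gamma_g_init^2),
    where the initial estimates come from a ridge fit. Groups with larger
    initial estimates receive smaller weights and hence lighter penalties,
    following the adaptive-lasso principle."""
    a = np.asarray(alpha_init, dtype=float)
    c = np.asarray(gamma_init, dtype=float)
    return 1.0 / np.sqrt(a ** 2 + c ** 2)
```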

Subsequently, the group-based estimates can be obtained by maximizing \(\ell ^{aGL}(\varvec{\theta })\). Biomarkers with non-zero coefficients (\(\alpha _g\ne 0\) or \(\gamma _g\ne 0\), where \(g=1,\dots ,p\)) will be selected. However, it should be noted that the FWER is not well controlled by this method.

2.3 The proposed three-stage approach

To control the FWER, we propose a three-stage strategy based on the penalized likelihood approach. This strategy extends the concept of p-value adjustment introduced by Meinshausen (2009) to the Cox proportional hazards model. The algorithm for our proposed three-stage approach is described in detail below.

2.3.1 Stage I: conduct feature screening and obtain (nonaggregated) p values

In the first stage, we perform feature screening to reduce the dimensionality from \(\widetilde{p}\) to a more manageable scale d, with \(d<n\). The retained d variables (i.e., d/2 pairs of prognostic and predictive biomarkers) then undergo penalized variable selection (adaptive gLASSO) as well as p-value adjustment. Both feature screening and p-value adjustment are carried out within a repeated random-splitting procedure.

For \(b=1,\dots ,B\):

(i) Randomly split the data into a “training” set and a “testing” set with allocation rate m (e.g., \(m=0.5\) gives two sets of equal size). Of note, the choice of m depends on various factors, including the dataset sample size, the study’s objectives, and other relevant considerations;

(ii) Perform feature screening via joint hypothesis tests using the training data. To identify significant predictors among the main biomarkers and their interactions with treatment, the likelihood ratio test (LRT) can be utilized. For each \(j=1,\ldots ,p\), the log-likelihoods under the null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_A\)) are compared:

    $$\begin{aligned} H_0: \alpha _j=\gamma _j=0 \quad \text {vs.}\quad H_A: \text {at least one of } \alpha _j \text { and } \gamma _j \text { is nonzero}. \end{aligned}$$
    (5)

    The LRT statistic is the difference between the maximized partial log-likelihoods of two nested models: under \(H_0\), the model includes only the treatment variable (H); under \(H_A\), it additionally includes the main biomarker (\(X_j\)) and its interaction with treatment (\(X_jH\)). The statistic is \(LRT=-2(\ell _0-\ell _A)\sim \chi ^2_2\), where \(\ell _0\) and \(\ell _A\) denote the two maximized partial log-likelihoods. To screen the predictors efficiently, existing screening procedures often require the specification of a threshold; here, we adopt the conventional threshold of \([n/\log n]\) (Fan and Lv 2008b). Specifically, the screening process retains the top \([n/\log n]\) pairs ranked by the chi-square statistics of the joint hypothesis tests for each biomarker and its interaction with treatment.

(iii) Apply the adaptive gLASSO to the training set, using the pre-screened pairs of main biomarkers and their interactions with treatment, to select pairs. The selected set is denoted by \(\widetilde{S}^{(b)}\);

(iv) Obtain p values for each selected pair in \(\widetilde{S}^{(b)}\) by performing an LRT on the testing data only, and compute the corresponding chi-square-based p values, \(\widetilde{p}_j^{(b)}\), for \(j\in \widetilde{S}^{(b)}\). For unselected pairs, set the corresponding p values to 1.

(v) Adjust the p values \(\widetilde{p}_j^{(b)}\) with a Bonferroni-type correction, yielding \(p_j^{(b)}\). For \(j=1,2,\ldots , p\), we have

    $$\begin{aligned} p_j^{(b)}=\min \left\{ \frac{1}{2}\widetilde{p}_j^{(b)}|\widetilde{S}^{(b)}|,1\right\} , \end{aligned}$$
    (6)

    where \(|\widetilde{S}^{(b)}|\) denotes the total number of selected pairs in \(\widetilde{S}^{(b)}\).
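The screening threshold of step (ii) and the correction of Eq. (6) in step (v) can be sketched as follows, assuming the joint LRT statistics have already been computed (function names are ours; the paper's accompanying implementation is in R):

```python
import numpy as np
from scipy.stats import chi2

def lrt_pvalue(stat):
    """P value of the joint LRT of Eq. (5): stat ~ chi-square with 2 df."""
    return chi2.sf(stat, df=2)

def screen_pairs(lrt_stats, n):
    """Step (ii): keep the top [n / log n] biomarker pairs, ranked by
    their joint LRT chi-square statistics on the training data."""
    d = int(n / np.log(n))
    order = np.argsort(lrt_stats)[::-1]   # largest statistics first
    return np.sort(order[:d])             # indices of retained pairs

def bonferroni_pair_pvalues(raw_p, selected, p_total):
    """Step (v), Eq. (6): p_j = min(0.5 * raw_p_j * |S|, 1) for selected
    pairs; unselected pairs get p_j = 1."""
    out = np.ones(p_total)
    out[selected] = np.minimum(0.5 * raw_p[selected] * len(selected), 1.0)
    return out
```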

2.3.2 Stage II: obtain aggregated p values

In the first stage, we obtain a total of B p values for each pair of prognostic and predictive biomarkers. To aggregate these p values, we use the quantile-based rule introduced by Meinshausen (2009).

Specifically, we define

$$\begin{aligned} Q_j(\eta )= \min \left\{ 1,q_{\eta }\left( \left\{ p_j^{(b)}/\eta ; b=1,\ldots ,B\right\} \right) \right\} , \end{aligned}$$

where \(q_{\eta }\) is the empirical \(\eta\)-quantile of the set \(\left\{ p_j^{(b)}/\eta ; b=1,\ldots ,B\right\}\). The aggregated p value, denoted \(p_j^*\), is defined as

$$\begin{aligned} p_j^*=\min \left\{ 1, \left( 1-\log \eta _{\min }\right) \inf _{\eta \in (\eta _{\min },1)}Q_j(\eta )\right\} , \end{aligned}$$
(7)

where \(\eta _{\min }\in (0,1)\); a recommended choice is 0.05, as suggested by Meinshausen (2009). Note that \(H_{0j}:\alpha _j=\gamma _j=0\) is rejected if \(p_j^*\le \alpha\), where \(\alpha\) is the pre-specified FWER level to be preserved (\(j=1,\ldots , p\)).
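The aggregation rule of Eq. (7) can be sketched in Python by evaluating \(Q_j(\eta )\) on a grid over \((\eta _{\min },1)\) (the grid size and the function name are our illustrative choices):

```python
import numpy as np

def aggregate_pvalues(p_b, eta_min=0.05):
    """Stage II aggregation, Eq. (7): p* = min{1, (1 - log eta_min) *
    inf_eta Q(eta)}, where Q(eta) is the capped empirical eta-quantile of
    {p^(b) / eta} across the B splits (Meinshausen 2009)."""
    p_b = np.asarray(p_b, dtype=float)
    etas = np.linspace(eta_min, 1.0, 200)   # grid approximating (eta_min, 1)
    q = np.array([min(1.0, np.quantile(p_b / e, e)) for e in etas])
    return min(1.0, (1.0 - np.log(eta_min)) * q.min())
```

For instance, if a pair attains \(p_j^{(b)}=0.01\) in every split, the aggregated p value is \((1-\log 0.05)\times 0.01\approx 0.04\), still below the usual 0.05 level.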

2.3.3 Stage III: biomarker signature identification

In the last stage, we perform further regression analysis on the pairs of prognostic and predictive biomarkers selected in Stage II. We fit a Cox PH model using the entire dataset by maximizing the partial likelihood. We identify biomarkers with p-values less than the pre-specified significance level \(\alpha\). The significant biomarkers, along with their interactions with treatment, form the biomarker signature for predicting patient outcomes in response to treatment.

Identifying prognostic and predictive biomarkers related to disease progression and treatment response in a survival framework is crucial for personalized medicine. However, existing penalized approaches for high-dimensional data often lack control over the FWER. Our proposed three-stage strategy, which combines penalized likelihood techniques with adjusted p values obtained through random data splitting, provides a reliable and interpretable approach for identifying biomarker signatures with strong prognostic and predictive power, while effectively controlling the FWER.

In Stage I, we use a repeated random-splitting procedure: we reduce the dimensionality to a moderate scale and select active pairs of prognostic and predictive biomarkers via the adaptive group LASSO on the training data, and then accumulate corrected p values \(p_j^{(b)}\) for each active pair via a likelihood ratio test on the testing data. In Stage II, we aggregate these non-aggregated p values into \(p_j^*\) using an adaptive empirical quantile function and select pairs based on \(p_j^*\). Finally, in Stage III, we identify the final biomarkers by fitting the pairs selected in Stage II with a Cox PH model on the entire dataset.

2.4 Other existing methods

To empirically evaluate the performance of our proposed approach, we compare it with several existing methods. In the literature, there are various methods for selecting biomarker main effects and biomarker-by-treatment interactions. Ternès (2016) comprehensively summarized possible approaches for high-dimensional Cox PH regression, comparing them through simulations with different numbers of biomarkers and varying strengths of main and interaction effects, and evaluating their selection abilities in null scenarios (i.e., no interactions with treatment) and alternative scenarios (i.e., at least one interaction with treatment). In the null scenarios, group LASSO and gradient boosting performed poorly in the presence of non-null main effects, but both performed well in alternative scenarios with strong interactions. Adaptive LASSO with grouped weights was found to be too conservative. Principal component analysis (PCA) combined with LASSO performed moderately. Both LASSO and adaptive LASSO performed well, although LASSO was relatively poor in the presence of only non-null main effects. Below, we describe the competing methods considered in our comparison.

2.4.1 LASSO

In the Cox PH framework, variable selection is typically performed by maximizing the log partial likelihood subject to an \(L_1\) penalty on the parameters, as proposed by Tibshirani (1997). We apply the LASSO penalty to both the main effects \(\alpha _j\) and the treatment interaction effects \(\gamma _j\) (\(j=1,\ldots ,p\)) in Eq. (1), enabling us to identify both prognostic and predictive biomarkers. With the semi-parametric partial likelihood function defined in Eq. (2), let \(\varvec{\theta }=\{\alpha _0, \alpha _j, \gamma _j\}_{j=1}^{p}\), where \(\alpha _0\) denotes the treatment effect, \(\alpha _j\) the \(j^{th}\) prognostic biomarker effect, and \(\gamma _j\) the \(j^{th}\) predictive biomarker effect. The penalized partial log-likelihood is

$$\begin{aligned} \ell ^{L}(\varvec{\theta })=\ell (\varvec{\theta })-\lambda \left( \sum _{j=1}^p|{\alpha _j}|+ \sum _{j=1}^p|{\gamma _j}|\right) , \end{aligned}$$
(8)

where prognostic and predictive biomarkers are equally penalized with the shrinkage parameter \(\lambda\), which is chosen by fivefold cross-validation. The LASSO-based coefficient estimators are obtained by maximizing \(\ell ^{L}(\varvec{\theta })\), and biomarkers with nonzero estimated coefficients (\(\alpha _j\ne 0\) or \(\gamma _j\ne 0\)) are selected as prognostic or predictive, respectively.

2.4.2 Adaptive LASSO with grouped weights

Adaptive LASSO is a penalization method that assigns different penalty weights to the main effects and interaction effects, with larger coefficients penalized less than smaller ones (Zou 2006; Zhang and Lu 2007). In the initial stage, this method estimates the weights by fitting a model that includes the treatment and all biomarker main effects and treatment interactions under a ridge penalty (Hoerl and Kennard 1970). Let \(\alpha _{j}\) and \(\gamma _{j}\) (\(j=1,...,p\)) be the main effects and interaction effects of the biomarkers, respectively. The ridge penalty term, with shrinkage parameter \(\lambda _2\) controlling the magnitude of \(\alpha _{j}\) and \(\gamma _{j}\), is

$$\begin{aligned} \lambda _2(\sum _{j=1}^p\alpha _{j}^2+\sum _{j=1}^p\gamma _{j}^2). \end{aligned}$$
(9)

In the second stage, a common grouped weight is assigned to all \(\alpha _j\) and another to all \(\gamma _j\), computed as the averages of the absolute preliminary estimates: \(\alpha _R=\frac{1}{p}\sum ^{p}_{j=1}|\alpha _j|\), \(\gamma _R=\frac{1}{p}\sum ^p_{j=1}|\gamma _j|\). The penalized log-likelihood with \(\varvec{\theta }=\{\alpha _0, \alpha _j, \gamma _j\}_{j=1}^p\) is

$$\begin{aligned} \ell ^{aL}(\varvec{\theta })=\ell (\varvec{\theta })-\lambda \left( \frac{1}{\alpha _R}\sum _{j=1}^p|{\alpha _j}|+ \frac{1}{\gamma _R}\sum _{j=1}^p|{\gamma _j}|\right) . \end{aligned}$$
(10)
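The grouped weights \(\alpha _R\) and \(\gamma _R\) can be computed from the ridge preliminary estimates as in the following minimal sketch (function name ours):

```python
import numpy as np

def grouped_weights(alpha_ridge, gamma_ridge):
    """Grouped weights of Sect. 2.4.2: one common weight for all main
    effects and one for all interactions, each the mean absolute ridge
    estimate. They enter Eq. (10) as the multipliers 1/alpha_R and
    1/gamma_R, so stronger average signal means a lighter penalty."""
    alpha_R = np.mean(np.abs(alpha_ridge))
    gamma_R = np.mean(np.abs(gamma_ridge))
    return alpha_R, gamma_R
```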

2.4.3 Gradient boosting

Boosting algorithms are designed to enhance prediction accuracy by training a sequence of weak models, each correcting the errors of its predecessors. In high-dimensional settings, the process starts from the null model and updates a single coefficient at each step. This iterative process stops when the model achieves a balance between bias and variance. Gradient boosting reformulates this approach as a numerical optimization problem, where the objective is to minimize the model’s loss function by adding weak learners using gradient descent (Friedman 2001). Bühlmann and Yu (2003) proposed \(L_{2}\)-Boost with a novel component-wise smoothing spline learner, providing an effective procedure for carrying out boosting for high-dimensional regression problems with continuous predictors. In our study, we first estimate the treatment effect preliminarily and then fix it as an offset.

2.4.4 PCA+LASSO

In the first stage, we use PCA (Hastie 2017) to reduce the dimensionality of the main effect matrix. The second stage applies the LASSO penalty to the interactions, which allows for the identification of predictive biomarkers based on the first K principal components of the main effects. In the final stage, we fit a Cox PH model by maximizing the partial likelihood based on all biomarkers and selected biomarker-treatment interactions. We then select prognostic and predictive biomarkers with p-values less than \(\alpha\).

3 Simulation studies

3.1 Simulation setup

We make the following assumptions regarding the true hazard function for the i-th patient:

$$\begin{aligned} h(t|{\textbf {X}}_i)=h_0(t )\exp \left( \alpha _0 H_i+\alpha _4X_{i4}+\gamma _4X_{i4}H_i+\alpha _5X_{i5}\right). \end{aligned}$$
(11)

Here, \(\alpha _0\) denotes the effect of treatment H, \(\alpha _4\) and \(\alpha _5\) denote the prognostic effects of biomarkers \(X_4\) and \(X_5\), respectively, and \(\gamma _4\) represents the predictive effect of biomarker \(X_4\). We investigate four scenarios, each characterized by distinct values of (\(\alpha _0\), \(\alpha _4, \gamma _4, \alpha _5\)): Scenario 1 (S1): (1, 1, 1, 1); Scenario 2 (S2): (1, 0.5, 1, 1); Scenario 3 (S3): (1, 0.5, 1.5, 1); Scenario 4 (S4): (1, 0.5, 2.5, 1.5). For each participant, we generate a treatment indicator from a Binom(1, 0.5) distribution. The expression level of the j-th biomarker, \(X_{ij}\), follows a standard normal distribution. The pairwise correlation between prognostic biomarkers is set to \(\rho =0.15\) to mimic our real data applications. The baseline hazard function \(h_0(t)\) is set to a constant, either \(\frac{1}{100}\) or \(\frac{1}{270}\), resulting in mean censoring rates of around \(40\%\) or \(60\%\), respectively. Censoring times \(C_i\) are generated from a uniform distribution Unif(0, 150). To generate survival times, we employ the method introduced by Bender (2005): we draw U from a Unif(0, 1) distribution and apply the transformation:

$$\begin{aligned} t_i=-\log (U)/ h(t|{\textbf {X}}_i). \end{aligned}$$

The observed time to event is then computed as \(\min (t_i, C_i)\), along with the event status \(I(t_i\le C_i)\). To provide a comprehensive evaluation, we generate 1,000 Monte Carlo datasets for each scenario. Furthermore, we consider sample sizes of 300, 500, or 1000, with the allocation rate \(m=0.5\) and the total number of candidate prognostic and predictive biomarkers \(\widetilde{p}\) set to 1000, 2000, or 4000.
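Under a constant baseline hazard, the data-generating process above can be sketched in Python (the parameter defaults, the function name, and the shared-factor construction used to induce equicorrelated biomarkers are our illustrative choices):

```python
import numpy as np

def simulate_survival(n, h0=1/100, alpha0=1.0, a4=1.0, g4=1.0, a5=1.0,
                      rho=0.15, p=5, cmax=150.0, rng=None):
    """Generate one dataset under Eq. (11) with constant baseline hazard
    h0, using the inversion method of Bender (2005): for exponential
    event times, T = -log(U) / h_i. Censoring is C ~ Unif(0, cmax)."""
    rng = np.random.default_rng(rng)
    H = rng.binomial(1, 0.5, n)                  # treatment indicator
    # equicorrelated standard-normal biomarkers via a shared latent factor
    z = rng.standard_normal((n, 1))
    X = np.sqrt(rho) * z + np.sqrt(1 - rho) * rng.standard_normal((n, p))
    # linear predictor of Eq. (11): X_4 and X_5 are columns 3 and 4
    lin = alpha0 * H + a4 * X[:, 3] + g4 * X[:, 3] * H + a5 * X[:, 4]
    T = -np.log(rng.uniform(size=n)) / (h0 * np.exp(lin))
    C = rng.uniform(0, cmax, n)
    return np.minimum(T, C), (T <= C).astype(int), X, H
```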

We evaluate four primary performance metrics: selection accuracy, mean squared error (MSE), relative bias of the regression coefficient estimates, and control of the FWER. Selection accuracy measures the percentage of the 1,000 replicates in which the true biomarker is selected. We estimate the MSE as the mean squared difference between the true and estimated parameters across the 1,000 replicates, and the relative bias as the mean difference divided by the true parameter value. FWER control measures the proportion of the 1,000 replicates in which at least one variable outside the three true signal variables is selected, with the FWER controlled at the nominal level of \(\alpha =0.05\). We investigate the impact of sample size, number of biomarkers, and censoring rate on these four metrics across the scenarios. Additionally, we compare the proposed method with five other methods, namely LASSO, gLASSO, adaptive LASSO, PCA+LASSO, and gradient boosting.
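The four metrics can be computed from a collection of replicates as in the following sketch (the array layout and function name are our assumptions):

```python
import numpy as np

def replicate_metrics(est, true_val, selected, true_idx):
    """Summaries over R simulation replicates.
    est: (R,) estimates of one coefficient; true_val: its true value;
    selected: (R, ptilde) boolean selection indicators per replicate;
    true_idx: column indices of the true signal variables."""
    est = np.asarray(est, dtype=float)
    mse = np.mean((est - true_val) ** 2)                  # mean squared error
    rel_bias = np.mean(est - true_val) / true_val         # relative bias
    false_mask = np.ones(selected.shape[1], dtype=bool)   # non-signal columns
    false_mask[list(true_idx)] = False
    # empirical FWER: fraction of replicates with >= 1 false selection
    fwer = np.mean(selected[:, false_mask].any(axis=1))
    return mse, rel_bias, fwer
```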

3.2 Simulation results

3.2.1 FWER control

Our proposed method effectively controls the FWER at the nominal level of 0.05 for selecting prognostic and predictive biomarkers, as demonstrated in Fig. 1. We evaluated the actual FWER across 1,000 replicates for the four scenarios with varying sample sizes (n), total numbers of biomarkers (\(\widetilde{p}\)), and censoring rates, except for the case \(\widetilde{p}=n=1000\), which does not represent a high-dimensional scenario. Our method maintains the FWER at approximately 0.05 in all four scenarios, particularly when the sample size is 1000. Although the FWERs were inflated for sample sizes of 500 or 800, we observed a return to 0.05 with a sample size of 1000. Furthermore, we conducted additional simulations using a different allocation rate, \(m=0.7\) (allocating 70% of the data for training and 30% for testing), and observed minimal differences in FWERs. For example, with sample sizes of \(n=300, 500\) in S1 (\(\widetilde{p}=2000\)), the FWERs were 0.062 and 0.048, respectively, and similar trends emerged across the other sample sizes (not shown due to space limits). Moreover, we considered a higher correlation coefficient, \(\rho =0.3\), for S1. Across the combinations \((n, \widetilde{p})\in \{(300,1000),(300,2000),(500,1000),(500,2000),(1000,2000),(1000,4000)\}\), we obtained satisfactory FWERs of 0.055, 0.056, 0.051, 0.045, 0.059, and 0.048, respectively. We also incorporated a non-randomized clinical trial with an 80% treatment proportion, mirroring our second data application. In Scenario 1 (\(\alpha _4=\alpha _5=\gamma _4=1\)), with sample sizes of \(n=300\) and 500 and \(\widetilde{p}=2000\), the obtained FWERs are 0.045 and 0.040, respectively.

3.2.2 Estimates of effects

The accurate estimation of prognostic and predictive biomarker effects is crucial for predicting hazard and survival rates. Boxplots of the estimates of \(\alpha _4\), \(\gamma _4\), and \(\alpha _5\) are presented in Fig. 2 for scenarios S1 and S4, both with a 60% censoring rate; additional scenarios are presented in Figures S5 and S6 in the Supplementary Materials. The average coefficient estimates of \(\alpha _4\) and \(\alpha _5\) from the 1,000 simulated datasets closely match the true effects of (1, 1) and (0.5, 1.5) for scenarios S1 and S4, respectively. As the sample size increases from 300 to 1000, the dispersion of the estimated \(\gamma _4\) gradually decreases, and the estimates center around the true effects. Moreover, the coefficient estimates for the 40% censoring rate exhibit similar trends in terms of deviation from the true values, as shown in Figure S5 in the Supplementary Materials.

The MSEs and biases of the estimates are presented in Table 2 and Table S1 in the Supplementary Materials, respectively. The results indicate that, with a censoring rate of 40%, the MSEs and biases of all estimates are generally smaller than those with a 60% censoring rate. Moreover, increasing the sample size reduces the MSEs and relative biases, consistent with our expectations. Although the MSEs of the interaction effect \(\gamma _4\) are relatively higher than those of the main effects (e.g., \(\alpha _4\) and \(\alpha _5\)), its bias decreases as the true interaction effect increases from 1 to 1.5. Additionally, the underlying true effect strength of the primary biomarkers influences the estimation of their interaction effects. For example, as the true value of \(\alpha _4\) decreases from 1 (S1) to 0.5 (S2), its bias decreases, while the bias of its interaction effect \(\gamma _4\) increases.

3.2.3 Selection accuracy

The results of the selection accuracy analysis are summarized in Table 1. The findings indicate that selection accuracy improves with larger sample sizes, lower censoring rates, and stronger biomarker effects. When the sample size is sufficiently large (i.e., >800), the true biomarker effects and censoring rate have minimal impact on selection accuracy. Both the biomarker and interaction term selections are influenced by the underlying true effect strength, and the accuracy of interaction term selection tends to increase as the main effect decreases: when \(\alpha _4\) decreases from 1 (S1) to 0.5 (S2), the selection accuracy of \(X_4\) decreases, but the accuracy of \(X_4H\) increases, particularly with small sample sizes and high censoring rates. Furthermore, sample size is a critical factor in determining selection accuracy; as the sample size increases from 300 to 500, the selection percentages for all biomarkers are consistently around 100%.

Table 1 Selection accuracy (%) of biomarker effect estimators for proposed three-stage method; four scenarios: \(S1(\alpha _4 = 1, \gamma _4=1, \alpha _5=1)\); \(S2(\alpha _4=0.5, \gamma _4=1, \alpha _5=1)\); \(S3(\alpha _4=0.5, \gamma _4=1.5, \alpha _5=1)\); \(S4(\alpha _4=0.5,\gamma _4=2.5,\alpha _5=1.5)\)

In conclusion, higher selection percentages are observed when the sample size is large, the censoring rate is low, and the true biomarker effect is strong. The proposed method is largely insensitive to the number of candidate biomarkers, as the selection accuracy remains stable as \(\widetilde{p}\) increases.
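The selection percentages reported in Table 1 are simple replicate frequencies: for each variable, the fraction of simulation replicates in which it enters the selected set. A minimal sketch in Python (our own illustration; the paper's simulations were run in R, and the function name is hypothetical):

```python
def selection_accuracy(selected_sets, variables):
    """For each variable of interest, the percentage of simulation
    replicates in which it was selected (hypothetical helper; the
    paper's own simulations were carried out in R)."""
    n_rep = len(selected_sets)
    return {v: 100.0 * sum(v in s for s in selected_sets) / n_rep
            for v in variables}

# Example: X4 selected in all 4 replicates, X4H in 3 of 4
runs = [{"X4", "X4H"}, {"X4"}, {"X4", "X4H"}, {"X4", "X4H"}]
acc = selection_accuracy(runs, ["X4", "X4H"])  # → {"X4": 100.0, "X4H": 75.0}
```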

Table 2 Mean square errors (MSEs) of biomarker effect estimators for three-stage method; four scenarios: \(S1(\alpha _4 = 1, \gamma _4=1, \alpha _5=1)\); \(S2(\alpha _4=0.5, \gamma _4=1, \alpha _5=1)\); \(S3(\alpha _4=0.5, \gamma _4=1.5, \alpha _5=1)\); \(S4(\alpha _4=0.5,\gamma _4=2.5,\alpha _5=1.5)\)

3.2.4 Method comparison

We compared our three-stage strategy with five alternative methods: LASSO, adaptive LASSO, gLASSO, PCA+LASSO, and gradient boosting. The comparison was based on 11 setups combining various sample sizes and numbers of biomarkers under a high censoring rate; additional scenarios are available in the Supplementary Materials. Because LASSO and adaptive LASSO performed similarly, the results for adaptive LASSO are omitted. As shown in Fig. 3, our proposed three-stage method effectively controls the FWER, whereas the other four methods fail to do so: their FWERs are close to 1 in all scenarios. These results are consistent with our expectations. Notably, LASSO, boosting, PCA+LASSO, and adaptive LASSO neither account for the correlation between a primary biomarker and its interaction with treatment nor impose the hierarchy constraint, despite employing various regularization strategies for variable selection. Group LASSO does address these concerns through its regularization approach; however, the p values obtained for the selected variables do not control the FWER. In contrast, our method is designed to provide valid asymptotic control over variable inclusion at the nominal level, achieved by integrating the multiple sample-splitting approach and incorporating features such as correlation and hierarchy (e.g., group lasso) into the regularization after initial feature screening.
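The FWER curves in Fig. 3 follow the usual empirical definition: a replicate commits a family-wise error if at least one truly inactive variable is declared significant at the nominal level. A minimal sketch of that computation (data layout and function name are our own; the paper's simulations were run in R):

```python
def empirical_fwer(pval_matrix, null_idx, alpha=0.05):
    """Fraction of simulation replicates with at least one false
    discovery.  pval_matrix[r][j] is the (multiplicity-adjusted)
    p value of variable j in replicate r; null_idx indexes the
    truly inactive variables."""
    errors = sum(any(row[j] <= alpha for j in null_idx)
                 for row in pval_matrix)
    return errors / len(pval_matrix)

# Example: variable 0 is truly inactive; two of four replicates flag it
pvals = [[0.01, 0.20],
         [0.50, 0.60],
         [0.04, 0.90],
         [0.90, 0.90]]
fwer = empirical_fwer(pvals, null_idx=[0])  # → 0.5
```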

For scenarios S1 and S2 (details available in the Supplementary Materials), we evaluated the selection accuracy of prognostic and predictive biomarkers for the five methods, with sample sizes of 300, 500, 800, and 1000, 2000 biomarkers, and censoring rates of 40% and 60%. Overall, gLASSO and PCA+LASSO achieved relatively high selection accuracy for the interaction term \(X_4H\) but performed poorly in selecting the main effects. LASSO and gradient boosting achieved selection accuracies close to 1 and were insensitive to the censoring rate, sample size, number of biomarkers, and scenario. Our proposed method, in turn, attains selection accuracy approaching 100% as the sample size or underlying biomarker effects increase or the censoring rate decreases. Furthermore, scenarios S3 and S4 (details available in the Supplementary Materials) compared the coefficient estimates across methods; relative to the four existing methods, our model provided unbiased estimates of the effects with improved efficiency throughout. The simulations were carried out using R 4.1.2 on a high-performance computing cluster. With 50 iterations of the bootstrapping procedure, the computation time of the proposed method is approximately 6 min per dataset for 1,000 biomarkers and a sample size of 300, increasing to as much as 26 min for larger numbers of biomarkers and sample sizes. The alternative methods require less time because they involve no bootstrapping (e.g., about 3 min for 1,000 biomarkers and a sample size of 300).

4 Applications

In the first example, we present an application of our method using an existing breast cancer study that includes patients with estrogen receptor (ER)-negative tumors (Desmedt 2011; Hatzis 2011). The gene expression data associated with the study are publicly available from the Gene Expression Omnibus database (refer to GSE16446 and GSE25066). The study comprised 614 patients: 507 received anthracycline-based adjuvant chemotherapy alone (coded as 0) (Desmedt 2011), and 107 received anthracycline plus taxane-based chemotherapy (coded as 1) (Hatzis 2011). The gene expression data were pre-processed (e.g., normalization and filtering out of low-expression genes), leaving 1,689 gene variables for direct analysis. The primary outcome of interest is distant recurrence-free survival, with a censoring rate of approximately 78% in both groups.

In the second application, we examine the effect of Tamoxifen treatment on patients with ER-positive breast tumors and evaluate gene expression biomarkers and their interactions with treatment (Loi 2007). The original dataset comprises 414 patients from the cohort GSE6532, collected by Loi (2007) to identify ER-positive subtypes with gene expression profiles. Our analysis focuses on the primary outcome of distant metastasis-free survival (DMFS). After excluding 34 patients who lacked any time-to-event records (no follow-up or dropout information) for the survival outcome, we are left with 255 patients who received Tamoxifen treatment and 125 who did not. The censoring rates for the two groups are 73.3% and 77.6%, respectively, and there are 44,916 gene expression measurements for each patient.

Table 3 Number of selected predictive and prognostic biomarkers

We applied our approach to identify prognostic and predictive gene biomarkers in the two applications and compared it with existing methods. For our proposed method, we used 50 iterations of the bootstrapping procedure and a 70% allocation rate to partition the data into training and testing sets. The nominal level was set at 0.05, and for each existing method, feature screening was performed on the training set. After obtaining 50 p values for each pair of prognostic and predictive biomarkers, we computed the aggregated p value for each pair via the quantile function introduced by Meinshausen (2009) with \(\eta_{min} = 0.05\).
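The aggregation step can be sketched as follows, following the quantile rule of Meinshausen (2009): for a quantile level \(\eta\), the aggregated value is \(Q(\eta)=\min \{1, q_{\eta}(\{p_j/\eta\})\}\), and taking the infimum over \(\eta \in [\eta_{min}, 1]\) with the correction factor \((1-\log \eta_{min})\) keeps the result a valid p value. A minimal Python sketch under these assumptions (the paper's implementation is in R; the grid over \(\eta\) and the empirical-quantile convention are our own choices):

```python
import math

def aggregate_pvalues(pvals, eta_min=0.05):
    """Aggregate per-split p values via the quantile rule of
    Meinshausen (2009): Q(eta) = min(1, q_eta({p_j / eta})), then
    P = min(1, (1 - log(eta_min)) * inf_{eta >= eta_min} Q(eta))."""
    m = len(pvals)
    s = sorted(pvals)

    def q(eta):
        # empirical eta-quantile of the scaled p values p_j / eta
        k = max(1, math.ceil(eta * m))
        return min(1.0, s[k - 1] / eta)

    grid = [i / 100 for i in range(int(eta_min * 100), 101)]
    inf_q = min(q(eta) for eta in grid)
    return min(1.0, (1 - math.log(eta_min)) * inf_q)
```

For 50 uniformly small per-split p values (e.g., all 0.001), the aggregated value remains well below 0.05 despite the conservative correction factor, while 50 p values of 1 aggregate to 1.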

As shown in Table 3, our proposed method did not identify any prognostic or predictive biomarkers in the first dataset. For the second dataset, however, the interaction of the gene HYPK (probe ’218680_x_at’) with the treatment indicator was selected as significant, indicating that HYPK is a predictive biomarker for Tamoxifen treatment with respect to DMFS. This finding is consistent with the literature (Hans-Dieter and RoyerMatthias 2017), in which HYPK is suggested as a novel predictive biomarker for breast cancer. LASSO selected the largest number of biomarkers, followed by boosting, adaptive LASSO, and gLASSO, while PCA+LASSO selected the fewest. This pattern mirrors what we observed in the simulation studies. Additionally, the HYPK gene selected by our proposed method was also identified by gradient boosting. The gene symbols of the prognostic and predictive biomarkers selected by each method are listed in Tables S.4-S.5 of the Supplementary Materials.

Per the reviewers’ suggestion, we present a third data application to further explore our method, provided in the Supplementary Material. In particular, we analyzed a microarray dataset (GSE22762) (Herold 2011) comprising 151 chronic lymphocytic leukemia (CLL) patients. The primary objective was to identify prognostic and predictive biomarkers associated with overall survival (OS) and salvage chemotherapy. Our proposed method identified 2 predictive and 3 prognostic biomarkers. For additional details, please refer to Section D of the Supplementary Material.

5 Discussion

In this paper, we presented a three-stage strategy for identifying prognostic and predictive biomarker signatures, which can be extended to higher-order terms, such as pairwise interactions among the biomarkers, depending on clinical interest or practical necessity. Our work builds upon the concept of multi-splitting for p-value adjustment to identify prognostic and predictive biomarkers under the survival framework, with a focus on Cox PH regressions. Specifically, we extend the approach of Meinshausen (2009) by generating pairwise p values through joint hypothesis testing. Were the p values generated via individual hypothesis testing instead, as in Meinshausen (2009), the resulting family-wise error rates (FWERs) would be overly conservative when identifying prognostic and predictive biomarkers in our setting.
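The multi-splitting idea can be illustrated schematically. The sketch below uses a correlation screen on the training half and a Fisher-z test on the held-out half as simple stand-ins for the paper's feature screening and joint Cox tests (all function names and the 70/30 split are our own illustration); the point is only the split–screen–test structure that yields one p value per variable per split, ready for quantile aggregation:

```python
import math
import random

def _corr(u, v):
    # Pearson correlation of two equal-length numeric sequences
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

def fisher_z_pvalue(r, n):
    # two-sided p value for H0: correlation = 0 (Fisher transform)
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def multi_split_pvalues(X, y, n_splits=50, n_keep=2, train_frac=0.7, seed=1):
    """For each random split: screen variables on the training part,
    test the screened variables on the held-out part (Bonferroni-
    adjusted within the screened set), and assign p = 1 to variables
    screened out.  Returns one list of per-split p values per variable."""
    n, p = len(X), len(X[0])
    pvals = [[] for _ in range(p)]
    rng = random.Random(seed)
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        cut = int(train_frac * n)
        tr, te = idx[:cut], idx[cut:]
        # screening: keep the n_keep variables most correlated with y
        score = [abs(_corr([X[i][j] for i in tr], [y[i] for i in tr]))
                 for j in range(p)]
        keep = sorted(range(p), key=lambda j: -score[j])[:n_keep]
        for j in range(p):
            if j in keep:
                r = _corr([X[i][j] for i in te], [y[i] for i in te])
                pvals[j].append(min(1.0, len(keep) * fisher_z_pvalue(r, len(te))))
            else:
                pvals[j].append(1.0)
    return pvals
```

Variables that carry a true signal receive consistently small per-split p values, while noise variables are either screened out (p = 1) or yield large p values on the held-out half, which is what makes the subsequent quantile aggregation conservative for them.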

We conducted extensive simulation studies, which demonstrated that our proposed approach controls the FWER close to the nominal level, whereas existing methods such as LASSO, gLASSO, PCA+LASSO, and gradient boosting fail to do so. For example, LASSO produces FWERs close to 1 in all scenarios, while boosting, gLASSO, and PCA+LASSO have unstable FWERs across scenarios. Controlling the FWER in cancer studies can improve the reliability of biomarker selection and testing during screening. Additionally, compared with existing methods, our proposed method provides accurate estimates of the effects of the selected biomarkers, with distributions centered closer to the true effects and lower dispersion; its mean square errors and relative biases are consistently lower across a variety of scenarios. By contrast, gLASSO and PCA+LASSO exhibit the largest variability of estimates, and the boosting method underestimates the effects in most scenarios. Furthermore, our method drives the selection accuracy of prognostic and predictive biomarkers toward 100% as the sample size or underlying biomarker effects increase or the censoring rate decreases, whereas existing methods (e.g., gLASSO and PCA+LASSO) show relatively low and inconsistent selection accuracy for the main effects, regardless of the censoring rate.

Furthermore, we applied our proposed method and other existing methods to analyze two breast cancer datasets and a chronic lymphocytic leukemia dataset, as detailed in the Supplementary Material. While no literature establishes the true prognostic and predictive biomarkers for these three gene expression datasets, our proposed method yielded intriguing findings. In the second data example, it identified the gene HYPK as a predictive biomarker for Tamoxifen treatment with respect to DMFS. This aligns with the finding of (Hans-Dieter and RoyerMatthias 2017), which suggests HYPK as a novel predictive biomarker for breast cancer; notably, gradient boosting also identified HYPK. For the CLL application shown in the Supplementary Material, our proposed method identified 2 predictive and 3 prognostic biomarkers associated with OS and chemotherapy. Among them, TCF7 stood out as both a prognostic and a predictive biomarker, indicating a significant impact on OS in CLL and providing insight into the effect of chemotherapy on CLL patients, in line with the observations of Herold (2011).

In summary, our proposed approach provides a robust tool for identifying prognostic and predictive biomarkers in cancer studies, demonstrating superior performance in simulation studies compared to existing methods, particularly in controlling the FWER. Note that FWER control is a conservative criterion: it emphasizes avoiding any false discovery within a family of tests and is therefore more restrictive than less stringent error rates. Of note, in real-world applications, various factors, such as non-randomized clinical trials, sample size, the distribution of biomarker values, the magnitude of biomarker effects, and varied pairwise correlations among biomarkers, may influence the selection results. In our data applications, we presented several illustrative examples with promising findings, and the genes selected by our method are supported by the existing literature.

Regarding the choice of quantiles for aggregating p values, we adopted a strategy akin to that of Meinshausen (2009), using a lower-bound value of \(\eta_{\text{min}} =0.05\) that has been both suggested and exclusively explored in prior studies (Meinshausen 2009; Renaux 2020; Shi et al. 2023; Buzdugan 2016). While alternative methods for p-value aggregation exist (Mitchell 2015), our context is distinctive in that the p values for each variable are generated through repeated random data splits, yielding empirical distributions. Aggregating p values through quantiles provides a flexible means of error-rate control, at the cost of a subjective quantile choice. The challenge of selecting the quantile parameter \(\eta_{\text{min}}\) \((0<\eta_{\text{min}} <1)\) has been acknowledged by Meinshausen (2009), and no universally accepted value of \(\eta_{\text{min}}\) guarantees error control; the results under our chosen value have proven satisfactory, though further exploration of this choice is warranted.

As for other future work, it may be worthwhile to investigate multicollinearity among gene expression levels, particularly when the correlations among biomarkers are relatively high. Additionally, the accelerated failure time model can be readily adapted to our three-stage framework to derive the p values and identify biomarker signatures when Cox PH models are inappropriate due to violations of the PH assumption. Overall, our proposed method has wide-ranging potential for application in cancer genetics studies and can be extended to other areas as needed.

Fig. 1

A matrix of panels showing family-wise error rates computed for our proposed model in the simulation studies; rows represent high and low censoring rates; columns represent the four simulation scenarios; the x-axis is the sample size and the y-axis is the FWER; the three line types represent different total numbers of biomarkers \(\widetilde{p}\)

Fig. 2

A matrix of panels showing coefficient estimates under the high censoring rate; rows represent the three biomarker effects; columns represent the four simulation scenarios; the x-axis represents the 11 setups of sample sizes and numbers of biomarkers; the y-axis represents the estimated coefficients; red dotted lines mark the true strengths of the biomarker effects

Fig. 3

A matrix of panels comparing FWERs across methods in the simulation studies; rows represent high and low censoring rates; columns represent the four simulation scenarios; the x-axis represents different sample sizes and numbers of biomarkers, the y-axis represents FWERs; the five colored lines represent the different methods