1 Introduction

Characterizing the correlation between two random variables is a classical and fundamental topic in statistics. Although the Pearson correlation is widely used in practice, it is well known to be effective only for detecting linear relationships. To describe nonlinear correlation, in a seminal paper, Székely et al. (2007) introduced the distance correlation, which is zero if and only if the two random vectors are independent. The introduction of distance correlation stimulated strong interest in developing new nonlinear correlations. For instance, Zhu et al. (2017) proposed the projection correlation, which does not require any moment condition. Pan et al. (2020) developed the Ball covariance to describe dependence in Banach spaces. Chatterjee (2021) introduced a new correlation with a simple expression. There are also other important nonparametric correlation methods. For instance, Gretton et al. (2005) introduced the Hilbert-Schmidt independence criterion (HSIC), which measures statistical dependence through distribution embeddings in a Hilbert space. Reshef et al. (2011) proposed the maximal information coefficient (MIC) to evaluate relationships between two random variables, characterized by two properties: generality and equitability. For other important methods, see the reviews by Josse and Holmes (2016) and Tjøstheim et al. (2022).

Though the recently developed correlations above are powerful for describing nonlinear relationships, they involve all aspects of the distributions of the two random vectors. However, in many applications such as regression problems, only a certain aspect of the distribution is of interest. Following this line, Shao and Zhang (2014) introduced the martingale difference correlation, which is a conditional mean version of the distance correlation. It is zero if and only if the two random vectors are conditional mean independent, i.e., the conditional mean of one variable given the other is equal to its unconditional mean. However, when its value is equal to one, the interpretation is not clear. Zheng et al. (2012) proposed the Generalized Measure of Correlation (GMC), which is a nonparametric extension of the classical \(R^2\) in linear regression models and fully describes the conditional mean dependence. Instead of the conditional mean, correlation in terms of conditional quantiles is also of great importance and interest. Compared with conditional mean regression, conditional quantile regression (Koenker 2005) is robust to outliers and heavy-tailed distributions, and can capture more information about the distribution. As an extension of the classical Pearson correlation, Li et al. (2015) introduced a novel quantile correlation. This quantile correlation has been applied and extended by many authors. For instance, Ma et al. (2017) adopted it to screen features, and Ma and Zhang (2016) proposed a composite quantile correlation. However, as pointed out by Xu (2017), both the quantile correlation and the composite quantile correlation can be zero even when there is a clear relationship.

In this paper, we also introduce correlations in terms of quantiles. Different from the quantile correlation in Li et al. (2015), however, our correlation can fully capture the conditional quantile dependence. It is in fact a quantile version of the Generalized Measure of Correlation and is thus called the quantile Generalized Measure of Correlation (qGMC). The qGMC has several appealing properties. It takes values between zero and one; the value is zero if and only if the two random vectors are conditionally quantile independent, and is one if and only if one variable is equal to the conditional quantile function. When there is a third random vector, it is also of great interest to develop a partial correlation. To this aim, Wang et al. (2015) introduced the conditional distance correlation, Azadkia and Chatterjee (2021) extended the correlation in Chatterjee (2021) and introduced a new partial correlation, and Park et al. (2015) introduced the partial martingale difference correlation. We also extend our quantile Generalized Measure of Correlation and introduce a quantile partial Generalized Measure of Correlation (qPGMC). Compared with these other recently developed partial correlations, the qPGMC is concerned with the quantile aspect.

We also make inference for these new measures. To estimate them, we need to estimate unknown conditional quantile functions. When the dimension is high, classical local polynomial methods do not work well, so modern flexible machine learning methods are adopted to estimate the conditional quantile functions. We show that when these conditional quantile functions can be consistently estimated, our new measures can also be consistently estimated. To construct confidence intervals, sample splitting is required to further reduce the asymptotic bias. We establish the asymptotic normalities of the relevant estimators under mild conditions. We further consider the composite quantile GMC, which integrates information from different quantile levels; its properties and estimators are investigated. We illustrate our proposed quantile GMC in a real data analysis.

The paper is organized as follows. In Sect. 2, we introduce our quantile GMC and quantile partial GMC and discuss their properties. In Sect. 3, we make inference for these two new measures. In Sect. 4, we extend to composite quantile GMC. In Sect. 5, numerical studies are conducted. A real data analysis is given in Sect. 6. We then give some discussions and conclusions in Sect. 7. Proofs are given in the Appendix.

2 Quantile generalized measure of correlation and its partial version

Now let \(Y\in \mathbb {R}\) be a one-dimensional random variable and \({\textbf{X}}=({\textbf{Z}}^\top ,\varvec{W}^\top )^\top \in \mathbb {R}^p\) be a p-dimensional random vector. Here \({\textbf{Z}}\in \mathbb {R}^{p_1}\) and \(\varvec{W}\in \mathbb {R}^{p_2}\) with \(p_1+p_2=p\) are two subvectors of \({\textbf{X}}\). In the following, we introduce new correlations to quantify the dependence between \({\textbf{X}}\) and Y and the partial dependence between \(\varvec{W}\) and Y given \({\textbf{Z}}\).

To quantify the relationship between \({\textbf{X}}\) and Y, we propose the following quantile Generalized Measure of Correlation (qGMC)

$$\begin{aligned} \hbox {qGMC}_{\tau }(Y,{\textbf{X}})=1-\frac{\mathbb {E}[\rho _{\tau }(Y-q_{\tau }({\textbf{X}}))]}{\mathbb {E}[\rho _{\tau }(Y-\xi _{\tau })]}. \end{aligned}$$
(1)

Here \(\tau \in (0,1)\), \(\rho _{\tau }(u)=(\tau -1/2)u+|u|/2=u\{\tau -I(u<0)\}\ge 0\) is the check loss function, \(q_{\tau }({\textbf{X}})=\arg \min _{\theta }\mathbb {E}[\rho _{\tau }(Y-\theta )|{\textbf{X}}]\) is the \(\tau \)th conditional quantile function of Y given \({\textbf{X}}\), and \(\xi _{\tau }=\arg \min _{\theta }\mathbb {E}[\rho _{\tau }(Y-\theta )]\) is the \(\tau \)th unconditional quantile of Y. In other words, we adopt the relative check loss with and without \({\textbf{X}}\) to measure the correlation between Y and \({\textbf{X}}\). From the definitions of \(q_{\tau }({\textbf{X}})\) and \(\xi _{\tau }\), it is clear that \(\mathbb {E}[\rho _{\tau }(Y-q_{\tau }({\textbf{X}}))]\le \mathbb {E}[\rho _{\tau }(Y-\xi _{\tau })]\). Accordingly \(0\le \hbox {qGMC}_{\tau }\le 1\). Recall the definition of the GMC in Zheng et al. (2012), that is,

$$\begin{aligned} \hbox {GMC}(Y,{\textbf{X}})=1-\frac{\mathbb {E}[\{Y-\mathbb {E}(Y|{\textbf{X}})\}^2]}{\mathbb {E}[\{Y-\mathbb {E}(Y)\}^2]}. \end{aligned}$$

It is clear that \(\hbox {qGMC}_{\tau }\) is the quantile version of the GMC and can be interpreted as a quantile version of the nonparametric \(R^2\). Compared with the GMC, the qGMC is robust to outliers and heavy-tailed distributions, and can capture more information by allowing different quantile levels \(\tau \). Further, the non-smoothness of the check loss function \(\rho _{\tau }(u)\) makes the inference for the qGMC more challenging. We will investigate its inference in the next section.
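To make the definition concrete, here is a minimal numerical sketch (ours, not from the paper) evaluating the plug-in version of (1) when the true conditional quantile is known; the location model \(Y=2X+\varepsilon\), the sample size, and the seed are illustrative assumptions:

```python
import numpy as np

def check_loss(u, tau):
    # rho_tau(u) = u * (tau - I(u < 0)), the check (pinball) loss
    return u * (tau - (u < 0))

def qgmc(y, cond_quantile, tau):
    """Plug-in qGMC_tau = 1 - E[rho_tau(Y - q_tau(X))] / E[rho_tau(Y - xi_tau)]."""
    xi = np.quantile(y, tau)                 # unconditional tau-quantile of Y
    num = check_loss(y - cond_quantile, tau).mean()
    den = check_loss(y - xi, tau).mean()
    return 1.0 - num / den

rng = np.random.default_rng(0)
n, tau = 50_000, 0.5
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)             # conditional median q_0.5(X) = 2X
print(qgmc(y, 2.0 * x, tau))                 # strictly between 0 and 1
```

Plugging in \(Y\) itself for the conditional quantile drives the numerator to zero and returns exactly one, matching the boundary interpretations discussed below.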

We further note that our quantile GMC has a genetic interpretation. As discussed in Visscher et al. (2008), the GMC above is also known as the heritability in genetics: the proportion of variation in a phenotype that is attributable to genetic factors. It is a fundamental quantity in genetic applications (Yang et al. 2010; Evans et al. 2018; Speed and Balding 2019). Accordingly, our quantile GMC can be interpreted as a quantile heritability. Compared with the classical heritability, this quantile heritability is robust to outliers and can capture more information.

Besides the overall quantile correlation of \({\textbf{X}}\) and Y, we are also interested in the partial correlation of Y and \(\varvec{W}\) after adjusting for (or controlling for) \({\textbf{Z}}\). Inspired by the above qGMC, we consider the following partial correlation:

$$\begin{aligned} \hbox {qPGMC}_{\tau }(Y,\varvec{W}|{\textbf{Z}})=1-\frac{\mathbb {E}[\rho _{\tau }(Y-q_{\tau }({\textbf{X}}))]}{\mathbb {E}[\rho _{\tau }(Y-q_{\tau }({\textbf{Z}}))]}. \end{aligned}$$

Here \(q_{\tau }({\textbf{Z}})=\arg \min _{\theta }\mathbb {E}[\rho _{\tau }(Y-\theta )|{\textbf{Z}}]\). If the above defined \(\hbox {qPGMC}_{\tau }\) is small, then \(\mathbb {E}[\rho _{\tau }(Y-q_{\tau }({\textbf{X}}))]\) and \(\mathbb {E}[\rho _{\tau }(Y-q_{\tau }({\textbf{Z}}))]\) are close, and thus the additional introduction of \(\varvec{W}\) leads to little increase in explanatory power. Thus, the \(\hbox {qPGMC}_{\tau }\) can be used to measure the importance of \(\varvec{W}\) in explaining Y given \({\textbf{Z}}\).

Denote \(\epsilon =Y-q_{\tau }({\textbf{X}})\). Let \(f_{\epsilon |{\textbf{X}}}\) be the conditional density function of \(\epsilon \) given \({\textbf{X}}\). Denote \(\delta = Y-\xi _{\tau }\) and let \(f_{\delta }\) be the density function of \(\delta \). Similarly define \(\eta =Y-q_{\tau }({\textbf{Z}})\) and \(f_{\eta |{\textbf{Z}}}\). Before presenting the properties and interpretations of \(\hbox {qGMC}_{\tau }\) and \(\hbox {qPGMC}_{\tau }\), the following condition is assumed.

  • (C1). There exist constants \(0<c<C<\infty \) such that, for a in a neighborhood of zero, \(c< f_{\epsilon |{\textbf{X}}}(a)< C\). The same assumption is imposed for \(f_{\eta |{\textbf{Z}}}\) and \(f_{\delta }\).

The boundedness of the conditional density function in (C1) is often assumed in quantile regression models; see, for instance, Wang et al. (2009, 2012), Zhang et al. (2020) and Tang et al. (2022).

Now we are ready to present the properties and interpretations of \(\hbox {qGMC}_{\tau }\) and \(\hbox {qPGMC}_{\tau }\) in the following proposition. Its proof is given in the Appendix.

Proposition 1

Under condition (C1), we have

  1.

    \(\hbox {qGMC}_{\tau }\) and \(\hbox {qPGMC}_{\tau }\) are two indicators lying between 0 and 1, that is,

    $$\begin{aligned} 0\le \hbox {qGMC}_{\tau }, \hbox {qPGMC}_{\tau }\le 1. \end{aligned}$$
  2.

    \(\hbox {qGMC}_{\tau }=0\) if and only if \(q_{\tau }({\textbf{X}})=\xi _{\tau },\) a.s. That is, \({\textbf{X}}\) does not add any information about the \(\tau \)th conditional quantile of Y.

  3.

    \(\hbox {qPGMC}_{\tau }=0\) if and only if \(q_{\tau }({\textbf{X}})=q_{\tau }({\textbf{Z}}),\) a.s. That is, given \({\textbf{Z}}\), \(\varvec{W}\) does not add any further information about the \(\tau \)th conditional quantile of Y.

  4.

    \(\hbox {qGMC}_{\tau }=1\) and \(\hbox {qPGMC}_{\tau }=1\) if and only if \(Y=q_{\tau }({\textbf{X}})\), a.s. That is, Y is almost surely equal to its \(\tau \)th conditional quantile function \(q_{\tau }({\textbf{X}})\).

  5.

    Further we have

    $$\begin{aligned} \hbox {qPGMC}_{\tau }(Y,\varvec{W}|{\textbf{Z}})=\frac{\hbox {qGMC}_{\tau }(Y,{\textbf{X}})-\hbox {qGMC}_{\tau }(Y,{\textbf{Z}})}{1-\hbox {qGMC}_{\tau }(Y,{\textbf{Z}})}. \end{aligned}$$
  6.

    Suppose \({\textbf{Z}}\) is a subset of \({\textbf{Z}}'\) and \({\textbf{X}}=({\textbf{Z}}'^\top , \varvec{W}'^\top )^\top \), then we have:

    $$\begin{aligned} \hbox {qGMC}_{\tau }(Y,{\textbf{X}})&\ge \hbox {qPGMC}_{\tau }(Y,\varvec{W}|{\textbf{Z}})\\&\ge \hbox {qPGMC}_{\tau }(Y,\varvec{W}'|{\textbf{Z}}'). \end{aligned}$$

From the above proposition, we know that the introduced quantile GMC and quantile partial GMC enjoy many interesting features. Firstly, they have simple expressions, which makes them easy to interpret. Secondly, they lie between zero and one, and the values zero and one have clear interpretations as stated in properties (2)–(4). Thirdly, they can capture nonlinear (partial) quantile dependence. Fourthly, the dimensions of \({\textbf{X}}\) and \({\textbf{Z}}\) are not limited to one and are allowed to be very large. The relationship between the quantile partial GMC and the quantile GMC is established in property (5). Since the quantile GMC can be viewed as a quantile generalization of the classical \(R^2\), the quantile partial GMC can then be viewed as the relative difference between the population quantile \(R^2\) obtained using the full set \({\textbf{X}}\) and that obtained using the subset \({\textbf{Z}}\) only. Moreover, property (6) shows that the more variables we condition on, the less additional information the remaining variables provide about the response Y. Further, the quantile partial GMC is always bounded above by the quantile GMC.

In the following, we make a comparison between the quantile correlation and the quantile partial correlation introduced by Li et al. (2015). Actually the quantile correlation takes the following form

$$\begin{aligned} \hbox {qcor}_{\tau }(Y,{\textbf{X}})&=\frac{\mathbb {E}[\psi _{\tau }(Y-\xi _{\tau })({\textbf{X}}-\mathbb {E}{\textbf{X}})]}{\sqrt{(\tau -\tau ^2)\sigma ^2_X}}\\&=\hbox {corr}(I(Y>\xi _{\tau }), {\textbf{X}}), \end{aligned}$$

where \( \psi _{\tau } (u) =\tau -I(u<0)\) and \(\sigma ^2_X=\hbox {var}({\textbf{X}})\). From the above formula, it is clear that the quantile correlation is actually the Pearson correlation of \(I(Y>\xi _{\tau })\) and \({\textbf{X}}\). Besides, the quantile partial correlation is defined as follows,

$$\begin{aligned} \text {qpcor}_{\tau }(Y, \varvec{W}| {\textbf{Z}}) =\frac{\mathbb {E}\left[ \psi _{\tau }\left( Y-\alpha _{2}-\varvec{\beta }_{2}^{\top } {\textbf{Z}}\right) \varvec{W}\right] }{\sqrt{\left( \tau -\tau ^{2}\right) \sigma _{\varvec{W}\mid {\textbf{Z}}}^{2}}}, \end{aligned}$$

where \({(\alpha _{2}, \varvec{\beta }_{2}^{\top })}^{\top }=\arg \min _{\alpha ,\varvec{\beta }} \mathbb {E}\left[ \rho _{\tau }\left( Y-\alpha -\varvec{\beta }^{\top } {\textbf{Z}}\right) \right] \) and \(\sigma _{\varvec{W}\mid {\textbf{Z}}}^{2}=\mathbb {E}(\varvec{W}-\alpha _{1}\) \(-\varvec{\beta }_{1}^{\top } {\textbf{Z}})^{2}\) with \({(\alpha _{1}, \varvec{\beta }_{1}^{\top })}^{\top }=\arg \min _{\alpha , \varvec{\beta }} \mathbb {E}\left( \varvec{W}-\alpha -\varvec{\beta }^{\top } {\textbf{Z}}\right) ^{2}\).

Compared with our introduced quantile GMC, the quantile correlation has two main drawbacks. Firstly and most importantly, \(\hbox {qcor}_{\tau }\) can only measure the linear relationship between Y and \({\textbf{X}}\) at the \(\tau \)th quantile, while our quantile GMC can detect nonlinear quantile correlation. Secondly, \(\hbox {qcor}_{\tau }\) is only applicable to one-dimensional \({\textbf{X}}\), while the dimension of \({\textbf{X}}\) is not limited for our correlation; we can even allow the dimension of \({\textbf{X}}\) to diverge. A similar comparison can be made between \(\hbox {qpcor}_{\tau }\) and \(\hbox {qPGMC}_{\tau }\), and the details are omitted here.

Now we present a simple example to illustrate the drawbacks of \(\hbox {qcor}_{\tau }\) and \(\hbox {qpcor}_{\tau }\).

Example 1

Let \(Y=\varvec{W}^2+ {\textbf{Z}}^4\), where \(\varvec{W}\) and \({\textbf{Z}}\) are independent and both follow N(0, 1). From property (4) in Proposition 1, it is clear that \(\hbox {qGMC}_{\tau }(Y,{\textbf{X}})=\hbox {qPGMC}_{\tau }(Y,\varvec{W}|{\textbf{Z}})=1\). However \(\hbox {qcor}_{\tau }(Y,\varvec{W})=\hbox {qcor}_{\tau }(Y,{\textbf{Z}})=\hbox {qpcor}_{\tau }(Y,\varvec{W}|{\textbf{Z}})=0\).

In fact note that

$$\begin{aligned} \mathbb {E}[\psi _{\tau }(Y-\xi _{\tau })(\varvec{W}-\mathbb {E}\varvec{W})]&=\mathbb {E}[\psi _{\tau }(Y-\xi _{\tau })\varvec{W}I(\varvec{W}<0)] \\&\quad +\mathbb {E}[\psi _{\tau }(Y-\xi _{\tau })\varvec{W}I(\varvec{W}>0)]\\&=\mathbb {E}[\psi _{\tau }(Y-\xi _{\tau })\varvec{W}I(\varvec{W}<0)] \\&\quad - \mathbb {E}[\psi _{\tau }(Y-\xi _{\tau })\varvec{W}I(\varvec{W}<0)]\\&= 0. \end{aligned}$$

We then have \(\hbox {qcor}_{\tau }(Y,\varvec{W})=0\). Similarly, \(\hbox {qcor}_{\tau }(Y,{\textbf{Z}})=0\). Further, \(\mathbb {E}[\psi _{\tau }(Y-\alpha _{2}\) \(-\varvec{\beta }_{2}^{\top } {\textbf{Z}}) \varvec{W}] = 0\), so \(\hbox {qpcor}_{\tau }(Y, \varvec{W}| {\textbf{Z}}) = 0\). This example illustrates that \(\hbox {qcor}_{\tau }\) and \(\hbox {qpcor}_{\tau }\) cannot detect nonlinear quantile correlation, while our proposed measures can.
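A quick Monte Carlo check of this example (our sketch; the sample size and seed are arbitrary): the sample version of \(\hbox {qcor}_{\tau }(Y,\varvec{W})=\hbox {corr}(I(Y>\xi _{\tau }),\varvec{W})\) is near zero, while the plug-in qGMC with the true conditional quantile \(q_{\tau }({\textbf{X}})=Y\) equals one exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau = 200_000, 0.5
w = rng.normal(size=n)
z = rng.normal(size=n)
y = w**2 + z**4                       # Y is a deterministic function of X = (W, Z)

xi = np.quantile(y, tau)
# empirical qcor_tau(Y, W) = corr(I(Y > xi_tau), W); vanishes by symmetry of W
qcor = np.corrcoef((y > xi).astype(float), w)[0, 1]

# plug-in qGMC_tau(Y, X): since q_tau(X) = Y a.s., the numerator is exactly 0
check = lambda u: u * (tau - (u < 0))
qgmc = 1.0 - check(y - y).mean() / check(y - xi).mean()

print(round(qcor, 3), qgmc)           # qcor ~ 0, qgmc = 1.0
```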

3 Statistical inference

In this section, we investigate how to estimate \(\hbox {qGMC}_{\tau }\) and \(\hbox {qPGMC}_{\tau }\) and construct confidence intervals for them.

Since both the qGMC and qPGMC involve expectations of check losses, it is natural to consider sample averages of the relevant quantities. For the qGMC and qPGMC, we also need to estimate the conditional quantile functions \(q_{\tau }({\textbf{X}})\) and \(q_{\tau }({\textbf{Z}})\). When the dimensions of \({\textbf{X}}\) and \({\textbf{Z}}\) are small, classical nonparametric methods such as kernel smoothing or local polynomial smoothing (Yu and Jones 1998; Kai et al. 2010) can be adopted. However, these classical nonparametric methods generally do not perform well when the dimension is large. To be more flexible, we adopt modern machine learning methods (Meinshausen 2006; Athey et al. 2019; Patidar et al. 2023), which perform well in nonlinear and high-dimensional settings. Now suppose \(\mathcal {D}=\{{\textbf{X}}_i, Y_i\}_{i=1}^N\) is an i.i.d. sample from the population \(({\textbf{X}}, Y)\). Further suppose \(\hat{\xi }_{\tau }\), \(\hat{q}_{\tau }({\textbf{X}}_i)\) and \(\hat{q}_{\tau }({\textbf{Z}}_i)\) are estimators of \(\xi _{\tau }\), \(q_{\tau }({\textbf{X}}_i)\) and \(q_{\tau }({\textbf{Z}}_i)\), respectively. We then estimate \(\hbox {qGMC}_{\tau }(Y,{\textbf{X}})\) and \(\hbox {qPGMC}_{\tau }(Y,\varvec{W}|{\textbf{Z}})\) as follows:

$$\begin{aligned} \widetilde{\hbox {qGMC}}_{\tau }&=\frac{\sum _{i=1}^N[\rho _{\tau }(Y_i-\hat{\xi }_{\tau })-\rho _{\tau }(Y_i-\hat{q}_{\tau }({\textbf{X}}_i))]}{\sum _{i=1}^N\rho _{\tau }(Y_i-\hat{\xi }_{\tau })};\nonumber \\ \widetilde{\hbox {qPGMC}}_{\tau }&=\frac{\sum _{i=1}^N[\rho _{\tau }(Y_i-\hat{q}_{\tau }({\textbf{Z}}_i))-\rho _{\tau }(Y_i-\hat{q}_{\tau }({\textbf{X}}_i))]}{\sum _{i=1}^N\rho _{\tau }(Y_i-\hat{q}_{\tau }({\textbf{Z}}_i))}. \end{aligned}$$
(2)
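The estimators in (2) are ratios of averaged check losses. A sketch (ours) that takes fitted conditional quantiles from any learner as inputs; to keep it self-contained we plug in the true quantiles of an illustrative linear model:

```python
import numpy as np

def check_loss(u, tau):
    return u * (tau - (u < 0))

def qgmc_qpgmc(y, qx_hat, qz_hat, tau):
    """Plug-in estimators (2). qx_hat / qz_hat: fitted tau-th conditional
    quantiles of Y given the full X and given the subset Z only."""
    xi_hat = np.quantile(y, tau)
    loss_xi = check_loss(y - xi_hat, tau).sum()
    loss_x = check_loss(y - qx_hat, tau).sum()
    loss_z = check_loss(y - qz_hat, tau).sum()
    return (loss_xi - loss_x) / loss_xi, (loss_z - loss_x) / loss_z

# toy model: Y = Z + W + eps, all standard normal; at tau = 0.5 the true
# conditional medians are q(X) = Z + W and q(Z) = Z
rng = np.random.default_rng(2)
n = 100_000
z_, w_ = rng.normal(size=n), rng.normal(size=n)
y = z_ + w_ + rng.normal(size=n)
g, pg = qgmc_qpgmc(y, z_ + w_, z_, tau=0.5)
print(g, pg)   # both in (0, 1), with qGMC >= qPGMC as in Proposition 1(6)
```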

To establish the asymptotic properties of \(\widetilde{\hbox {qGMC}}_{\tau }\) and \(\widetilde{\hbox {qPGMC}}_{\tau }\), the following condition is imposed.

  • (A1). \(\sum _{i=1}^N[\hat{q}_{\tau }({\textbf{X}}_i)-q_{\tau }({\textbf{X}}_i)]^2/N=o_p(1)\). The same assumption is imposed for \(\hat{q}_{\tau }({\textbf{Z}})\) and \(\hat{\xi }_{\tau }\).

Clearly condition (A1) here is mild; it only requires that the conditional quantile functions can be estimated consistently. In their condition (A3), Zhao et al. (2019) imposed a certain convergence rate on the estimators of \(q_{\tau }({\textbf{X}})\). The following theorem establishes the consistency of the estimators \(\widetilde{\hbox {qGMC}}_{\tau }\) and \(\widetilde{\hbox {qPGMC}}_{\tau }\).

Theorem 2

Under condition (A1), we have

$$\begin{aligned} {\widetilde{\hbox {qGMC}}}_{\tau }&=\hbox {qGMC}_{\tau }+o_p(1);\\ \widetilde{\hbox {qPGMC}}_{\tau }&=\hbox {qPGMC}_{\tau }+o_p(1). \end{aligned}$$

Although the above estimators enjoy consistency, they cannot be adopted to construct confidence intervals since they do not have asymptotic normality. Actually, terms like \(\sum _{i=1}^N[\hat{q}_{\tau }({\textbf{X}}_i)-q_{\tau }({\textbf{X}}_i)][I(\epsilon _i\le 0)-\tau ]/N\) are not \(o_p(N^{-1/2})\). Here \(\epsilon =Y-q_{\tau }({\textbf{X}})\). To make these terms of smaller order, a sample-splitting strategy can be adopted. To be precise, the data are randomly split into two parts \(\mathcal {D}_1\) and \(\mathcal {D}_2\) with sample size \(n=N/2\). We use \(\mathcal {D}_1\) to estimate \(q_{\tau }({\textbf{X}})\) and \(q_{\tau }({\textbf{Z}})\), and denote the estimators as \(\hat{q}_{\tau ,1}({\textbf{X}}_i)\) and \(\hat{q}_{\tau ,1}({\textbf{Z}}_i)\). We still estimate \(\xi _{\tau }\) using the whole data set \(\mathcal {D}\) and denote the estimator as \(\hat{\xi }_{\tau }\). We then estimate \(\hbox {qGMC}_{\tau }\) and \(\hbox {qPGMC}_{\tau }\) as follows:

$$\begin{aligned} \widehat{\hbox {qGMC}}_1&=\frac{\sum _{i\in \mathcal {D}_2}[\rho _{\tau }(Y_i-\hat{\xi }_{\tau })-\rho _{\tau }(Y_i-\hat{q}_{\tau ,1}({\textbf{X}}_i))]}{\sum _{i\in \mathcal {D}_2}\rho _{\tau }(Y_i-\hat{\xi }_{\tau })};\nonumber \\ \widehat{\hbox {qPGMC}}_1&=\frac{\sum _{i\in \mathcal {D}_2}[\rho _{\tau }(Y_i-\hat{q}_{\tau ,1}({\textbf{Z}}_i))-\rho _{\tau }(Y_i-\hat{q}_{\tau ,1}({\textbf{X}}_i))]}{\sum _{i\in \mathcal {D}_2}\rho _{\tau }(Y_i-\hat{q}_{\tau ,1}({\textbf{Z}}_i))}. \end{aligned}$$
(3)

We further adopt the refitted cross-validation (RCV) idea in Fan et al. (2012) to enhance estimation efficiency. Actually, one can reverse the training set and the validation set, construct two estimators of \({\hbox {qGMC}}_{\tau }\) and \({\hbox {qPGMC}}_{\tau }\) and average the pairs. To be more specific, define \(\widehat{\hbox {qGMC}}_{\tau }=(\widehat{\hbox {qGMC}}_1+ \widehat{\hbox {qGMC}}_2)/2\), where \(\widehat{\hbox {qGMC}}_2\) is similarly defined by swapping the role of \(\mathcal {D}_1\) and \(\mathcal {D}_2\). The same procedure can be applied to estimate \({\hbox {qPGMC}}_{\tau }\) and the details are omitted here.

We summarize our procedure of computing \({\hbox {qGMC}}_{\tau }\) and \({\hbox {qPGMC}}_{\tau }\) in Algorithm 1.

Algorithm 1

Estimation procedures based on data splitting
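A compact sketch of the data-splitting procedure with refitted cross-validation (our illustration; the k-nearest-neighbour quantile learner, the model, and all constants are stand-ins for whichever machine learning quantile method is used in practice):

```python
import numpy as np

def check_loss(u, tau):
    return u * (tau - (u < 0))

def knn_quantile(x_train, y_train, x_eval, tau, k=50):
    """Crude k-nearest-neighbour conditional tau-quantile: a stand-in
    for any machine-learning quantile learner."""
    preds = np.empty(len(x_eval))
    for i, xi in enumerate(x_eval):
        d = np.linalg.norm(x_train - xi, axis=1)
        nn = np.argpartition(d, k)[:k]
        preds[i] = np.quantile(y_train[nn], tau)
    return preds

def qgmc_crossfit(x, y, tau, seed=0):
    """Algorithm 1 in spirit: split, fit on one half, evaluate the check
    losses on the other half, swap the halves, and average (RCV)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    halves = np.array_split(idx, 2)
    xi_hat = np.quantile(y, tau)          # xi_tau uses the whole sample
    est = []
    for a, b in [(0, 1), (1, 0)]:
        tr, te = halves[a], halves[b]
        q_hat = knn_quantile(x[tr], y[tr], x[te], tau)
        num = check_loss(y[te] - q_hat, tau).sum()
        den = check_loss(y[te] - xi_hat, tau).sum()
        est.append(1.0 - num / den)
    return 0.5 * (est[0] + est[1])

rng = np.random.default_rng(3)
x = rng.normal(size=(2000, 2))
y = x[:, 0] + rng.normal(size=2000)
print(qgmc_crossfit(x, y, tau=0.5))       # positive and below 1
```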

To establish the asymptotic distributions of \(\widehat{\hbox {qGMC}}_{\tau }\) and \(\widehat{\hbox {qPGMC}}_{\tau }\), the following condition is imposed.

  • (C2). \(\mathbb {E}[\hat{q}_{\tau ,k}({\textbf{X}})-q_{\tau }({\textbf{X}})]^2=o(n^{-1/2}), k=1,2\). The same assumption is imposed for \(\hat{q}_{\tau ,k}({\textbf{Z}})\) and \(\hat{\xi }_{\tau }\).

The conditions \(\mathbb {E}[\hat{q}_{\tau ,k}({\textbf{X}})-q_{\tau }({\textbf{X}})]^2=o(n^{-1/2})\) and \(\mathbb {E}[\hat{q}_{\tau ,k}({\textbf{Z}})-q_{\tau }({\textbf{Z}})]^2=o(n^{-1/2})\) in (C2) are mild conditions on the estimation errors of the estimators of the unknown functions, and they are attainable by many machine learning methods. Among many others, see condition (A3) in Zhao et al. (2019), Theorem 2 in Shen et al. (2021), and Theorem 3.1 in Zhong and Wang (2023).

Theorem 3

Under conditions (C1)–(C2), we have

$$\begin{aligned} \sqrt{N}(\widehat{\hbox {qGMC}}_{\tau }-{\hbox {qGMC}}_{\tau })&=\frac{1}{\sqrt{N}} \sum _{i=1}^NL_i+o_p(1)\\&\overset{d}{\rightarrow }\ N(0,\hbox {Var}(L)); \\ \sqrt{N}(\widehat{\hbox {qPGMC}}_{\tau }-{\hbox {qPGMC}}_{\tau })&=\frac{1}{\sqrt{N}} \sum _{i=1}^N K_i+o_p(1)\\&\overset{d}{\rightarrow }\ N(0,\hbox {Var}(K)). \end{aligned}$$

Here \(\overset{d}{\rightarrow }\) denotes convergence in distribution. Further

$$\begin{aligned} L_i&=\frac{\rho _{\tau }(Y_i-\xi _{\tau })-\rho _{\tau }(Y_i-q_{\tau }({\textbf{X}}_i)) }{\mathbb {E}[\rho _{\tau }(Y-\xi _{\tau })]}\\&\quad -\frac{\mathbb {E}[\rho _{\tau }(Y-\xi _{\tau })-\rho _{\tau }(Y-q_{\tau }({\textbf{X}}))]}{\mathbb {E}^2[\rho _{\tau }(Y-\xi _{\tau })]}\rho _{\tau }(Y_i-\xi _{\tau });\\ K_i&=\frac{\rho _{\tau }(Y_i-q_{\tau }({\textbf{Z}}_i))-\rho _{\tau }(Y_i-q_{\tau }({\textbf{X}}_i)) }{\mathbb {E}[\rho _{\tau }(Y-q_{\tau }({\textbf{Z}}))]}\\&\quad -\frac{\mathbb {E}[\rho _{\tau }(Y-q_{\tau }({\textbf{Z}}))-\rho _{\tau }(Y-q_{\tau }({\textbf{X}}))]}{\mathbb {E}^2[\rho _{\tau }(Y-q_{\tau }({\textbf{Z}}))]} \rho _{\tau }(Y_i-q_{\tau }({\textbf{Z}}_i)). \end{aligned}$$

The above theorem establishes the asymptotic normalities of \(\widehat{\hbox {qGMC}}_{\tau }\) and \(\widehat{\hbox {qPGMC}}_{\tau }\). Based on it, we can construct confidence intervals for our introduced correlations. However, to make inference for \({\hbox {qGMC}}_{\tau }\), we also need to estimate \(\hbox {Var}(L)\), and \(L_i\) involves unknown parameters and unknown functions. To this aim, for \(i\in \mathcal {D}_2\), we can estimate \(L_i\) by

$$\begin{aligned} \hat{L}_i&=\frac{\rho _{\tau }(Y_i-\hat{\xi }_{\tau })-\rho _{\tau }(Y_i-\hat{q}_{\tau ,1}({\textbf{X}}_i)) }{\mathbb {E}_n[\rho _{\tau }(Y-\xi _{\tau })]}\\&\quad -\frac{\mathbb {E}_n[\rho _{\tau }(Y-\xi _{\tau })-\rho _{\tau }(Y-q_{\tau }({\textbf{X}}))]}{\mathbb {E}_n^2[\rho _{\tau }(Y-\xi _{\tau })]}\rho _{\tau }(Y_i-\hat{\xi }_{\tau }). \end{aligned}$$

Here

$$\begin{aligned} \mathbb {E}_n[\rho _{\tau }(Y-\xi _{\tau })]&=\frac{1}{N}\sum _{i=1}^N\rho _{\tau }(Y_i-\hat{\xi }_{\tau });\\ \mathbb {E}_n[\rho _{\tau }(Y-q_{\tau }({\textbf{X}}))]&=\frac{1}{n}\sum _{i\in \mathcal {D}_2}\rho _{\tau }(Y_i-\hat{q}_{\tau ,1}({\textbf{X}}_i)). \end{aligned}$$

We can estimate \(L_i, i\in \mathcal {D}_1\) similarly by swapping the role of \(\mathcal {D}_1\) and \(\mathcal {D}_2\). Then we estimate \(\hbox {Var}(L)\) by

$$\begin{aligned} \widehat{\hbox {Var}}(L)=\frac{1}{N}\sum _{i=1}^N \hat{L}_i^2. \end{aligned}$$

Then by Slutsky's theorem, a \((1-\alpha )\) confidence interval for \({\hbox {qGMC}}_{\tau }\) can be constructed as follows:

$$\begin{aligned} \left( \widehat{\hbox {qGMC}}_{\tau }-z_{\alpha /2}\cdot \sqrt{\frac{\widehat{\hbox {Var}}(L)}{N}},\ \widehat{\hbox {qGMC}}_{\tau }+z_{\alpha /2}\cdot \sqrt{\frac{\widehat{\hbox {Var}}(L)}{N}}\right) . \end{aligned}$$

The same steps can be applied to construct a confidence interval for \({\hbox {qPGMC}}_{\tau }\); the details are omitted here.
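The variance estimator and the resulting interval can be sketched as follows (our sketch; `qx_hat` stands for cross-fitted conditional quantiles \(\hat q_{\tau ,k}({\textbf{X}}_i)\), and for self-containedness the usage example plugs in the true quantiles of a toy model):

```python
import numpy as np

def check_loss(u, tau):
    return u * (tau - (u < 0))

def qgmc_ci(y, qx_hat, tau):
    """Point estimate and normal 95% CI for qGMC_tau based on the
    estimated influence values L_hat_i."""
    xi_hat = np.quantile(y, tau)
    r_xi = check_loss(y - xi_hat, tau)        # rho_tau(Y_i - xi_hat)
    r_x = check_loss(y - qx_hat, tau)         # rho_tau(Y_i - qhat(X_i))
    m_xi, m_x = r_xi.mean(), r_x.mean()
    est = 1.0 - m_x / m_xi
    L_hat = (r_xi - r_x) / m_xi - (m_xi - m_x) / m_xi**2 * r_xi
    half = 1.96 * np.sqrt(np.mean(L_hat**2) / len(y))   # z_{0.025} ~ 1.96
    return est, (est - half, est + half)

rng = np.random.default_rng(4)
n = 20_000
x = rng.normal(size=n)
y = x + rng.normal(size=n)                    # q_0.5(X) = X
est, (lo, hi) = qgmc_ci(y, x, tau=0.5)
print(round(est, 3), (round(lo, 3), round(hi, 3)))
```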

4 Composite quantile GMC

The quantile correlation measures introduced above focus on a specific quantile level. To be more comprehensive, we can propose the following composite quantile GMC and composite quantile partial GMC, denoted as cqGMC and cqPGMC:

$$\begin{aligned} {\hbox {cqGMC}}&=\int _0^1 {\hbox {qGMC}}_{\tau }\,\omega (\tau )d\tau ,\nonumber \\ {\hbox {cqPGMC}}&=\int _0^1 {\hbox {qPGMC}}_{\tau }\,\omega (\tau )d\tau . \end{aligned}$$
(4)

Here, \(\omega (\tau )>0\) is a suitable weight function over (0, 1) satisfying \(\int _0^1\omega (\tau )d\tau =1\). There are different choices for the weight function. If we want to weight the central quantile levels more heavily, we may consider \(\omega (\tau )=c\tau (1-\tau )\), where c is a normalizing constant. If we do not have any prior preference, we may consider \(\omega (\tau )=1\) for \(\tau \in (0,1)\), which results in an unweighted average over all quantile levels. This choice has been adopted and recommended in Zou and Yuan (2008) and Kai et al. (2010). It is difficult to determine the optimal weight. For simplicity, we follow Zou and Yuan (2008) and Kai et al. (2010) and take \(\omega (\tau )=1\).
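With \(\omega (\tau )=1\), the integral in (4) is simply an average of \(\hbox {qGMC}_{\tau }\) over quantile levels. A discretized sketch (ours; the grid \(\tau _j=j/30\), the location model, and the seed are illustrative assumptions):

```python
import numpy as np
from statistics import NormalDist

def check_loss(u, tau):
    return u * (tau - (u < 0))

def cqgmc(y, cond_quantiles, taus):
    """Composite qGMC with omega = 1: average qGMC_tau over a level grid.
    cond_quantiles[j] holds the tau_j-th conditional quantiles of Y given X."""
    vals = []
    for q_j, tau in zip(cond_quantiles, taus):
        xi = np.quantile(y, tau)
        vals.append(1.0 - check_loss(y - q_j, tau).mean()
                        / check_loss(y - xi, tau).mean())
    return float(np.mean(vals))

rng = np.random.default_rng(5)
n = 50_000
x = rng.normal(size=n)
y = x + rng.normal(size=n)            # q_tau(X) = X + z_tau with z_tau = Phi^{-1}(tau)
taus = np.arange(1, 30) / 30          # a grid of B = 29 interior levels
quants = [x + NormalDist().inv_cdf(t) for t in taus]
print(cqgmc(y, quants, taus))
```

For this Gaussian location model the population \(\hbox {qGMC}_{\tau }\) is the same at every level, so the composite value is close to the common value \(1-1/\sqrt{2}\).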

From Proposition 1, we have the following results.

Proposition 4

For the above introduced \({\hbox {cqGMC}}\) and \({\hbox {cqPGMC}}\), under condition (C1), we have

  1.

    \({\hbox {cqGMC}}\) and \({\hbox {cqPGMC}}\) are two indicators lying between 0 and 1, that is,

    $$\begin{aligned} 0\le {\hbox {cqGMC}},\quad {\hbox {cqPGMC}}\le 1. \end{aligned}$$
  2.

    \({\hbox {cqGMC}}=0\) if and only if \(q_{\tau }({\textbf{X}})=\xi _{\tau }\) for all quantile levels, a.s. That is, \({\textbf{X}}\) is independent of Y.

  3.

    \({\hbox {cqPGMC}}=0\) if and only if \(q_{\tau }({\textbf{X}})=q_{\tau }({\textbf{Z}})\) for all quantile levels, a.s. That is, given \({\textbf{Z}}\), \(\varvec{W}\) is conditionally independent of Y.

  4.

    \({\hbox {cqGMC}}=1\) and \({\hbox {cqPGMC}}=1\) if and only if \(Y=q_{\tau }({\textbf{X}})\) for all quantile levels a.s.

From the above proposition, we know that by integrating information from different quantile levels, the composite quantile GMC and composite quantile partial GMC can characterize independence and conditional independence.

We can estimate \({\hbox {cqGMC}}\) and \({\hbox {cqPGMC}}\) by

$$\begin{aligned} \widehat{\hbox {cqGMC}}&=\frac{1}{B}\sum _{j=1}^B\widehat{\hbox {qGMC}}_{\tau _j}\,\omega (\tau _j);\\ \widehat{\hbox {cqPGMC}}&=\frac{1}{B}\sum _{j=1}^B \widehat{\hbox {qPGMC}}_{\tau _j}\,\omega (\tau _j). \end{aligned}$$

Here \(0<\tau _1<\tau _2<\cdots<\tau _B<1\) are B chosen quantile levels. In practice we set B to be 29. The asymptotic normalities of \(\widehat{\hbox {cqGMC}}\) and \(\widehat{\hbox {cqPGMC}}\) can be derived as follows.

Define \(\epsilon _j=Y-q_{\tau _j}({\textbf{X}})\). Let \(f_{\epsilon _j|{\textbf{X}}}\) be the conditional density function of \(\epsilon _j\) given \({\textbf{X}}\). Let \(\delta _j = Y-\xi _{\tau _j}\) and \(f_{\delta _j}\) be the density function of \(\delta _j\). Similarly define \(\eta _j=Y-q_{\tau _j}({\textbf{Z}})\) and \(f_{\eta _j|{\textbf{Z}}}\). Further let \(\hat{q}_{k\tau _j}({\textbf{X}}), \hat{q}_{k\tau _j}({\textbf{Z}})\) be estimators of \(q_{\tau _j}({\textbf{X}}), q_{\tau _j}({\textbf{Z}})\) based on \(\mathcal {D}_k\). Also let \(\hat{\xi }_{\tau _j}\) be the estimator of \(\xi _{\tau _j}\) based on the whole data. To establish the asymptotic distributions of \(\widehat{\hbox {cqGMC}}\) and \(\widehat{\hbox {cqPGMC}}\), the following two conditions are required.

  • (C1’). There exist constants \(0<c<C<\infty \) such that, for a in a neighborhood of zero, \(c< f_{\epsilon _j|{\textbf{X}}}(a)< C\). The same assumption is imposed for \(f_{\eta _j|{\textbf{Z}}}\) and \(f_{\delta _j}\).

  • (C2’). \(\mathbb {E}[\hat{q}_{k\tau _j}({\textbf{X}})-q_{\tau _j}({\textbf{X}})]^2=o(n^{-1/2}), k=1,2\). The same assumption is imposed for \(\hat{q}_{k\tau _j}({\textbf{Z}})\) and \(\hat{\xi }_{\tau _j}\).

Clearly these two conditions (C1’) and (C2’) are in the same spirit as conditions (C1) and (C2).

Theorem 5

Under conditions (C1’)–(C2’), we have

$$\begin{aligned} \sqrt{N}(\widehat{\hbox {cqGMC}}-{\hbox {cqGMC}})&=\frac{1}{\sqrt{N}} \sum _{i=1}^N L_{ci}+o_p(1)\\&\overset{d}{\rightarrow }\ N(0,\hbox {Var}(L_{c}));\\ \sqrt{N}(\widehat{\hbox {cqPGMC}}-{\hbox {cqPGMC}})&=\frac{1}{\sqrt{N}} \sum _{i=1}^N K_{ci}+o_p(1)\\&\overset{d}{\rightarrow }\ N(0,\hbox {Var}(K_{c})). \end{aligned}$$

Here

$$\begin{aligned} L_{ci}&= \frac{1}{B}\sum _{j=1}^B\left[ \frac{\rho _{\tau _j}(Y_i-\xi _{\tau _j})-\rho _{\tau _j}(Y_i-q_{\tau _j}({\textbf{X}}_i)) }{\mathbb {E}\{\rho _{\tau _j}(Y-\xi _{\tau _j})\}}\right. \\&\quad \left. -\frac{\mathbb {E}\{\rho _{\tau _j}(Y-\xi _{\tau _j})-\rho _{\tau _j}(Y-q_{\tau _j}({\textbf{X}}))\}}{\mathbb {E}^2\{\rho _{\tau _j}(Y-\xi _{\tau _j})\}}\rho _{\tau _j}(Y_i-\xi _{\tau _j})\right] \omega (\tau _j);\\ K_{ci}&= \frac{1}{B}\sum _{j=1}^B\left[ \frac{\rho _{\tau _j}(Y_i-q_{\tau _j}({\textbf{Z}}_i))-\rho _{\tau _j}(Y_i-q_{\tau _j}({\textbf{X}}_i)) }{\mathbb {E}\{\rho _{\tau _j}(Y-q_{\tau _j}({\textbf{Z}}))\}}\right. \\&\quad \left. -\frac{\mathbb {E}\{\rho _{\tau _j}(Y-q_{\tau _j}({\textbf{Z}}))-\rho _{\tau _j}(Y-q_{\tau _j}({\textbf{X}}))\}}{\mathbb {E}^2\{\rho _{\tau _j}(Y-q_{\tau _j}({\textbf{Z}}))\}}\rho _{\tau _j}(Y_i-q_{\tau _j}({\textbf{Z}}_i))\right] \omega (\tau _j). \end{aligned}$$

This theorem can be proven based on Theorem 3, and thus the proof is omitted here. Confidence intervals can be constructed similarly and the details are omitted.

Remark 1

Motivated by regional quantile regression methods (Zheng et al. 2015; Park and He 2017; Yoshida 2021; Park et al. 2023), we can further explore the applicability of (4) within a specific quantile-level region \(\Delta \subseteq (0,1)\), instead of over all quantile levels in (0, 1). To be precise, we introduce the following regional quantile GMC and regional quantile partial GMC, denoted as rqGMC and rqPGMC, for any quantile-level region \(\Delta \):

$$\begin{aligned} {\hbox {rqGMC}}&=\frac{\int _\Delta {\hbox {qGMC}}_{\tau }\,\omega (\tau )d\tau }{\int _{\Delta }\omega (\tau )d\tau },\nonumber \\ {\hbox {rqPGMC}}&=\frac{\int _\Delta {\hbox {qPGMC}}_{\tau }\,\omega (\tau )d\tau }{\int _{\Delta }\omega (\tau )d\tau }. \end{aligned}$$
(5)
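In practice, the integrals in (5) can be approximated by a Riemann sum over a grid in \(\Delta \). The following is a minimal sketch, assuming a fitted per-level estimate `qgmc` and a weight function `weight` are available; both names are hypothetical stand-ins, not part of our implementation:

```python
# Sketch: approximate rqGMC in Eq. (5) by a midpoint Riemann sum over Delta.
# `qgmc` and `weight` are hypothetical callables standing in for an estimate
# of qGMC_tau and the weight function omega(tau).

def rqgmc(qgmc, weight, delta=(0.05, 0.2), grid_size=50):
    lo, hi = delta
    # midpoints of grid_size equal subintervals of Delta
    taus = [lo + (hi - lo) * (k + 0.5) / grid_size for k in range(grid_size)]
    num = sum(qgmc(t) * weight(t) for t in taus)
    den = sum(weight(t) for t in taus)
    return num / den
```

Since both numerator and denominator use the same grid, the interval length cancels and only the weighted average over \(\Delta \) remains.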

In this way, we can focus on specific quantile-level regions of interest. For instance, to examine the lower quantiles, we may choose \(\Delta =[0.05, 0.2]\); conversely, if we are interested in the higher quantiles, the interval \(\Delta =[0.75, 0.9]\) can be considered. The properties and confidence intervals for \({\hbox {rqGMC}}\) and \({\hbox {rqPGMC}}\) can be developed similarly. The details are omitted here.

5 Numerical studies

In this section, we present some numerical results to assess the finite sample performance of our proposed measures.

Example 1

Linear model:

$$\begin{aligned} Y_{i}&={\textbf{Z}}_{i}^\top \varvec{\beta }_1 + \varvec{W}_{i}^\top \varvec{\theta }_1 + \frac{1}{4} W_{i1}\varepsilon _{i}\\&\quad + \frac{1}{5} W_{i2}\varepsilon _{i},\quad i=1,2,\dots ,N. \end{aligned}$$

Example 2

Linear model:

$$\begin{aligned} Y_{i}&={\textbf{Z}}_{i}^\top \varvec{\beta }_2 + \varvec{W}_{i}^\top \varvec{\theta }_2 + \frac{1}{4} W_{i1}\varepsilon _{i} \\&\quad + \frac{1}{5} W_{i2}\varepsilon _{i},\quad i=1,2,\dots ,N. \end{aligned}$$

Example 3

Nonlinear model:

$$\begin{aligned} Y_{i}&=\sin ({\textbf{Z}}_{i}^\top \varvec{\beta }_2) +\sin (\varvec{W}_{i}^\top \varvec{\theta }_2) + \frac{1}{4} W_{i1}\varepsilon _{i} \\&\quad + \frac{1}{5} W_{i2}\varepsilon _{i}, \quad i=1,2,\dots ,N. \end{aligned}$$

The predictor \({\textbf{X}}_{i}=({\textbf{Z}}_{i}^\top , \varvec{W}_{i}^\top )^\top \), with \({\textbf{Z}}_{i}\in \mathbb {R}^{p_1}\), \(\varvec{W}_{i}\in \mathbb {R}^{p_2}\) and \(p_1=p_2=p/2\), is drawn from the multivariate normal distribution \(N_p({\varvec{0}}_p,{\varvec{\Sigma }})\), and \(W_{ij}\) denotes the jth element of \(\varvec{W}_{i}\). The covariance matrix \({\varvec{\Sigma }}\) is a \(p\times p\) matrix with the Toeplitz structure \(({\varvec{\Sigma }})_{ij}= 0.5^{|i-j|} \) for \(i,j=1,2,\ldots ,p\). Furthermore, the random error term \(\varepsilon _{i}\) is independent of \(({\textbf{Z}}_{i}, \varvec{W}_{i})\). There are two settings for the distribution of \(\varepsilon _{i}\):

  • Setting 1: \(\varepsilon _{i}\) follows the Gaussian mixture distribution \( b N(0,1) + (1-b)N(0,8)\) with \(b \sim B(1,0.95)\), i.e. a Bernoulli variable with success probability 0.95.

  • Setting 2: \(\varepsilon _{i}\) is simulated from the t-distribution with 3 degrees of freedom, i.e. \( \varepsilon _{i}\sim t_3\).

The regression coefficients \(\varvec{\beta }_1\) and \(\varvec{\theta }_1\) are set as \(\beta _{1j} = b_0\) for \(1\le j\le p_1\) and \(\theta _{1j} = c_0\) for \(1\le j\le p_2\). Similarly, let \({\beta }_{2j} = d_0\) for \(1\le j\le 3\) and \({\beta }_{2j} = 0\) otherwise, and set \(\theta _{2j} = e_0\) for \(1\le j\le 3\) and \(\theta _{2j} = 0\) otherwise. Throughout the simulation study, let \(\left\| \varvec{\beta }_1 \right\| = \left\| \varvec{\beta }_2 \right\| =1/\sqrt{3}\) and \(\left\| \varvec{\theta }_1 \right\| = \left\| \varvec{\theta }_2 \right\| =1/\sqrt{2}\). We use the Python package xgboost (version 2.0.3) to estimate the unknown quantile functions \(q_{\tau }({\textbf{X}})\) and \(q_{\tau }({\textbf{Z}})\). Specifically, in the xgboost configuration, we set objective to reg:quantileerror, tree_method to hist, and quantile_alpha to \(\tau \). The default hyperparameter settings for estimating \(q_{\tau }({\textbf{X}})\) are eta = 0.04 and max_depth = 1; for estimating \(q_{\tau }({\textbf{Z}})\), we use eta = 0.01 and max_depth = 1.
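For reference, the xgboost configuration described above can be sketched as the following parameter dictionary; the commented training call is only a hypothetical usage sketch, assuming xgboost (version \(\ge \) 2.0) is installed:

```python
# Sketch of the xgboost quantile-regression configuration described above
# for estimating q_tau(X); parameter values follow the text.

tau = 0.25  # quantile level

params_qx = {
    "objective": "reg:quantileerror",  # pinball (check) loss objective
    "tree_method": "hist",
    "quantile_alpha": tau,             # target quantile level tau
    "eta": 0.04,                       # learning rate; 0.01 when estimating q_tau(Z)
    "max_depth": 1,                    # depth-1 trees (stumps)
}

# Hypothetical usage (requires xgboost >= 2.0):
# import xgboost as xgb
# booster = xgb.train(params_qx, xgb.DMatrix(X, label=y))
# q_hat = booster.predict(xgb.DMatrix(X_new))
```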

Due to the nonlinearity of the quantiles, it is not straightforward to calculate the true value of \(\hbox {qGMC}_\tau \) or the other criteria. To obtain the true values, we therefore use a sampling approach: we generate a sample of very large size, say \(N_0=100{,}000 \), from the data-generating model. Based on this large sample, we estimate \(\xi _{\tau }, q_{\tau }({\textbf{X}})\) and \(q_{\tau }({\textbf{Z}})\), and then approximate \(\hbox {qGMC}_{\tau }(Y,{\textbf{X}})\) and \(\hbox {qPGMC}_{\tau }(Y,\varvec{W}|{\textbf{Z}})\) as follows:

$$\begin{aligned}&{\hbox {qGMC}}_{\tau }\approx \frac{\sum _{i=1}^{N_0}[\rho _{\tau }(Y_i-\hat{\xi }_{\tau })-\rho _{\tau }(Y_i-\hat{q}_{\tau }({\textbf{X}}_i))]}{\sum _{i=1}^{N_0}\rho _{\tau }(Y_i-\hat{\xi }_{\tau })};\\&{\hbox {qPGMC}}_{\tau }\approx \frac{\sum _{i=1}^{N_0}[\rho _{\tau }(Y_i-\hat{q}_{\tau }({\textbf{Z}}_i))-\rho _{\tau }(Y_i-\hat{q}_{\tau }({\textbf{X}}_i))]}{\sum _{i=1}^{N_0}\rho _{\tau }(Y_i-\hat{q}_{\tau }({\textbf{Z}}_i))}. \end{aligned}$$

That is, we approximate the true values of \(\hbox {qGMC}_{\tau }(Y,{\textbf{X}})\) and \(\hbox {qPGMC}_{\tau }(Y,\varvec{W}|{\textbf{Z}})\) with the same estimation formulas as in Eq. (2), but with a very large sample size. Since the data-generating model is known in the simulations, we can easily obtain a sample of arbitrarily large size. When \(N_0\) is sufficiently large, the sample is close to the underlying population, and thus the above values can be regarded as the true values.
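The approximation above amounts to evaluating plug-in ratio estimators built from the check loss \(\rho _\tau (u)=u(\tau -{\textbf{1}}\{u<0\})\). A minimal sketch in plain Python, where the fitted quantiles are passed in as lists (in our simulations they come from xgboost as described above):

```python
# Minimal sketch of the plug-in estimators above: rho_tau is the check
# (pinball) loss, and qgmc/qpgmc are the ratio estimators; xi_hat, qx_hat,
# and qz_hat are pre-computed fitted quantiles.

def rho_tau(u, tau):
    """Check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def qgmc(y, xi_hat, qx_hat, tau):
    """qGMC_tau: relative reduction in check loss from using q_tau(X)."""
    base = sum(rho_tau(yi - xi_hat, tau) for yi in y)
    fit = sum(rho_tau(yi - qi, tau) for yi, qi in zip(y, qx_hat))
    return (base - fit) / base

def qpgmc(y, qz_hat, qx_hat, tau):
    """qPGMC_tau: reduction in check loss of q_tau(X) relative to q_tau(Z)."""
    base = sum(rho_tau(yi - qi, tau) for yi, qi in zip(y, qz_hat))
    fit = sum(rho_tau(yi - qi, tau) for yi, qi in zip(y, qx_hat))
    return (base - fit) / base
```

Note the boundary behavior: if \(\hat{q}_\tau ({\textbf{X}}_i)\) fits the data perfectly the value is 1, and if it offers no improvement over \(\hat{\xi }_\tau \) the value is 0.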

In our simulation study, our goal is to build confidence intervals for \(\hbox {qGMC}_{\tau }\), \(\hbox {qPGMC}_{\tau }\), \(\hbox {cqGMC}\) and \(\hbox {cqPGMC}\) under different circumstances. In all simulations, we evaluate the coverage probabilities (CP) and the average lengths (AL) of 95% confidence intervals. We set \(p = 4, 6, 200, 500\) and vary the values of sample size \(N=2n\) over \(n = 200, 500\). To make inference for the \({\hbox {qGMC}}_{\tau }\) and \({\hbox {qPGMC}}_{\tau }\) of the simulated data in high-dimensional settings with \(p = 200\) and 500, we employ stability selection (Meinshausen and Bühlmann 2010) on the simulated data. This process aims to reduce the dimensionality of predictors, enabling us to train models using the reduced set of predictors.
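The two evaluation criteria can be sketched as follows; `intervals` is a hypothetical list of (lower, upper) pairs, one confidence interval per simulation run:

```python
# Sketch of the evaluation criteria: CP is the fraction of simulated 95%
# intervals containing the true value, AL is the mean interval length.

def coverage_and_length(intervals, true_value):
    cover = [lo <= true_value <= hi for lo, hi in intervals]
    lengths = [hi - lo for lo, hi in intervals]
    cp = sum(cover) / len(intervals)
    al = sum(lengths) / len(intervals)
    return cp, al
```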

We use the quantile levels \(\tau = 0.25, 0.5, 0.75\) in \(\hbox {qGMC}_{\tau }\) and \(\hbox {qPGMC}_{\tau }\), and the quantiles \(\tau _j = j/(B+1) \) for \(j=1,2,\dots ,B\) with \(B = 29\) in \(\hbox {cqGMC}\) and \(\hbox {cqPGMC}\). The results are calculated based on 500 simulation runs.
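The quantile grid and the composite averaging can be sketched as below; the equal-weight default mirrors the construction above, while any other weight vector is an assumption for illustration:

```python
# Sketch of the composite construction above: the equispaced grid
# tau_j = j/(B+1), j = 1, ..., B, with cqGMC as the weighted average of the
# per-level qGMC values (equal weights by default).

B = 29
taus = [j / (B + 1) for j in range(1, B + 1)]

def cqgmc(qgmc_values, weights=None):
    w = weights if weights is not None else [1.0] * len(qgmc_values)
    return sum(g * wi for g, wi in zip(qgmc_values, w)) / sum(w)
```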

The corresponding numerical results are presented in Tables 1, 2, 3 and 4. They provide strong evidence corroborating the asymptotic theory. Indeed, for all examples, settings, and quantile levels, the coverage probabilities for qGMC, qPGMC, cqGMC, and cqPGMC are all close to the nominal level 95%, and the average lengths of these confidence intervals are very short. When the sample size n increases, the empirical coverage probabilities move closer to 95% and the average lengths become smaller. We also find that our procedures still work well even when the dimension is as high as 500.

Table 1 The coverage probabilities (CP) and the average lengths (AL) of 95% confidence intervals for \( {\hbox {qGMC}}\) and \( {\hbox {qPGMC}}\) at \(\tau = 0.25, 0.50\) and 0.75 under Example 1
Table 2 The coverage probabilities (CP) and the average lengths (AL) of 95% confidence intervals for \( {\hbox {cqGMC}}\) and \( {\hbox {cqPGMC}}\) under Example 1
Table 3 The coverage probabilities (CP) and the average lengths (AL) of 95% confidence intervals for \( {\hbox {qGMC}}\) and \( {\hbox {qPGMC}}\) at \(\tau = 0.25, 0.50\) and 0.75 under Example 2
Table 4 The coverage probabilities (CP) and the average lengths (AL) of 95% confidence intervals for \( {\hbox {qGMC}}\) and \( {\hbox {qPGMC}}\) at \(\tau = 0.25, 0.50\) and 0.75 under Example 3

6 Real data analysis

We consider a real dataset consisting of 1200 male Carworth Farms White (CFW) mice, referred to as CFW data, to illustrate the effectiveness of our new measures. This dataset is made publicly accessible by Parker et al. (2016) and has been analyzed by Nicod et al. (2016) and Li and Auwerx (2020). It includes information on 92,734 SNPs of 1200 male CFW mice. The CFW mice in this dataset have been subjected to various phenotyping assessments, such as conditioned fear, anxiety behavior, methamphetamine sensitivity, prepulse inhibition, fasting glucose levels, body weight, tail length, and testis weight. These phenotypes can be categorized into three main groups: behavioral, physiological, and expression quantitative traits.

Table 5 The first five important SNPs and their marginal \(\hbox {qGMC}_\tau \)’s (in parentheses) at \(\tau = 0.25, 0.50\) and 0.75 in CFW data
Table 6 The \(\hbox {qGMC}_\tau \)’s and their 95% confidence intervals (in parentheses) at \(\tau = 0.25, 0.50\) and 0.75 in CFW data

We aim to illustrate that our new measures can be adopted for feature screening and for inferring heritability in the CFW data. Feature screening methods (Fan and Fan 2008; Zhu et al. 2011; Li et al. 2012) are used to discard as many noise features as possible while retaining all important features. Heritability is a fundamental quantity in genetic applications (Yang et al. 2010; Evans et al. 2018; Speed and Balding 2019) which specifies the contribution of genetic factors to the variation of a phenotype. For simplicity, we analyze three distinct phenotypes in this paper: “D1totaldist0to30”, “tibialength”, and “fastglucose”. “D1totaldist0to30” represents the “baseline” measurement of locomotor response on Day 1 of methamphetamine sensitivity testing; locomotor activity is measured as the distance traveled (in cm) over a 30-min interval immediately after an injection of saline solution. “tibialength” denotes the length of the tibia bone (in mm). “fastglucose” denotes the plasma glucose level after fasting. We normalize the SNPs and the phenotypes separately, after identifying missing values and removing them from the dataset. This normalization ensures that the data are standardized and ready for further analysis.

Fig. 1
figure 1

Box plot for D1totaldist0to30, tibialength and fastglucose after data standardization

We first conduct feature screening based on our quantile GMC. To be precise, we calculate the marginal \(\hbox {qGMC}_\tau \) value for each SNP and phenotype; these values represent the marginal associations between each SNP and the phenotype at different quantile levels. We then select the \(N/\log (N)\) SNPs with the largest \(\hbox {qGMC}_\tau \) values. For simplicity, we present the first five SNPs and their corresponding \(\hbox {qGMC}_\tau \) values at \(\tau = 0.25, 0.50\) and 0.75 in Table 5.
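The screening step above, keeping the \(N/\log (N)\) SNPs with the largest marginal values, can be sketched as follows; the SNP names and marginal values in the test are toy inputs, not taken from the CFW data:

```python
import math

# Sketch of marginal screening: keep the N/log(N) features with the largest
# marginal qGMC_tau values. `marginal_qgmc` maps feature name -> marginal value.

def screen(marginal_qgmc, n_samples):
    k = int(n_samples / math.log(n_samples))  # screening threshold N/log(N)
    ranked = sorted(marginal_qgmc, key=marginal_qgmc.get, reverse=True)
    return ranked[:k]
```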

Next, we use all \(N/\log (N)\) retained SNPs to calculate the overall \(\hbox {qGMC}_\tau \) values for the three phenotypes. To be more specific, these \(\hbox {qGMC}_\tau \) values represent the correlation between all retained SNPs and each of the three phenotypes at different quantile levels. We also compute the 95% confidence intervals for these \(\hbox {qGMC}_\tau \) values. The results are shown in Table 6. Each \(\hbox {qGMC}_\tau \) value in Table 6 represents the heritability of the SNPs for each phenotype at a specific quantile level \(\tau \), namely 0.25, 0.50, and 0.75.

The marginal \(\hbox {qGMC}_\tau \) values in Table 5 also indicate the proportion of the heritability that can be attributed to a specific SNP at various quantile levels. This information provides valuable insights into the relationship between genetic markers and phenotypic expression. We observe that the SNP rs265727287 is the most important SNP for the phenotype “tibialength” at quantile level 0.75; this SNP is also reported in Parker et al. (2016). We also find that the SNP rs244502760 plays an important role in the phenotype “D1totaldist0to30” at both the 0.25 and 0.5 quantile levels. Table 6 illustrates the ability of our new measures to quantify the heritability of the SNPs for each phenotype at different quantile levels. Furthermore, our results may differ somewhat from those obtained in Parker et al. (2016). This discrepancy arises from the use of different correlation measures: while they employed traditional correlation measures, we adopt the quantile GMC. This choice is justified by Fig. 1, which indicates that the distributions of these phenotypes are heavy-tailed. Therefore, our quantile GMC is better suited for variable screening and the investigation of heritability, particularly for heavy-tailed or peaked data.

7 Conclusions and discussions

In this paper, we propose the quantile GMC to measure nonlinear quantile correlation. The newly introduced measure has many desirable properties. Notably, it takes values between 0 and 1: the value is 0 if and only if the conditional quantile function is equal to the unconditional quantile, and 1 if and only if Y is almost surely equal to the conditional quantile function of \({\textbf{X}}\). A partial extension is also introduced. Statistical inference for the introduced measures is investigated, and flexible machine learning methods are adopted to estimate the unknown conditional quantile functions. Under very mild conditions, we establish the estimators’ consistency; however, sample splitting is required for the construction of confidence intervals. We further consider the composite quantile GMC and composite quantile partial GMC, which integrate information from different quantile levels. Their properties and estimators are investigated.

Our proposed dependence measures can have many applications. For instance, we may adopt our new measure for variable screening, or extend it to censored data. Besides, we may replace the check loss with other loss functions, such as the Huber loss. Moreover, when Y is multivariate, how to define a suitable measure and make inference is also of great interest. These topics will be pursued in the near future.