Statistical integration of heterogeneous omics data: Probabilistic two-way partial least squares (PO2PLS)
The Journal of the Royal Statistical Society: Series C (Applied Statistics) (IF 1.6) Pub Date: 2022-08-16, DOI: 10.1111/rssc.12583
Said el Bouhaddani 1, Hae‐Won Uh 1, Geurt Jongbloed 2, Jeanine Houwing‐Duistermaat 1, 3, 4

The availability of multi-omics data has revolutionized the life sciences by creating avenues for integrated system-level approaches. Data integration links the information across datasets to better understand the underlying biological processes. However, high dimensionality, correlations and heterogeneity pose statistical and computational challenges. We propose a general framework, probabilistic two-way partial least squares (PO2PLS), that addresses these challenges. PO2PLS models the relationship between two datasets using joint and data-specific latent variables. For maximum likelihood estimation of the parameters, we propose a novel fast EM algorithm and show that the estimator is asymptotically normally distributed. A global test for the relationship between two datasets is proposed, specifically addressing the high dimensionality, and its asymptotic distribution is derived. Notably, several existing data integration methods are special cases of PO2PLS. Via extensive simulations, we show that PO2PLS performs better than alternatives in feature selection and prediction performance. In addition, the asymptotic distribution appears to hold when the sample size is sufficiently large. We illustrate PO2PLS with two examples from commonly used study designs: a large population cohort and a small case–control study. Besides recovering known relationships, PO2PLS also identified novel findings. The methods are implemented in our R-package PO2PLS.
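As a rough illustration of what "joint and data-specific latent variables" can mean, the sketch below simulates two datasets that share a few latent components and each carry one component of their own. It is not the authors' estimation procedure and does not use the PO2PLS R package. The function name, dimensions, noise levels and the simple linear link between the joint scores are all assumptions made for illustration.

```python
import numpy as np


def simulate_joint_specific(n=500, p=30, q=20, r=2, r_x=1, r_y=1,
                            noise=0.1, seed=0):
    """Simulate X (n x p) and Y (n x q) from joint and specific latent parts.

    Hypothetical helper for illustration only; all defaults are assumptions.
    """
    rng = np.random.default_rng(seed)

    def loadings(rows, cols):
        # Orthonormal columns via a reduced QR decomposition.
        Q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
        return Q

    W, W_o = loadings(p, r), loadings(p, r_x)   # joint / specific loadings of X
    C, C_o = loadings(q, r), loadings(q, r_y)   # joint / specific loadings of Y

    T = rng.standard_normal((n, r))             # joint scores of X
    U = T + 0.3 * rng.standard_normal((n, r))   # joint scores of Y, linked to T
    T_o = rng.standard_normal((n, r_x))         # scores specific to X
    U_o = rng.standard_normal((n, r_y))         # scores specific to Y

    X = T @ W.T + T_o @ W_o.T + noise * rng.standard_normal((n, p))
    Y = U @ C.T + U_o @ C_o.T + noise * rng.standard_normal((n, q))
    return X, Y, T, U


X, Y, T, U = simulate_joint_specific()
```

In this toy setup only the joint scores `T` and `U` carry the relationship between `X` and `Y`; the specific parts and the noise add variation that is unrelated across the two datasets.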

Updated: 2022-08-16