A new paradigm for high-dimensional data: Distance-based semiparametric feature aggregation framework via between-subject attributes
Scandinavian Journal of Statistics (IF 1), Pub Date: 2023-11-08, DOI: 10.1111/sjos.12695
Jinyuan Liu 1 , Xinlian Zhang 2 , Tuo Lin 2 , Ruohui Chen 2 , Yuan Zhong 3 , Tian Chen 4 , Tsungchin Wu 2 , Chenyu Liu 2 , Anna Huang 5 , Tanya T. Nguyen 6, 7 , Ellen E. Lee 6, 8 , Dilip V. Jeste 9 , Xin M. Tu 2

This article proposes a distance-based framework motivated by the paradigm shift toward feature aggregation for high-dimensional data; the framework relies on neither the sparse-feature assumption nor permutation-based inference. Focusing on distance-based outcomes that preserve information without truncating any features, we develop a class of semiparametric regression models that encapsulates multiple sources of high-dimensional variables using pairwise outcomes of between-subject attributes. Further, we propose a strategy to address the interlocking correlations among pairs via U-statistics-based generalized estimating equations (UGEE), which correspond to their unique efficient influence function (EIF). Hence, the resulting semiparametric estimators are robust to distributional misspecification while enjoying root-n consistency and asymptotic optimality to facilitate inference. In essence, the proposed approach not only circumvents the information loss caused by feature selection but also improves the model's interpretability and computational feasibility. Simulation studies and applications to human microbiome and wearables data are provided, in which the feature dimensions are in the tens of thousands.
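
To make the idea of a pairwise, between-subject outcome concrete, the following is a minimal sketch and not the authors' UGEE estimator. It assumes Euclidean distance as the between-subject attribute, a single scalar covariate `z`, and a linear model for the pairwise mean fitted by ordinary least squares over all pairs (an identity working weight). The function names `pairwise_outcomes` and `fit_pairwise_linear` are hypothetical. The sketch yields point estimates only; valid standard errors would have to account for the correlation among pairs that share a subject, which is the part the abstract attributes to UGEE and which is not shown here.

```python
import numpy as np


def pairwise_outcomes(X, z):
    """Build one outcome and one regressor per unordered subject pair (i < j).

    X : (n, p) feature matrix; all p features are aggregated, none dropped.
    z : (n,) scalar covariate per subject (an assumption for illustration).
    Returns d (Euclidean distance between rows of X) and
    g (absolute covariate difference), each of length n * (n - 1) / 2.
    """
    n = X.shape[0]
    i_idx, j_idx = np.triu_indices(n, k=1)  # all pairs with i < j
    d = np.linalg.norm(X[i_idx] - X[j_idx], axis=1)
    g = np.abs(z[i_idx] - z[j_idx])
    return d, g


def fit_pairwise_linear(d, g):
    """Least-squares fit of E[d_ij] = b0 + b1 * g_ij over all pairs.

    This solves a pairwise estimating equation with identity weights.
    It does NOT provide standard errors: pairs sharing a subject are
    correlated, so naive OLS variance formulas would be invalid.
    """
    A = np.column_stack([np.ones_like(g), g])
    beta, *_ = np.linalg.lstsq(A, d, rcond=None)
    return beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 30, 10_000                       # p much larger than n
    z = rng.integers(0, 2, size=n).astype(float)
    # Simulated data: group membership shifts every feature slightly.
    X = rng.normal(size=(n, p)) + 0.05 * z[:, None]
    d, g = pairwise_outcomes(X, z)
    print(fit_pairwise_linear(d, g))
```

Note that the regression has only two parameters regardless of `p`, which illustrates how aggregating features into a distance sidesteps feature selection.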

Updated: 2023-11-08