Large language models assisted multi-effect variants mining on cerebral cavernous malformation familial whole genome sequencing
Computational and Structural Biotechnology Journal (IF 6), Pub Date: 2024-02-01, DOI: 10.1016/j.csbj.2024.01.014
Yiqi Wang , Jinmei Zuo , Chao Duan , Hao Peng , Jia Huang , Liang Zhao , Li Zhang , Zhiqiang Dong

Cerebral cavernous malformation (CCM) is a polygenic disease in which intricate genetic interactions across multiple factors contribute quantitatively to pathogenesis. The principal pathogenic genes of CCM, specifically KRIT1, CCM2, and PDCD10, have been reported, accompanied by a growing wealth of genetic data on mutations. Numerous other molecules associated with CCM have also been identified. However, handling such massive volumes of unstructured data remained challenging until the advent of advanced large language models. In this study, we developed BRLM, an automated analytical pipeline specialized in biomedical text analysis related to single nucleotide variants (SNVs). BioBERT was employed to vectorize the rich information on SNVs, while a deep residual network was used to discriminate the classes of the SNVs. BRLM was initially constructed on mutations from 12 different types of TCGA cancers, achieving an accuracy exceeding 99%. It was then applied to CCM mutations in familial sequencing data, highlighting an upstream master regulator gene, fibroblast growth factor 1 (FGF1). Multi-omics characterization and validation of biological function showed that FGF1 plays a significant role in the development of CCMs, demonstrating the effectiveness of our model. The BRLM web server is available at .
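
As a rough illustration of the classification step described in the abstract (text embeddings fed to a residual network that assigns each SNV a class), the following is a minimal sketch and not the authors' implementation. It assumes the network is a standard residual network with skip connections, and it replaces the BioBERT embedding with a small placeholder vector. All function names, dimensions, and weights are hypothetical, and the model is untrained.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, w1, b1, w2, b2):
    """Compute relu(x + F(x)), where F is two dense layers.

    The skip connection adds the block input back to the block output.
    """
    hidden = relu(linear(x, w1, b1))
    transformed = linear(hidden, w2, b2)
    return relu([xi + fi for xi, fi in zip(x, transformed)])


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def build_model(dim, n_classes, n_blocks, seed=0):
    """Create random (untrained) weights; sizes are illustrative assumptions."""
    rng = random.Random(seed)
    blocks = [
        (init_matrix(dim, dim, rng), [0.0] * dim,
         init_matrix(dim, dim, rng), [0.0] * dim)
        for _ in range(n_blocks)
    ]
    head = (init_matrix(n_classes, dim, rng), [0.0] * n_classes)
    return blocks, head


def predict_proba(embedding, model):
    """Pass one embedding through the residual blocks and a softmax head."""
    blocks, (w_out, b_out) = model
    x = list(embedding)
    for w1, b1, w2, b2 in blocks:
        x = residual_block(x, w1, b1, w2, b2)
    return softmax(linear(x, w_out, b_out))


if __name__ == "__main__":
    # Placeholder 8-dimensional vector standing in for a text embedding of one SNV.
    toy_embedding = [0.5] * 8
    model = build_model(dim=8, n_classes=3, n_blocks=2)
    print(predict_proba(toy_embedding, model))
```

In a real pipeline the placeholder vector would be replaced by an embedding produced by a language model, and the weights would be learned from labelled variants rather than drawn at random.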

Updated: 2024-02-01