Identification of population-informative markers from high-density genotyping data through combined feature selection and machine learning algorithms: Application to European autochthonous and cosmopolitan pig breeds
Animal Genetics (IF 2.4), Pub Date: 2024-01-08, DOI: 10.1111/age.13396
Giuseppina Schiavo 1 , Francesca Bertolini 1 , Samuele Bovo 1 , Giuliano Galimberti 2 , María Muñoz 3 , Riccardo Bozzi 4 , Marjeta Čandek‐Potokar 5 , Cristina Óvilo 3 , Luca Fontanesi 1

Large genotyping datasets obtained from high-density single nucleotide polymorphism (SNP) arrays, which have been developed for different livestock species, can be used to describe and differentiate breeds or populations. A few statistical approaches have been proposed to identify the most discriminating genetic markers among thousands of genotyped SNPs. In this study, we applied the Boruta algorithm, a wrapper around the random forest machine learning algorithm, to a database of 23 European pig breeds (20 autochthonous and three cosmopolitan breeds) genotyped with a 70k SNP chip, in order to pre-select informative SNPs. To define different sets of SNPs, these pre-selected markers were then ranked with random forest according to their mean decrease accuracy and mean decrease Gini indexes. We evaluated the efficiency of these subsets for breed classification and the usefulness of this approach for detecting candidate genes affecting breed-specific phenotypes and relevant production traits that might differ among breeds. The lowest overall classification error (2.3%) was reached with a subpanel of only 398 SNPs (ranked by mean decrease accuracy), with no classification error in seven breeds using up to 49 SNPs. Several SNPs in these selected subpanels were located in genomic regions where previous studies had identified signatures of selection or genes associated with morphological or production traits that distinguish the analysed breeds. Therefore, even though these approaches were not originally designed to identify signatures of selection, the results show that they could be useful for this purpose.
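The two-stage idea in the abstract (pre-select markers against randomized "shadow" copies, then rank the survivors by an importance score) can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes a deliberately simplified importance measure (the Gini impurity decrease of a single split on genotype) in place of random forest importances, and the function names, SNP names, breed labels, and toy genotypes are all hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(genotypes, breeds):
    """Impurity decrease when samples are partitioned by genotype value.

    Simplifying assumption: one split per SNP, not a forest of trees.
    """
    groups = {}
    for g, b in zip(genotypes, breeds):
        groups.setdefault(g, []).append(b)
    n = len(breeds)
    weighted = sum(len(v) / n * gini(v) for v in groups.values())
    return gini(breeds) - weighted


def shadow_select(snps, breeds, n_rounds=50, seed=0):
    """Keep SNPs that beat the best shuffled (shadow) copy in most rounds,
    then rank the kept SNPs by their importance score (highest first)."""
    rng = random.Random(seed)
    real = {name: gini_decrease(col, breeds) for name, col in snps.items()}
    hits = Counter()
    for _ in range(n_rounds):
        best_shadow = 0.0
        for col in snps.values():
            shadow = col[:]          # copy, so the real genotypes stay intact
            rng.shuffle(shadow)      # break any link between SNP and breed
            best_shadow = max(best_shadow, gini_decrease(shadow, breeds))
        for name, score in real.items():
            if score > best_shadow:
                hits[name] += 1
    kept = [name for name in snps if hits[name] > n_rounds / 2]
    return sorted(kept, key=lambda name: real[name], reverse=True)


# Toy data (made up): 10 animals from each of two hypothetical breeds.
breeds = ["A"] * 10 + ["B"] * 10
snps = {
    "snp_fixed": [0] * 10 + [2] * 10,  # alternative alleles fixed per breed
    "snp_noise": [0, 1] * 10,          # same genotype mix in both breeds
}
selected = shadow_select(snps, breeds)
print(selected)
```

In this toy setting the marker fixed for alternative alleles in the two breeds has the maximum possible score, while the marker with identical genotype frequencies in both breeds scores zero and cannot beat any shadow copy. A real analysis would use random forest importances (mean decrease accuracy or mean decrease Gini) computed over many trees, as the abstract describes.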

Updated: 2024-01-08