Many purported pseudogenes in bacterial genomes are bona fide genes
BMC Genomics (IF 4.4), Pub Date: 2024-04-15, DOI: 10.1186/s12864-024-10137-0
Nicholas P. Cooley, Erik S. Wright

Microbial genomes are largely composed of protein-coding sequences, yet some genomes contain many pseudogenes caused by frameshifts or internal stop codons. These pseudogenes are believed to result from gene degradation during evolution but could also be technical artifacts of genome sequencing or assembly. Using a combination of observational and experimental data, we show that many putative pseudogenes are attributable to errors that are incorporated into genomes during assembly. Within 126,564 publicly available genomes, we observed that nearly identical genomes often differed substantially in pseudogene counts. Causal inference implicated assembler, sequencing platform, and coverage as likely causative factors. Reassembly of genomes from raw reads confirmed that each variable affects the number of putative pseudogenes in an assembly. Furthermore, simulated sequencing reads corroborated our observations that the quality and quantity of raw data can significantly impact the number of pseudogenes in an assembler-dependent fashion. The number of unexpected pseudogenes due to internal stops was highly correlated (R² = 0.96) with average nucleotide identity to the ground truth genome, implying that relative pseudogene counts can be used as a proxy for overall assembly correctness. Applying our method to assemblies in RefSeq resulted in rejection of 3.6% of assemblies due to significantly elevated pseudogene counts. Reassembly from real reads obtained from high-coverage genomes showed considerable variability in spurious pseudogenes beyond that observed with simulated reads, reinforcing the finding that high coverage is necessary to mitigate assembly errors. Collectively, these results demonstrate that many pseudogenes in microbial genome assemblies are actually genes. Our results suggest that high read coverage is required for correct assembly and indicate that an inflated number of pseudogenes due to internal stops is indicative of poor overall assembly quality.
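
To make the notion of a "pseudogene due to an internal stop" concrete, here is a minimal Python sketch of how an in-frame premature stop codon might be detected in a predicted coding sequence. It is an illustration only and not the authors' pipeline: the function names `internal_stop_positions` and `count_internal_stop_candidates` are hypothetical, the input is assumed to be a list of nucleotide strings already in reading frame, and only the three stop codons shared by the standard and bacterial genetic codes are considered.

```python
# Stop codons shared by the standard code and the bacterial code (translation table 11).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stop_positions(cds: str) -> list[int]:
    """Return 0-based codon indices of in-frame stops that occur before the final codon.

    Assumes `cds` starts in frame; a trailing partial codon is ignored.
    """
    seq = cds.upper()
    n_codons = len(seq) // 3
    # The final codon is skipped because a stop is expected there.
    return [
        i
        for i in range(n_codons - 1)
        if seq[3 * i : 3 * i + 3] in STOP_CODONS
    ]


def count_internal_stop_candidates(cds_list: list[str]) -> int:
    """Count sequences that contain at least one internal in-frame stop codon."""
    return sum(1 for cds in cds_list if internal_stop_positions(cds))


if __name__ == "__main__":
    toy_sequences = [
        "ATGAAATAA",     # intact: the only stop is the terminal codon
        "ATGTAGAAATAA",  # premature TAG at codon index 1
        "atgtgaaaataa",  # premature TGA at codon index 1 (lowercase input)
    ]
    for s in toy_sequences:
        print(s, internal_stop_positions(s))
    print("candidates:", count_internal_stop_candidates(toy_sequences))
```

A per-assembly count of this kind is the sort of quantity the abstract relates to assembly correctness. Deciding whether a count is "significantly elevated" requires a comparison against closely related genomes, which this sketch does not attempt.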

Updated: 2024-04-15