Thorough Assessment of Machine Learning Techniques for Predicting Protein-Nucleic Acid Binding Hot Spots
Current Bioinformatics (IF 4), Pub Date: 2024-01-17, DOI: 10.2174/1574893618666230913090436
Xianzhe Zou, Chen Zhang, Mingyan Tang, Lei Deng

Background: Proteins and nucleic acids are vital biomolecules that are essential to life. Accurate and efficient identification of hot spots at protein-nucleic acid interfaces is crucial for guiding drug development, advancing protein engineering, and exploring the underlying molecular recognition mechanisms. Because experimental methods such as alanine scanning mutagenesis are time-consuming and expensive, a growing number of machine learning techniques are being used to predict hot spots. However, existing approaches are characterized by a lack of uniform standards, a scarcity of data, and a wide variety of features. There is currently no comprehensive overview or evaluation of this field, so a thorough review is needed.

Methods: In this study, we present an overview of state-of-the-art machine learning approaches for hot spot prediction in protein-nucleic acid complexes. We also outline the feature categories currently in use, derived from relevant biological data sources, and assess conventional feature selection methods on 600 extracted features. In addition, we create two new benchmark datasets, PDHS87 and PRHS48, and build separate binary classification models on them to evaluate the advantages and disadvantages of various machine learning techniques.

Results: Predicting protein-nucleic acid interaction hot spots is a challenging task. The study shows that structural neighborhood features play a crucial role in identifying hot spots. Prediction performance can be improved by choosing effective feature selection and machine learning methods. Among the existing prediction methods, XGBPRH performs best.

Conclusion: It is crucial to continue studying hot spot theories, discover new and effective features, add accurate experimental data, and utilize DNA/RNA information. Semi-supervised learning, transfer learning, and ensemble learning can improve predictive ability. Combining computational docking with machine learning methods may further improve predictive performance.
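
As a rough illustration of what a conventional filter-style feature selection step can look like, the sketch below ranks features by the absolute Pearson correlation between each feature column and a binary hot spot label, then keeps the top k. This is not the authors' pipeline: the abstract does not name the selection methods evaluated, and the toy data, the function names (`pearson`, `select_top_k`, `make_toy_data`), and the choice of correlation as the score are all assumptions made for the example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_top_k(X, y, k):
    """Return indices of the k features most correlated (in absolute value) with y."""
    n_features = len(X[0])
    scored = []
    for j in range(n_features):
        column = [row[j] for row in X]
        scored.append((abs(pearson(column, y)), j))
    # Highest score first; ties broken by lower feature index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scored[:k]]


def make_toy_data(n_samples=200, n_features=20, informative=(0, 5), seed=0):
    """Hypothetical data: Gaussian noise features, with a label-dependent shift
    added to the 'informative' columns. Label 1 stands in for 'hot spot'."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        label = rng.randint(0, 1)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in informative:
            row[j] += 2.0 * label
        X.append(row)
        y.append(label)
    return X, y


X, y = make_toy_data()
selected = select_top_k(X, y, k=2)
print("Selected feature indices:", sorted(selected))
```

On this toy data the two shifted columns should score well above the noise columns. With real descriptors, the selected subset would then be passed to a binary classifier and judged by cross-validated performance rather than by correlation alone.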

Updated: 2024-01-17