T4SEpp: a pipeline integrating protein language models to predict bacterial type IV secreted effectors
Computational and Structural Biotechnology Journal ( IF 6 ) Pub Date : 2024-01-23 , DOI: 10.1016/j.csbj.2024.01.015
Yueming Hu , Yejun Wang , Xiaotian Hu , Haoyu Chao , Sida Li , Qinyang Ni , Yanyan Zhu , Yixue Hu , Ziyi Zhao , Ming Chen

Many pathogenic bacteria use type IV secretion systems (T4SSs) to deliver effectors (T4SEs) into the cytoplasm of eukaryotic cells, causing disease. Identifying effectors is a crucial step in understanding the mechanisms of bacterial pathogenicity, but it remains a major challenge. In this study, we used the full-length embedding features generated by six pre-trained protein language models to train classifiers predicting T4SEs and compared their performance. We integrated three modules into a model called T4SEpp. The first module searched for full-length homologs of known T4SEs, signal sequences, and effector domains; the second module fine-tuned a machine learning model using data for a signal sequence feature; and the third module used the three best-performing pre-trained protein language models. On an independent validation dataset, T4SEpp outperformed other state-of-the-art (SOTA) software tools, achieving ~0.95 sensitivity at a high specificity of ~0.99. T4SEpp predicted 13 T4SEs from Helicobacter pylori, including the well-known CagA and 12 other potential ones, 11 of which could potentially interact with human proteins. This suggests that these potential T4SEs may be associated with the pathogenicity of H. pylori. Overall, T4SEpp provides a better solution to assist in the identification of bacterial T4SEs and facilitates studies of bacterial pathogenicity. T4SEpp is freely accessible at https://bis.zju.edu.cn/T4SEpp.
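The abstract reports performance as sensitivity at a given specificity. As a reminder of what those two numbers measure, here is a minimal sketch that computes both from binary labels. It is not the authors' evaluation code: the function name `sensitivity_specificity` and the toy labels are illustrative assumptions, not data from the paper.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels (1 = T4SE, 0 = non-T4SE)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # effectors correctly called
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # effectors missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-effectors correctly rejected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-effectors wrongly called

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")

    sensitivity = tp / (tp + fn)  # fraction of true T4SEs recovered
    specificity = tn / (tn + fp)  # fraction of non-T4SEs correctly rejected
    return sensitivity, specificity


# Toy labels, invented purely for illustration (not results from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

High specificity matters for genome-wide effector screens because non-effectors vastly outnumber effectors, so even a small false-positive rate can produce many spurious candidates.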




Updated: 2024-01-25