MASSA Algorithm: an automated rational sampling of training and test subsets for QSAR modeling
Journal of Computer-Aided Molecular Design (IF 3.5), Pub Date: 2023-10-07, DOI: 10.1007/s10822-023-00536-y
Gabriel Corrêa Veríssimo 1 , Simone Queiroz Pantaleão 2 , Philipe de Olveira Fernandes 1 , Jadson Castro Gertrudes 3 , Thales Kronenberger 4 , Kathia Maria Honorio 2, 5 , Vinícius Gonçalves Maltarollo 1

QSAR models capable of predicting biological, toxicity, and pharmacokinetic properties are widely used to search for lead bioactive molecules in chemical databases. How the dataset is prepared strongly influences the quality of the resulting models. In particular, the original dataset must be divided into a training set (for model fitting) and a test set (for statistical evaluation). This division can be done randomly or rationally, and rational division is superior. In this paper, we present MASSA, a Python tool that automatically samples datasets by exploring the biological, physicochemical, and structural spaces of molecules using principal component analysis (PCA), hierarchical cluster analysis (HCA), and K-modes clustering. The proposed algorithm is particularly useful when the variables to be used for QSAR are not yet available, or when multiple QSAR models are to be built with the same training and test sets, and it produces models with lower variability and better validation metrics. These results were obtained even when the descriptors used in the QSAR/QSPR modeling differed from those used to separate the training and test sets, indicating that the tool can be used to build models for more than one QSAR/QSPR technique. Finally, the tool also generates graphical representations that can provide insights into the data.
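
To make the idea of rational division concrete, the following is a minimal sketch of cluster-stratified sampling, one common form of rational splitting. It is not MASSA's implementation or API. It assumes that each molecule has already been assigned a cluster label by some earlier step (for example, HCA on PCA scores), and the function name `stratified_split`, its parameters, and the toy labels are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(cluster_labels, test_fraction=0.2, seed=0):
    """Split item indices into training and test sets, cluster by cluster.

    cluster_labels[i] is the precomputed (hypothetical) cluster of molecule i.
    """
    rng = random.Random(seed)

    # Group molecule indices by their cluster label.
    by_cluster = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        by_cluster[label].append(index)

    train, test = [], []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        rng.shuffle(members)
        # Singleton clusters stay in training; larger clusters
        # contribute at least one molecule to the test set.
        if len(members) < 2:
            n_test = 0
        else:
            n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


# Toy labels standing in for the output of a clustering step.
labels = ["a"] * 10 + ["b"] * 5 + ["c"] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(train_idx), len(test_idx))
```

Because every cluster with at least two members contributes to the test set, the test set covers the same regions of the clustered space as the training set, which a purely random split does not guarantee.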




Updated: 2023-10-08