SIMPD: an algorithm for generating simulated time splits for validating machine learning approaches

Journal of Cheminformatics (IF 8.6), Pub Date: 2023-12-11, DOI: 10.1186/s13321-023-00787-9
Gregory A. Landrum, Maximilian Beckers, Jessica Lanini, Nadine Schneider, Nikolaus Stiefl, Sereina Riniker

Time-split cross-validation is broadly recognized as the gold standard for validating predictive models intended for use in medicinal chemistry projects. Unfortunately, this type of data is not broadly available outside of large pharmaceutical research organizations. Here we introduce the SIMPD (simulated medicinal chemistry project data) algorithm to split public data sets into training and test sets that mimic the differences observed in real-world medicinal chemistry project data sets. SIMPD uses a multi-objective genetic algorithm with objectives derived from an extensive analysis of the differences between early and late compounds in more than 130 lead-optimization projects run within the Novartis Institutes for BioMedical Research. Applying SIMPD to the real-world data sets produced training/test splits that more accurately reflect the differences in properties and machine-learning performance observed for temporal splits than other standard approaches such as random or neighbor splits. We applied the SIMPD algorithm to bioactivity data extracted from ChEMBL and created 99 public data sets that can be used for validating machine-learning models intended for use in the setting of a medicinal chemistry project. The SIMPD code and simulated data sets are available under open-source/open-data licenses at github.com/rinikerlab/molecular_time_series.
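To make the distinction between a time split and a random split concrete, the following minimal sketch contrasts the two on toy data. It is not the SIMPD algorithm and does not use the code from the repository above; the record fields (`id`, `date`), the function names, and the dates are all hypothetical and chosen only for illustration.

```python
import random
from datetime import date


def time_split(records, test_fraction=0.2):
    """Train on the earliest records and test on the latest ones."""
    ordered = sorted(records, key=lambda r: r["date"])
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


def random_split(records, test_fraction=0.2, seed=0):
    """Assign records to train/test at random, ignoring their dates."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


# Hypothetical toy data: ten compounds with made-up registration dates.
records = [{"id": f"cmpd-{i}", "date": date(2020, 1, 1 + i)} for i in range(10)]

time_train, time_test = time_split(records)
rand_train, rand_test = random_split(records)

# In the time split, every test compound is newer than every training compound.
latest_train = max(r["date"] for r in time_train)
earliest_test = min(r["date"] for r in time_test)
```

In the time split the model is always evaluated on compounds made after those it was trained on, which is the situation a project team faces in practice. Public data sets usually lack the dates this requires, which is the gap the abstract says SIMPD is meant to fill by simulating such splits.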


Updated: 2023-12-11
