Rethinking the applicability domain analysis in QSAR models, Journal of Computer-Aided Molecular Design



Rethinking the applicability domain analysis in QSAR models
Journal of Computer-Aided Molecular Design (IF 3.5) Pub Date: 2024-02-14, DOI: 10.1007/s10822-024-00550-8
Jose R. Mora, Edgar A. Marquez, Noel Pérez-Pérez, Ernesto Contreras-Torres, Yunierkis Perez-Castillo, Guillermin Agüero-Chapin, Felix Martinez-Rios, Yovani Marrero-Ponce, Stephen J. Barigye

Notwithstanding the wide adoption of the OECD principles (or best practices) for QSAR modeling, disparities between in silico predictions and experimental results are frequent, suggesting that model predictions are often too optimistic. Of these OECD principles, applicability domain (AD) estimation has been recognized in several reports as one of the most challenging, implying that the reliability measures attached to model predictions are themselves often unreliable. Applying tree-based error analysis workflows to five QSAR models reported in the literature and available in the QsarDB repository, i.e., androgen receptor bioactivity (agonists, antagonists, and binders, respectively) and membrane permeability (highest membrane permeability and intrinsic permeability), we demonstrate that predictions erroneously tagged as reliable (AD prediction errors) overwhelmingly correspond to instances in subspaces (cohorts) with the highest prediction error rates, highlighting the inhomogeneity of the AD space. In this sense, we call for more stringent AD analysis guidelines that require the incorporation of model error analysis schemes, to provide critical insight into the reliability of the underlying AD algorithms. Additionally, any selected AD method should be rigorously validated to demonstrate its suitability for the model space over which it is applied. These steps will ultimately contribute to more accurate estimates of the reliability of model predictions. Finally, error analysis may also be useful in “rational” model refinement, in that data expansion efforts and model retraining can be focused on the cohorts with the highest error rates.
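The cohort-level check described in the abstract can be illustrated with a minimal sketch. It is not the authors' workflow: the abstract does not specify how cohorts are built, so the sketch simply assumes each compound already carries a cohort label (for example, the leaf of a decision tree fitted to the model's errors). The function names `cohort_error_rates` and `ad_errors_in_worst_cohorts` are hypothetical, and the toy data are made up for illustration, not taken from the paper.

```python
from collections import defaultdict


def cohort_error_rates(cohorts, is_error):
    """Return {cohort label: fraction of wrong predictions in that cohort}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for cohort, wrong in zip(cohorts, is_error):
        totals[cohort] += 1
        errors[cohort] += int(wrong)
    return {cohort: errors[cohort] / totals[cohort] for cohort in totals}


def ad_errors_in_worst_cohorts(cohorts, is_error, in_domain, top_k=1):
    """Share of AD prediction errors (tagged in-domain, yet wrong) that fall
    inside the top_k cohorts with the highest error rates."""
    rates = cohort_error_rates(cohorts, is_error)
    worst = set(sorted(rates, key=rates.get, reverse=True)[:top_k])
    ad_errors = [
        cohort
        for cohort, wrong, inside in zip(cohorts, is_error, in_domain)
        if wrong and inside
    ]
    if not ad_errors:
        return 0.0
    return sum(cohort in worst for cohort in ad_errors) / len(ad_errors)


# Toy data (hypothetical): cohort label, whether the prediction was wrong,
# and whether the AD method tagged the compound as inside the domain.
cohorts = ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C"]
is_error = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
in_domain = [1, 1, 1, 1, 1, 1, 0, 1, 1, 0]

rates = cohort_error_rates(cohorts, is_error)
share = ad_errors_in_worst_cohorts(cohorts, is_error, in_domain, top_k=1)
print(rates)
print(f"Share of AD errors in the worst cohort: {share:.2f}")
```

A share close to 1 would mean that the predictions wrongly tagged as reliable are concentrated in the highest-error cohorts, which is the pattern the abstract reports for the five models.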


Updated: 2024-02-15
