Effect of data harmonization of multicentric dataset in ASD/TD classification
Brain Informatics Pub Date : 2023-11-25 , DOI: 10.1186/s40708-023-00210-x
Giacomo Serra 1, 2 , Francesca Mainas 1, 2 , Bruno Golosio 1, 2 , Alessandra Retico 3 , Piernicola Oliva 2, 4

Machine Learning (ML) is now an essential tool in the analysis of Magnetic Resonance Imaging (MRI) data, in particular for identifying brain correlates of neurological and neurodevelopmental disorders. ML requires datasets of appropriate size for training, which in neuroimaging are typically obtained by collecting data from multiple acquisition centers. However, analyzing large multicentric datasets can introduce bias due to differences between acquisition centers. ComBat harmonization is commonly used to address these batch effects, but it can lead to data leakage when the entire dataset is used to estimate the parameters of the harmonization model. In this study, structural and functional MRI data from the Autism Brain Imaging Data Exchange (ABIDE) collection were used to classify subjects with Autism Spectrum Disorders (ASD) versus Typically Developing controls (TD). We compared three conditions: the classical approach (external harmonization), in which harmonization is performed before the train/test split; a harmonization estimated only on the train set (internal harmonization); and the dataset with no harmonization. Harmonization using the whole dataset achieved higher discrimination performance, while non-harmonized data and harmonization using only the train set showed similar results, for both structural and connectivity features. We also showed that the higher performance of external harmonization is not due to the larger sample available for estimating the harmonization model; the improvement obtained with the entire dataset may therefore be ascribed to data leakage. To prevent this leakage, it is recommended to define the harmonization model using the train set only.
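
The difference between external and internal harmonization can be illustrated with a minimal Python sketch. It is not the authors' pipeline and it is not ComBat: as a deliberately simplified stand-in, it uses a per-site location/scale adjustment (ComBat additionally applies empirical Bayes shrinkage and can preserve covariates). The function names `fit_harmonizer` and `apply_harmonizer`, the synthetic data, and the split sizes are all assumptions made for illustration. The sketch also assumes every site in the test set is present in the train set.

```python
import numpy as np

def fit_harmonizer(X, sites):
    """Estimate pooled and per-site mean/std per feature.
    Simplified stand-in for a harmonization model; NOT ComBat."""
    X = np.asarray(X, dtype=float)
    sites = np.asarray(sites)
    pooled_mean = X.mean(axis=0)
    pooled_std = X.std(axis=0) + 1e-12  # small offset avoids division by zero
    per_site = {}
    for s in np.unique(sites):
        Xs = X[sites == s]
        per_site[s] = (Xs.mean(axis=0), Xs.std(axis=0) + 1e-12)
    return pooled_mean, pooled_std, per_site

def apply_harmonizer(X, sites, model):
    """Map each site's features onto the pooled location and scale."""
    X = np.asarray(X, dtype=float)
    sites = np.asarray(sites)
    pooled_mean, pooled_std, per_site = model
    out = np.empty_like(X)
    for s in np.unique(sites):
        mask = sites == s
        mu, sd = per_site[s]  # KeyError if the site was unseen at fit time
        out[mask] = (X[mask] - mu) / sd * pooled_std + pooled_mean
    return out

# Synthetic data (hypothetical): 3 sites with an additive site effect.
rng = np.random.default_rng(0)
n, p = 200, 5
sites = rng.integers(0, 3, size=n)
X = rng.normal(size=(n, p)) + 2.0 * sites[:, None]
idx = rng.permutation(n)
train, test = idx[:150], idx[150:]

# Internal harmonization: the model sees the train set only,
# then is applied unchanged to the held-out test set.
internal_model = fit_harmonizer(X[train], sites[train])
X_train_int = apply_harmonizer(X[train], sites[train], internal_model)
X_test_int = apply_harmonizer(X[test], sites[test], internal_model)

# External harmonization: the model is estimated on the whole dataset
# before splitting, so test-set statistics enter the transformation.
external_model = fit_harmonizer(X, sites)
X_ext = apply_harmonizer(X, sites, external_model)
X_train_ext, X_test_ext = X_ext[train], X_ext[test]
```

In the internal variant, nothing computed from the test subjects influences the transformation applied to the train set, so altering the test data leaves `internal_model` unchanged. In the external variant the same alteration changes `external_model`, which is the route by which information can leak into training.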

Updated: 2023-11-26