Validating spatio-temporal environmental machine learning models: Simpson’s paradox and data splits
Environmental Research Communications (IF 2.9), Pub Date: 2024-03-08, DOI: 10.1088/2515-7620/ad2e44
Anna Boser

Machine learning has revolutionized environmental sciences by estimating scarce environmental data, such as air quality, land cover type, wildlife population counts, and disease risk. However, current methods for validating these models often ignore the spatial or temporal structure commonly found in environmental data, leading to inaccurate evaluations of model quality. This paper outlines the problems that can arise from such validation methods and describes how to avoid erroneous assumptions about training data structure. In an example on air quality estimation, we show that a poor model with an r² of 0.09 can falsely appear to achieve an r² value of 0.73 by failing to account for Simpson’s paradox. This same model’s r² can further inflate to 0.82 when improperly splitting data. To ensure high-quality synthetic data for research in environmental science, justice, and health, researchers must use validation procedures that reflect the structure of their training data.
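The two pitfalls named in the abstract can be illustrated with a small synthetic sketch. This is not the paper's data or code: the station layout, the noise levels, the function names (`r2`, `pooled_and_within_r2`, `group_split`), and the choice of squared Pearson correlation for r² are all assumptions made for illustration. The hypothetical "poor model" knows each station's long-run mean but predicts day-to-day variation as pure noise, so a pooled r² looks strong while the average within-station r² is close to zero. The split helper holds out whole stations so that no station contributes to both training and test sets.

```python
import numpy as np


def r2(y_true, y_pred):
    """Squared Pearson correlation between observations and predictions."""
    return float(np.corrcoef(y_true, y_pred)[0, 1] ** 2)


def pooled_and_within_r2(y_true, y_pred, groups):
    """Return (r2 over all rows pooled, mean of the per-group r2 values)."""
    pooled = r2(y_true, y_pred)
    within = [
        r2(y_true[groups == g], y_pred[groups == g]) for g in np.unique(groups)
    ]
    return pooled, float(np.mean(within))


def group_split(groups, test_fraction, rng):
    """Boolean train/test masks that keep every group entirely on one side."""
    unique = np.unique(groups)
    n_test = max(1, int(round(test_fraction * len(unique))))
    test_groups = rng.choice(unique, size=n_test, replace=False)
    test_mask = np.isin(groups, test_groups)
    return ~test_mask, test_mask


# Synthetic setup (hypothetical): 20 stations, 200 daily readings each.
rng = np.random.default_rng(0)
n_stations, n_days = 20, 200
station = np.repeat(np.arange(n_stations), n_days)
station_mean = rng.normal(50.0, 15.0, n_stations)

# Observations: station mean plus day-to-day variation.
y = station_mean[station] + rng.normal(0.0, 5.0, station.size)
# Poor model: correct station mean, but its day-to-day signal is unrelated noise.
y_hat = station_mean[station] + rng.normal(0.0, 5.0, station.size)

pooled, within = pooled_and_within_r2(y, y_hat, station)
print(f"pooled r2: {pooled:.2f}, mean within-station r2: {within:.2f}")

# Hold out 25% of stations rather than 25% of rows.
train_mask, test_mask = group_split(station, 0.25, rng)
```

The pooled score here is driven almost entirely by differences between stations, which is the Simpson's-paradox effect the abstract describes. Which grouping is appropriate (station, region, time block) depends on how the model will be used, so the station-level grouping above is only one possible choice.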

Updated: 2024-03-08