Data Management: An Arcane Old Friend Becoming “the Fourth Paradigm” of Science
Groundwater Monitoring & Remediation (IF 1.9), Pub Date: 2022-08-03, DOI: 10.1111/gwmr.12548
J. Blotevogel, C. Newell, J. Meyer, K. Karimi Askarani, J.F. Devlin

The problem of managing data was largely recognized with the advent of digital computing in the 1950s. In the early days, data management was little more than the physical storage of tapes and punch cards. Not long afterwards the organization of the data became the focus of data management, and databases were born. The concept of “data management” is credited to the Association of Data Processing Service Organizations, which was concerned with training and quality assurance metrics (Foote 2022). Over the past six decades, data management has become an umbrella term that includes such concerns as data governance, the enforcement of policies around data, data storage, database management, and data cataloging. In addition to dealing with the data themselves, data management involves the curation of metadata, which includes information such as the names of the data collectors, the dates and locations of data acquisition, and a persistent identifier such as a URL or DOI.

As little as a decade ago, data management for many scientists and engineers amounted to the production of reports and theses, often with appendices that presented tables of data. In some cases, these were supported by computer files of the tables. However, that practice is not feasible when large datasets, so-called “big data,” are involved. Moreover, old VisiCalc, Lotus 1-2-3, or even Excel files, not to mention the media on which they were stored, are not guaranteed to be reliable, viable long-term storage platforms. Today, data management involves the use of cloud storage, data lakes, and data warehouses. In 2011 the National Science Foundation began to require data management plans in proposals, and other agencies such as the U.S. EPA, the U.S. Geological Survey, the U.S. Department of Energy, and the U.S. Department of Defense are similarly concerned with data archiving. In addition to funding agencies, journals are now requiring datasets to be made available on persistent platforms readily accessible to readers. With the advent of hydrogeology from satellites, aquifer monitoring with passive sensors, and high-resolution chemical analyses, enormous datasets will become the norm for many practitioners and clients. Sophisticated tools that utilize artificial intelligence will be employed to query the data and, for some applications, make operational decisions in real time.

With massive and more diverse datasets, plus powerful data analysis tools such as machine learning, data visualization, and exploratory data analysis, some have postulated that we may be entering a new paradigm for scientific discovery: the so-called “Fourth Paradigm.” Jim Gray of Microsoft (d. 2007), to whom this concept is attributed (Microsoft Research 2022), observed that the history of science has been characterized by three key paradigms: (1) empirical evidence (e.g., Mendel and genetics); (2) scientific theory (e.g., Einstein and physics); and, more recently, (3) computational science (e.g., groundwater fate and transport models). However, Gray postulated that the exponential growth of “big data” and powerful data analysis tools are now leading science into a “Fourth Paradigm,” where scientific discovery will be accelerated by data-intensive approaches (Hey et al. 2009).

For those of us who have worked with data and information, the world is quickly changing. In this issue we present a snapshot of data management and analysis examples in several articles that reflect current practices in the emergent “Fourth Paradigm” of science. It may be interesting to return to this theme in 5 years and examine how data management and analysis practices have changed and how they influence the progression of scientific knowledge in the groundwater field.



Updated: 2022-08-03