Addressing the need for interactive, efficient, and reproducible data processing in ecology with the datacleanr R package
Alexander G. Hurley (Climate Dynamics and Landscape Evolution, GFZ German Research Centre for Geosciences, Potsdam), Richard L. Peters (Laboratory of Plant Ecology, Department of Plants and Crops, Faculty of Bioscience Engineering)
Date: 2022-05
Ecological research, like all Earth system sciences, is becoming increasingly data-rich. Tools for processing “big data” are continuously developed to meet corresponding technical and logistical challenges. However, even at smaller scales, data sets may be challenging when best practices in data exploration, quality control and reproducibility are to be met. This can occur when conventional methods, such as generating and assessing diagnostic visualizations or tables, become unfeasible due to time and practicality constraints. Interactive processing can alleviate this issue, and is increasingly utilized to ensure that large data sets are diligently handled. However, recent interactive tools rarely enable data manipulation, may not generate reproducible outputs, or are typically data/domain-specific. We developed datacleanr, an interactive tool that facilitates best practices in data exploration, quality control (e.g., outlier assessment) and flexible processing for multiple tabular data types, including time series and georeferenced data. The package is open-source, and based on the R programming language. A key functionality of datacleanr is the “reproducible recipe”: a translation of all interactive actions into R code, which can be integrated into existing analysis pipelines. This enables researchers experienced with script-based workflows to utilize the strengths of interactive processing without sacrificing their usual work style or functionalities from other (R) packages. We demonstrate the package’s utility by addressing two common issues during data analyses, namely 1) identifying problematic structures and artefacts in hierarchically nested data, and 2) preventing excessive loss of data from ‘coarse,’ code-based filtering of time series. Ultimately, with datacleanr we aim to improve researchers’ workflows and increase confidence in and reproducibility of their results.
Funding: AGH and IH were supported through the Helmholtz Climate Initiative (HI-CAM), funded by the Helmholtz Initiative and Networking Fund (https://www.helmholtz.de/en/about-us/the-association/initiating-and-networking/); the authors are responsible for the content of this publication. RLP acknowledges support of the Swiss National Science Foundation (http://www.snf.ch/), Grant P2BSP3_184475. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Data Availability: All software and data necessary for reproducibility are publicly available. The software is available from an online repository at https://github.com/the-Hull/datacleanr and via CRAN (https://cran.r-project.org/package=datacleanr); the software version used for this publication is archived at https://doi.org/10.5281/zenodo.6337609. Eddy covariance data are available online from the FLUXNET2015 webpage (http://fluxnet.fluxdata.org/data/fluxnet2015-dataset/). The allometry and trait data are available at https://github.com/dfalster/baad. The Berlin street and park tree data are available at https://daten.berlin.de/. The data to reproduce the profiling and time series cleaning examples are archived at https://doi.org/10.5281/zenodo.4550726.
Copyright: © 2022 Hurley et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Introduction
Ecology, like all Earth system sciences, is increasingly data-rich [e.g., 1]. These data are a boon for novel inferences, and increasingly inform decision making [2, 3], for example, through databases from coordinated efforts that facilitate synoptic studies of carbon fluxes [e.g., FLUXNET, 4] and stocks [5] or ecosystem functioning [e.g., via trait databases like TRY, 6]. Low-cost monitoring and sensing solutions have also immensely increased the amount of data individual researchers can produce [e.g., 7]. However, the data deluge, often from heterogeneous sources, introduces new logistical and computational challenges for researchers [7, 8] wanting to maintain best practices in data analyses, reproducibility and transparency [see frameworks on workflow implementation in 9, 10]. It is clear that we need not only frameworks, but also flexible tools to deal with the ever-increasing, heterogeneous data and the corresponding issues.
Paramount to any analyses is ensuring the validity of input data through adequate exploration and quality control, which allows identifying any idiosyncrasies, outliers or erroneous structures. However, with growing data volumes this becomes increasingly difficult. Indeed, several definitions establish “big data” at the threshold where single entities (i.e., researchers, institutions, disciplines) are no longer able to manage and process a given data set due to its size or complexity [e.g., 11, 12]. Yet, several current research applications in ecology and Earth system science require handling more than gigabyte-scale data and regularly lead to the development of dedicated and domain-specific processing pipelines and tools, e.g., processing of raw data from FLUXNET [4] or automating data assimilation from long-term experiments [13].
Individual scientists, however, frequently encounter data sets smaller than this, which nonetheless challenge the feasibility of common data processing and exploration methods. These include the best practice examples of generating static diagnostic/summary visualizations, statistics and tables for detecting problematic observations [e.g., 14, 15]. Data sets of this intermediate scale are termed “high-volume,” rather than “big,” for the purposes of this study. Issues with these data often arise when the dimensions and data types require many of the aforementioned items (e.g., n-dimensional visualizations), and their individual assessment becomes unfeasible due to time and practicality constraints, even when their generation can be largely automated. Hence, they can pose a challenge even for experienced researchers adept at script-based analyses, if convenient tools do not exist or are financially inaccessible due to commercial licensing. For instance, over-plotting may require generating several static visualizations for nested categorical levels, such as branch, individual tree and forest stand, or for spatial granularity, such as plot, site and region. Furthermore, time series from monitoring equipment may show issues related to sensor drift, step-shifts, or random sensor errors. While gap-filling, trend-adjustment and outlier-removal algorithms exist for these circumstances [e.g., 16, 17], subsequent manual checking is usually still advised, leading to similar issues as above. For time series, in particular, problematic periods (e.g., systematically occurring sensor errors) may be removed entirely for convenience in code-based processing; by contrast, interactive engagement down to individual observations allows more diligence to be applied and more data to be retained.
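To make this contrast concrete, the following minimal R sketch (all data, column names and thresholds are hypothetical, and not taken from the case studies below) compares such ‘coarse,’ code-based filtering, which discards an entire day because it contains a handful of erroneous observations, with a targeted removal of only the offending values:

# Hypothetical half-hourly sensor record spanning two days
library(dplyr)
set.seed(1)
ts <- tibble(
  timestamp = seq(as.POSIXct("2020-07-01", tz = "UTC"), by = "30 min", length.out = 96),
  sap_flow  = rnorm(96, mean = 50, sd = 5)
)
ts$sap_flow[c(20, 21)] <- c(250, 240)  # two artificial sensor spikes on day one

# "Coarse" approach: discard every day containing any flagged value
bad_days <- ts %>%
  group_by(day = as.Date(timestamp)) %>%
  summarise(has_error = any(sap_flow > 150)) %>%
  filter(has_error) %>%
  pull(day)
ts_coarse <- filter(ts, !(as.Date(timestamp) %in% bad_days))

# Targeted alternative: remove only the offending observations
ts_targeted <- filter(ts, sap_flow <= 150)

nrow(ts_coarse)   # 48 observations remain: an entire day is lost
nrow(ts_targeted) # 94 observations remain: only the two spikes are removed

Identifying which individual observations to drop is precisely the step that becomes tedious without interactive inspection, and it is this step that interactive tools are intended to support.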
Ideally, researchers should be able to engage with their data across scales and dimensions as diligently as needed, with as little effort as possible. Accordingly, interactive processing is increasingly called for and deemed critical [18] for ensuring best practices in data exploration and quality control when dealing with high-volume data and beyond [e.g., 19, 20]. Indeed, interactive exploration is increasingly provided through open-source graphing frameworks (e.g., plotly, https://plotly.com/, or highcharts, https://highcharts.com/) and/or commercially licensed software (e.g., Tableau®, https://tableau.com/). However, actual data manipulation, and especially the generation of subsequent outputs that are fully reproducible, are far less common features; this may foster reluctance to share analysis code [e.g., 21]. Further issues can arise when outputs depend on (commercially licensed) platforms or software and are thus not easily integrated with other widely used languages, such as R [22] or Python [23]. Interactive, reproducible processing is, therefore, typically linked to method-specific workflows within research domains, for instance, to annotate images [e.g., 24] or acoustic files [e.g., 25], or to explore spatial and time series data [e.g., 20, 26].
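For illustration, the short R sketch below (using the built-in iris data purely as a stand-in) shows this kind of interactive exploration with the plotly package: the resulting figure supports zooming, panning and hovering over individual points, but any problematic observations identified this way still have to be removed in separately written code.

# Interactive scatter plot for visual inspection; selections made in the
# browser do not, by themselves, yield reproducible data-manipulation code
library(plotly)
plot_ly(
  data  = iris,
  x     = ~Sepal.Length,
  y     = ~Petal.Length,
  color = ~Species,
  type  = "scatter",
  mode  = "markers"
)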
There is a clear need for interactive tools that can facilitate best practices in processing heterogeneous, high-volume data, while enabling interoperability with reproducible workflows. To address this, we developed datacleanr: an open-source, R-based package containing a graphical user interface for rapid data exploration and filtering, as well as interactive visualization and annotation of various data types, including spatial (georeferenced) and time series observations. datacleanr is designed to fit into existing, scripted (R) processing pipelines without sacrificing the benefits of interactivity: this is achieved through features that allow validating the results of previous quality control, and by generating a code script to repeat any interactive operation. The code script can be slotted into existing workflows, and datacleanr’s output can hence be directly used for subsequent reproducible analyses.
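As a sketch of how this integration can look in practice (the file, column and recipe names here are hypothetical, and the exact code emitted by the reproducible recipe depends on the interactive actions taken), an interactive datacleanr session can sit inside an otherwise scripted pipeline as follows:

# install.packages("datacleanr")  # released on CRAN
library(datacleanr)
library(dplyr)

# 1. Existing, scripted pre-processing (hypothetical file and column names)
trees <- read.csv("berlin_street_trees.csv") %>%
  mutate(species = factor(species))

# 2. Interactive step: launch the datacleanr app to group, visualise,
#    filter and annotate/flag observations
dcr_app(trees)

# 3. After the session, the "reproducible recipe" (R code mirroring every
#    interactive action) can be pasted or sourced here, so that re-running
#    the script reproduces the cleaned data without re-opening the app:
# source("trees_cleaning_recipe.R")  # hypothetical recipe file

Because the recipe is plain R code, it can be version-controlled and shared alongside the rest of the analysis.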
Below we provide an overview of the package. Additionally, we demonstrate datacleanr’s utility with two ecology-based use cases addressing common issues during data processing: 1) identifying problematic data structures and artefacts using an urban tree survey, where data are nested by species, street and city district; and 2) preventing excessive loss of data from “coarse,” code-based filtering in messy time series of sap flow data, bolstering subsequent analyses.
Lastly, we provide an outlook for future developments and conclude by inviting the community to contribute, further increasing datacleanr’s capabilities and reach.