Implementing GitHub Actions continuous integration to reduce error rates in ecological data collection
- Accurate field data are essential to understanding ecological systems and forecasting their responses to global change. Yet, data collection errors are common, and data analysis often lags far enough behind its collection that many errors can no longer be corrected, nor can anomalous observations be revisited. Needed is a system in which data quality assurance and control (QA/QC), along with the production of basic data summaries, can be automated immediately following data collection.
- Here, we implement and test a system to satisfy these needs. For two annual tree mortality censuses and a dendrometer band survey at two forest research sites, we used GitHub Actions continuous integration (CI) to automate data QA/QC and run routine data wrangling scripts to produce cleaned datasets ready for analysis.
- Automating this system had numerous benefits, including (1) the production of near real-time information on data collection status and errors requiring correction, resulting in final datasets free of detectable errors, (2) an apparent learning effect among field technicians, wherein original error rates in field data collection declined significantly following implementation of the system, and (3) an assurance of computational reproducibility, that is, robustness of the system to changes in code, data and software.
- By implementing CI, researchers can ensure that datasets are free of any errors for which a test can be coded. The result is dramatically improved data quality, increased skill among field technicians, and reduced need for expert oversight. Furthermore, we view CI implementation as a first step towards a data collection and analysis pipeline that is also more responsive to rapidly changing ecological dynamics, making it better suited to study ecological systems in the current era of rapid environmental change.
Journal: Methods in Ecology and Evolution
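
To make the idea of "errors for which a test can be coded" concrete, the following is a minimal sketch of the kind of QA/QC check a CI run could execute each time new field data are uploaded. It is not the authors' implementation: the column names (`tag`, `status`, `dbh_cm`), the status vocabulary, and the three checks are hypothetical and chosen only for illustration.

```python
import csv
import math

# Hypothetical status codes; the source does not list the census vocabulary.
VALID_STATUSES = {"alive", "dead", "unhealthy"}


def check_census(rows):
    """Return (row_number, tag, message) for every failed check.

    `rows` is any iterable of dicts, e.g. a csv.DictReader.
    Row numbers count data rows from 1 (the header is not counted).
    """
    errors = []
    seen_tags = set()
    for i, row in enumerate(rows, start=1):
        # Check 1: every stem needs a unique, non-empty tag.
        tag = (row.get("tag") or "").strip()
        if not tag:
            errors.append((i, tag, "missing tag"))
        elif tag in seen_tags:
            errors.append((i, tag, "duplicate tag"))
        else:
            seen_tags.add(tag)

        # Check 2: status must come from the agreed vocabulary.
        status = (row.get("status") or "").strip().lower()
        if status not in VALID_STATUSES:
            errors.append((i, tag, f"unknown status {status!r}"))

        # Check 3: diameter must be a finite, positive number.
        try:
            dbh = float(row.get("dbh_cm"))
        except (TypeError, ValueError):
            errors.append((i, tag, "dbh_cm is not a number"))
        else:
            if not math.isfinite(dbh) or dbh <= 0:
                errors.append((i, tag, "dbh_cm must be positive"))
    return errors


def main(path):
    """Print an error report for the CSV at `path`; return 1 if any check failed."""
    with open(path, newline="") as handle:
        errors = check_census(csv.DictReader(handle))
    for row_number, tag, message in errors:
        print(f"row {row_number} (tag {tag or '?'}): {message}")
    return 1 if errors else 0
```

If a script passes the return value of `main` to `sys.exit`, a GitHub Actions step running that script fails whenever a check fails, because a step fails on a non-zero exit code. The printed report then serves as the list of errors requiring correction while the observations can still be revisited in the field.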