Reproducibility in Big Data with the repro package
Peikert, Aaron and Brandmaier, Andreas M.
Background
The rules of “good scientific practice” mandate that research artefacts be reproducible. Reproducibility is ensured if the same analysis applied to the same data yields the same results. Big data applications are particularly threatened by non-reproducibility: data often come from multiple sources, are large and messy, and preprocessing may rely on a variety of software packages. Under these conditions, it becomes increasingly difficult to track and document every step of an analysis pipeline and to guarantee its reproducibility.
Objectives
Big data are typically characterised by volume, variety, and velocity. Increased volume implies the need for distributed computing. Variety of data sources requires close attention to how data objects flow through an analysis. Velocity demands that results be updated dynamically. Reproducibility in the era of big data can hence no longer be a manual task for human researchers but must be supported by computational tools. Four concepts are necessary to meet these demands:
- Software management allows for distributed computing,
- dependency tracking coordinates the data flow,
- dynamic document creation keeps the results consistent and up-to-date, and
- version control tracks changes over time.
Approach
Increasing the degree of automation is crucial to ensure reproducibility in big data applications. We propose to adapt tools originally meant for software development and to apply them in research contexts (a minimal sketch follows the list):
- Software management with Docker ensures a stable software environment across changing computing environments, even across thousands of nodes in distributed settings,
- dependency tracking with Make documents and automates complex processing pipelines,
- dynamic document creation with RMarkdown helps to recreate manuscripts describing results, and
- version control with Git tracks snapshots of the analysis workflow over time.
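To make the interplay concrete, below is a minimal sketch of a dynamic RMarkdown manuscript. The chunk and the inline expression are re-evaluated on every build, so a Make rule that re-knits the file keeps the reported numbers consistent with the current data; the file path and the model are illustrative placeholders.

````markdown
---
title: "Example analysis"
output: html_document
---

```{r model}
# Re-read the raw data on every build, so reported results cannot go stale.
mycars <- read.csv("data/mtcars.csv")  # illustrative path
fit <- lm(mpg ~ wt, data = mycars)
```

The estimated effect of weight on fuel efficiency is
`r round(coef(fit)["wt"], 2)` miles per gallon per 1000 lbs.
````

A single Make rule that renders this file, e.g. via rmarkdown::render(), then suffices to rebuild the manuscript whenever the data or the code change.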
While these tools have proven effective, their origin in software engineering entails a steep learning curve for researchers, who are typically not trained in using them. We believe a layer of abstraction may ease access to these tools and their merits for reproducibility. The R package repro wraps everyday tasks into composable building blocks and makes it easier to follow best practices by automatically configuring the needed tools, as illustrated below.
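The sketch below assumes the metadata format and the check() and automate() functions as described in the repro documentation; the declared package and the file paths are illustrative placeholders. A researcher declares all dependencies once in the manuscript's YAML header and lets repro configure Docker and Make:

```r
# Declared once in the YAML header of the RMarkdown manuscript (illustrative):
#
# repro:
#   packages:
#     - lme4
#   scripts:
#     - R/clean.R
#   data:
#     mycars: data/mtcars.csv

library(repro)

# Verify that Git, Make, and Docker are installed and configured;
# advice is printed for anything that is missing.
check()

# Generate a Dockerfile and a Makefile from the metadata above,
# so the entire pipeline can be rebuilt with a single call to `make`.
automate()
```

From then on, re-running `make` inside the generated Docker container reproduces the results, independent of the host's software environment.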
Implications
To meet the demands of reproducibility in big data, the research community must move to a structured and automated approach. This will require researchers to adopt new tools and workflows. The upfront investment will pay for itself through more robust analyses that scale better, are well structured, and allow future users to recreate past results. Finally, following the proposed best practices of reproducibility benefits collaboration among multiple authors working on a single analysis.