Reproducibility in Big Data with the repro package
Peikert, Aaron and Brandmaier, Andreas M.
Background
The rules of “good scientific practice” mandate that research artefacts be reproducible. Reproducibility is ensured if the same analysis applied to the same data yields the same results. Big data applications are particularly threatened by non-reproducibility: data often come from multiple sources, are large and messy, and preprocessing may rely on a variety of software packages. Under these conditions, it becomes increasingly difficult to track and document every step of an analysis pipeline and to guarantee its reproducibility.
Objectives
Big data are typically characterised by volume, variety, and velocity. Increased volume implies the need for distributed computing. Variety of data sources requires close attention to how data objects flow through an analysis. Velocity demands that results be updated dynamically. Reproducibility in the era of big data can hence no longer be a manual task for human researchers but must be supported by computational tools. Four concepts are necessary to meet these demands:
- Software management allows for distributed computing,
- dependency tracking coordinates the data flow,
- dynamic document creation keeps the results consistent and up-to-date, and
- version control tracks changes over time.
Approach
Increasing the degree of automation is crucial to ensure reproducibility in big data applications. We propose to adapt tools originally meant for software development and to apply them in research contexts (a minimal sketch follows the list):
- Software management with Docker ensures a stable software environment across changing computing environments, even across thousands of nodes in distributed settings,
- dependency tracking with Make documents and automates complex processing pipelines,
- dynamic document creation with RMarkdown helps to recreate manuscripts describing results, and
- version control with Git tracks snapshots of the analysis workflow over time.
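To make the interplay concrete, below is a minimal sketch of a dynamic RMarkdown manuscript. The chunk and the inline expression are re-evaluated on every build, so a Make rule that re-knits the file keeps the reported numbers consistent with the current data; the file path and the model are illustrative placeholders.

````markdown
---
title: "Example analysis"
output: html_document
---

```{r model}
# Re-read the raw data on every build, so reported results cannot go stale.
mycars <- read.csv("data/mtcars.csv")  # illustrative path
fit <- lm(mpg ~ wt, data = mycars)
```

The estimated effect of weight on fuel efficiency is
`r round(coef(fit)["wt"], 2)` miles per gallon per 1000 lbs.
````

A single Make rule that renders this file, e.g. via rmarkdown::render(), then suffices to rebuild the manuscript whenever the data or the code change.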
While these tools have proven effective, their origin in software engineering entails a steep learning curve for researchers, who are typically not trained in using them. We believe a layer of abstraction may ease access to these tools and their merits for reproducibility. The R package repro wraps everyday tasks into composable building blocks and makes it easier to follow best practices by automatically configuring the needed tools, as illustrated below.
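The sketch below assumes the metadata format and the check() and automate() functions as described in the repro documentation; the declared package and the file paths are illustrative placeholders. A researcher declares all dependencies once in the manuscript's YAML header and lets repro configure Docker and Make:

```r
# Declared once in the YAML header of the RMarkdown manuscript (illustrative):
#
# repro:
#   packages:
#     - lme4
#   scripts:
#     - R/clean.R
#   data:
#     mycars: data/mtcars.csv

library(repro)

# Verify that Git, Make, and Docker are installed and configured;
# advice is printed for anything that is missing.
check()

# Generate a Dockerfile and a Makefile from the metadata above,
# so the entire pipeline can be rebuilt with a single call to `make`.
automate()
```

From then on, re-running `make` inside the generated Docker container reproduces the results, independent of the host's software environment.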
Implications
To meet the demands of reproducibility in big data, the research community must move to a structured and automated approach. This will require researchers to adopt new tools and workflows. The upfront investment will pay for itself through more robust analyses that scale better, are well structured, and allow future users to recreate past results. Finally, following the proposed best practices of reproducibility benefits collaboration among multiple authors working on a single analysis.