One of the most valuable talks of the day for me was from Fernando Chirigati from New York University. He introduced us to a useful new tool called ReproZip. He made the point that the computational environment is as important as the data itself for the reproducibility of research data. This could include information about libraries used, environment variables and options. You can not expect your depositors to find or document all of the dependencies (or your future users to install them). What ReproZip does is package up all the necessary dependencies along with the data itself. This package can then be archived and re-used in the future. ReproZip can also be used to unpack and re-use the data in the future. I can see a very real use case for this for researchers within our institution.
Workflow is a well-established means by which to capture scientific methods in an abstract graph of interrelated processing tasks. The reproducibility of scientific workflows is therefore fundamental to reproducible e-Science. However, the ability to record all the required details so as to make a workflow fully reproducible is a long-standing problem that is very difficult to solve. In this paper, we introduce an approach that integrates system description, source control, container management and automatic deployment techniques to facilitate workflow reproducibility. We have developed a framework that leverages this integration to support workflow execution, re-execution and reproducibility in the cloud and in a personal computing environment. We demonstrate the effectiveness of our approach by ex-amining various aspects of repeatability and reproducibility on real scientific workflows. The framework allows workflow andtask images to be captured automatically, which improves not only repeatability but also runtime performance. It also gives workflows portability across different cloud environments. Finally, the framework can also track changes in the development of tasks and workflows to protect them from unintentional failures.
Publishing scientific results without the detailed execution environments describing how the results were collected makes it difficult or even impossible for the reader to reproduce thework. However, the configurations of the execution environ-ments are too complex to be described easily by authors. To solve this problem, we propose a framework facilitating the conduct of reproducible research by tracking, creating, and preserving the comprehensive execution environments with Umbrella. The framework includes a lightweight, persistent anddeployable execution environment specification, an execution engine which creates the specified execution environments, and an archiver which archives an execution environment into persistent storage services like Amazon S3 and Open Science Framework (OSF). The execution engine utilizes sandbox techniques like virtual machines (VMs), Linux containers and user-space tracers, to cre-ate an execution environment, and allows common dependencies like base OS images to be shared by sandboxes for different applications. We evaluate our framework by utilizing it to reproduce three scientific applications from epidemiology, scene rendering, and high energy physics. We evaluate the time and space overhead of reproducing these applications, and the effectiveness of the chosen archive unit and mounting mechanism for allowing different applications to share dependencies. Our results show that these applications can be reproduced using different sandbox techniques successfully and efficiently, even through the overhead andperformance slightly vary.
Computing as a whole suffers from a crisis of reproducibility. Programs executed in one context are aston-ishingly hard to reproduce in another context, resulting in wasted effort by people and general distrust of results produced by computer. The root of the problem lies in the fact that every program has implicit dependencies on data and execution environment whichare rarely understood by the end user. To address this problem, we present PRUNE, the Preserving Run Environment.In PRUNE, every task to be executed is wrapped in a functional interface and coupled with a strictly defined environment. The task is then executed by PRUNErather than the user to ensure reproducibility. As a scientific workflow evolves in PRUNE, a growing but immutable tree of derived data is created. The provenance of every item in the system can be precisely described, facilitating sharing and modification between collaborating researchers, along with efficient management of limited storage space. We present the user interface and the initial prototype of PRUNE, and demonstrate its application in matching records and comparing surnames in U.S. Censuses.
You download the data and complete your analysis with ample time to spare. Then, just before deadline, your collaborator lets you know that they've "fixed a data error". Now, you have to do your analysis all over again. This is the reproducibility horror story.
If you are an R user it has probably happened to you that you upgraded some R package in your R installation, and then suddenly your R script or application stopped working. One strategy is that you create a new package library for a new project. A package library is just a directory that holds all installed R packages. (In addition to the ones that are installed with R itself.) This is why we created the pkgsnap tool. This is a very simple package with two exported functions: 1) snap takes a snapshot of your project library. It writes out the names and versions of the currently installed packages into a text file. You can put this text file into the version control repository of the project, to make sure it is not lost, and 2) restore uses the snapshot file to recreate the package project library from scratch. It installs the recorded versions of the recorded packages, in the right order.