One of the most valuable talks of the day for me was from Fernando Chirigati of New York University. He introduced us to a useful new tool called ReproZip. He made the point that, for research to be reproducible, the computational environment is as important as the data itself. That environment can include the libraries used, environment variables and options. You cannot expect your depositors to find or document all of the dependencies (or your future users to install them). What ReproZip does is package up all the necessary dependencies along with the data itself. This package can then be archived, and ReproZip can also be used later to unpack it and re-use the data. I can see a very real use case for this for researchers within our institution.
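To make the "computational environment" point concrete, here is a minimal Python sketch of recording the kind of information mentioned in the talk (libraries, environment variables, options) in a manifest file. This is not how ReproZip works and does not use its API; it is only an illustration, and the function names `capture_environment` and `write_manifest`, the manifest layout, and the choice of environment variable prefixes are all assumptions made for this example.

```python
import json
import os
import platform
import sys
from importlib import metadata


def capture_environment(env_prefixes=("PATH", "PYTHON", "LANG", "LC_")):
    """Return a JSON-serialisable snapshot of the current Python environment.

    Hypothetical helper for illustration only; not part of ReproZip.
    """
    # Installed Python distributions and their versions ("libraries used").
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name") or "unknown"
        packages[name] = dist.version

    # Only keep environment variables whose names start with a chosen prefix,
    # so that unrelated (and possibly sensitive) variables are left out.
    variables = {
        key: value
        for key, value in os.environ.items()
        if key.startswith(tuple(env_prefixes))
    }

    return {
        "python": sys.version,
        "platform": platform.platform(),
        "argv": list(sys.argv),  # command-line options of this run
        "packages": packages,
        "environment": variables,
    }


def write_manifest(path):
    """Write the snapshot to `path` as JSON and return it."""
    snapshot = capture_environment()
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(snapshot, handle, indent=2, sort_keys=True)
    return snapshot
```

Calling `write_manifest("environment.json")` next to a dataset records a description of the environment. A manifest like this only documents dependencies, whereas the talk described ReproZip as packaging the dependencies themselves along with the data.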
Workflow is a well-established means by which to capture scientific methods in an abstract graph of interrelated processing tasks. The reproducibility of scientific workflows is therefore fundamental to reproducible e-Science. However, the ability to record all the required details so as to make a workflow fully reproducible is a long-standing problem that is very difficult to solve. In this paper, we introduce an approach that integrates system description, source control, container management and automatic deployment techniques to facilitate workflow reproducibility. We have developed a framework that leverages this integration to support workflow execution, re-execution and reproducibility in the cloud and in a personal computing environment. We demonstrate the effectiveness of our approach by examining various aspects of repeatability and reproducibility on real scientific workflows. The framework allows workflow and task images to be captured automatically, which improves not only repeatability but also runtime performance. It also gives workflows portability across different cloud environments. Finally, the framework can also track changes in the development of tasks and workflows to protect them from unintentional failures.
Crowdsourcing is a multidisciplinary research area including disciplines like artificial intelligence, human-computer interaction, database, and social science. One of the main objectives of AAAI HCOMP conferences is to bring together researchers from different fields and provide them opportunities to exchange ideas and share new research results. To facilitate cooperation across disciplines, reproducibility is a crucial factor, but unfortunately it has not gotten enough attention in the HCOMP community.
A new version of ReproZip has been released, adding some bugfixes and new commands related to distributed or server experiments.
Remi Rampin and Fernando Chirigati of NYU will be presenting ReproZip at this year's ACM SIGMOD conference. ReproZip enables researchers to create a compendium of their Linux experiment by automatically tracking and identifying all its required dependencies (data files, libraries, configuration files, etc.).
A ReproZip demo has been accepted at SIGMOD 2016: "ReproZip: Computational Reproducibility With Ease." F. Chirigati, R. Rampin, D. Shasha, and J. Freire.