In this dissertation I deal with the requirements and the analysis of the reproducibility. I set out methods based on provenance data to handle or eliminate the unavailable or changing descriptors in order to be able reproduce an – in other way – non-reproducible scientific workflow. In this way I intend to support the scientist’s community in designing and creating reproducible scientific workflows.
The concept of reproducibility is one of the foundations of scientific practice and the bedrock by which scientific validity can be established. However, the extent to which reproducibility is being achieved in the sciences is currently under question. Several studies have shown that much peer-reviewed scientific literature is not reproducible. One crucial contributor to the obstruction of reproducibility is the lack of transparency of original data and methods. Reproducibility, the ability of scientific results and conclusions to be independently replicated by independent parties, potentially using different tools and approaches, can only be achieved if data and methods are fully disclosed.
Reproducibility has become one of biology’s most pressing issues. This impasse has been fueled by the combined reliance on increasingly complex data analysis methods and the exponential growth of biological datasets. When considering the installation, deployment and maintenance of bioinformatic pipelines, an even more challenging picture emerges due to the lack of community standards. The effect of limited standards on reproducibility is amplified by the very diverse range of computational platforms and configurations on which these applications are expected to be applied (workstations, clusters, HPC, clouds, etc.). With no established standard at any level, diversity cannot be taken for granted.
This blog is based on part of a talk I gave in January 2017, and the thinking behind it, in turn, is based on my view of a series of recent talks and blogs, and how they might be fit together. The short summary is that general software reproducibly is hard at best, and may not be practical except in special cases.
Explore how Singularity liberates non-privileged users and host resources (such as interconnects, resource managers, file systems, accelerators …) allowing users to take full control to set-up and run in their native environments. This talk explores Singularity how it combines software packaging models with minimalistic containers to create very lightweight application bundles which can be simply executed and contained completely within their environment or be used to interact directly with the host file systems at native speeds. A Singularity application bundle can be as simple as containing a single binary application or as complicated as containing an entire workflow and is as flexible as you will need.
Researchers from the UW’s eScience Institute, New York University Center for Data Science and Berkeley Institute for Data Science (BIDS) have authored a new book titled The Practice of Reproducible Research. Representatives from the three universities, all Moore-Sloan Data Science Environments partners, joined on January 27, 2017, at a symposium hosted by BIDS. There, speakers discussed the book’s content, including case studies, lessons learned and the potential future of reproducible research practices.