Toward Enabling Reproducibility for Data-Intensive Research Using the Whole Tale Platform

Whole Tale (http://wholetale.org) is a web-based, open-source platform for reproducible research supporting the creation, sharing, execution, and verification of "Tales" for the scientific research community. Tales are executable research objects that capture the code, data, and environment, along with narrative and workflow information, needed to re-create computational results from scientific studies. Creating research objects that enable reproducibility, transparency, and re-execution for computational experiments requiring significant compute resources or massive data is an especially challenging open problem. We describe opportunities, challenges, and solutions for facilitating reproducibility in data- and compute-intensive research, which we call "Tales at Scale," using the Whole Tale computing platform. We highlight challenges and solutions in frontend responsiveness, gaps in current middleware design and implementation, network restrictions, containerization, and data access. Finally, we discuss challenges in packaging computational experiment implementations for portable data-intensive Tales and outline future work.
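
As an illustration of what such an executable research object bundles together, the following minimal Python sketch models a Tale as a record of code, data references, environment, and narrative. The class and field names are hypothetical and do not reflect the actual Whole Tale manifest format; large external datasets are kept as references rather than copies, which is central to the "Tales at Scale" problem.

```python
from dataclasses import dataclass

# A hypothetical, minimal model of what a "Tale" bundles together; the
# real Whole Tale manifest format differs, so this only illustrates the idea.
@dataclass
class Tale:
    title: str
    environment: str     # container image pinning the software stack
    code: list[str]      # analysis scripts
    data: list[str]      # references to external data (URLs, not copies)
    narrative: str       # the write-up describing the workflow

tale = Tale(
    title="Example data-intensive study",
    environment="docker.io/example/analysis-env:1.0",  # hypothetical image
    code=["run_analysis.py"],                          # hypothetical script
    data=["https://example.org/dataset/12345"],        # hypothetical reference
    narrative="paper.md",
)
print(tale)
```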

Assessing the impact of introductory programming workshops on the computational reproducibility of biomedical workflows

Introduction: As biomedical research becomes more data-intensive, computational reproducibility is a growing area of importance. Unfortunately, many biomedical researchers have not received formal computational training and often struggle to produce results that can be reproduced using the same data, code, and methods. Programming workshops can be a tool to teach new computational methods, but it is not always clear whether researchers are able to use their new skills to make their work more computationally reproducible. Methods: This mixed methods study consisted of in-depth interviews with 14 biomedical researchers before and after participation in an introductory programming workshop. During the interviews, participants described their research workflows and responded to a quantitative checklist measuring reproducible behaviors. The interview data were analyzed using a thematic analysis approach, and the pre- and post-workshop checklist scores were compared to assess the impact of the workshop on the computational reproducibility of the researchers' workflows. Results: Pre- and post-workshop scores on a checklist of reproducible behaviors did not increase in a statistically significant manner. The qualitative interviews revealed that several participants had made small changes to their workflows, including switching to open source programming languages for their data cleaning, analysis, and visualization. Overall, many of the participants indicated higher levels of programming literacy and an interest in further training. Factors that enabled change included supportive environments and an immediate research need, while barriers included collaborators who were resistant to new tools and a lack of time. Conclusion: While none of the participants completely changed their workflows, many of them did incorporate new practices, tools, or methods that helped make their work more reproducible and transparent to other researchers. This indicates that programming workshops now offered by libraries and other organizations contribute to computational reproducibility training for researchers.
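
The abstract does not name the statistical test applied to the paired checklist scores; as a hedged illustration, the sketch below applies a Wilcoxon signed-rank test, a common choice for small paired ordinal samples, to made-up pre- and post-workshop scores for 14 participants.

```python
from scipy.stats import wilcoxon

# Made-up pre- and post-workshop checklist scores for 14 participants;
# these values are illustrative, not the study's data.
pre  = [4, 5, 3, 6, 4, 5, 2, 7, 5, 4, 6, 3, 5, 4]
post = [5, 5, 4, 6, 3, 6, 3, 7, 5, 5, 5, 3, 6, 4]

# Wilcoxon signed-rank test on the paired scores; the abstract does not
# specify which test the study used, so this is one plausible choice.
statistic, p_value = wilcoxon(pre, post)
print(f"statistic={statistic:.1f}, p={p_value:.3f}")
```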

A Realistic Guide to Making Data Available Alongside Code to Improve Reproducibility

Data makes science possible. Sharing data improves visibility and makes the research process transparent. This increases trust in the work and allows for independent reproduction of results. However, a large proportion of data from published research is available only to the original authors. Despite the obvious benefits of sharing data, and despite scientists advocating for its importance, most advice on sharing data discusses its broader benefits rather than the practical considerations of sharing. This paper provides practical, actionable advice on how to actually share data alongside research. The key message is that sharing data falls on a continuum, and that entering it should come with minimal barriers.
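
One concrete, low-barrier practice in this spirit is publishing a checksum manifest next to the shared data so recipients can verify that their copy is intact. The sketch below is illustrative only, with a hypothetical directory name; it is not a method prescribed by the paper.

```python
import hashlib
from pathlib import Path

def checksum_manifest(data_dir: str, out_file: str = "MANIFEST.sha256") -> None:
    """Write a SHA-256 checksum for every file under data_dir so that
    recipients of the shared data can verify their copy arrived intact."""
    lines = []
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path}")
    Path(out_file).write_text("\n".join(lines) + "\n")

# Example: checksum_manifest("data/")  # "data/" is a hypothetical folder name
```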

Leveraging Container Technologies in a GIScience Project: A Perspective from Open Reproducible Research

Scientific reproducibility is essential for the advancement of science. It allows the results of previous studies to be reproduced, validates their conclusions, and supports new contributions that build on previous research. Nowadays, more and more authors consider that the ultimate product of academic research is the scientific manuscript together with all the necessary elements (i.e., code and data) so that others can reproduce the results. However, numerous difficulties prevent some studies from being reproduced easily (e.g., biased results, the pressure to publish, and proprietary data). In this context, we explain our experience in an attempt to improve the reproducibility of a GIScience project. According to our project needs, we evaluated a list of practices, standards and tools that may facilitate open and reproducible research in the geospatial domain, contextualising them on Peng’s reproducibility spectrum. Among these resources, we focused on containerisation technologies and performed a shallow review to reflect on the level of adoption of these technologies in combination with OSGeo software. Finally, containerisation technologies proved to enhance reproducibility, and we used UML diagrams to describe representative workflows deployed in our GIScience project.
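
As a minimal example of the kind of containerised geospatial workflow the paper discusses, the sketch below runs an OSGeo tool from a pinned container image rather than a host install. Pinning an exact image tag is what fixes the software environment; the tag shown is an assumed example, so check the registry for tags that actually exist.

```python
import subprocess

# Assumed example tag; verify against the registry before use.
IMAGE = "ghcr.io/osgeo/gdal:ubuntu-small-3.8.0"

def gdalinfo(raster_path: str) -> str:
    """Inspect a raster with gdalinfo run inside a pinned container,
    avoiding any dependence on a host GDAL installation."""
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{raster_path}:/data/input.tif:ro",  # absolute host path required
         IMAGE, "gdalinfo", "/data/input.tif"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Example: print(gdalinfo("/absolute/path/to/raster.tif"))
```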

Publishing computational research -- A review of infrastructures for reproducible and transparent scholarly communication

Funding agencies increasingly ask applicants to include data and software management plans in proposals. In addition, the author guidelines of scientific journals and conferences more often include a statement on data availability, and some reviewers reject unreproducible submissions. This trend towards open science increases the pressure on authors to provide access to the source code and data underlying the computational results in their scientific papers. Still, publishing reproducible articles is a demanding task and is not achieved simply by providing access to code scripts and data files. Consequently, several projects are developing solutions to support the publication of executable analyses alongside articles, considering the needs of the aforementioned stakeholders. The key contribution of this paper is a review of applications addressing the issue of publishing executable computational research results. We compare the approaches across properties relevant to the involved stakeholders, e.g., provided features and deployment options, and also critically discuss trends and limitations. The review can help publishers decide which system to integrate into their submission process, editors recommend tools to researchers, and authors of scientific papers adhere to reproducibility principles.
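
To make concrete what verifying an executable analysis might involve, the sketch below re-runs an analysis script and compares its regenerated output against a checksum recorded at submission time. The file names and workflow are hypothetical, offered as a generic illustration rather than a feature of any system the review covers.

```python
import hashlib
import subprocess

def verify_rerun(script: str, output_file: str, expected_sha256: str) -> bool:
    """Re-execute an analysis script, then check that the regenerated
    output matches the checksum recorded at submission time."""
    subprocess.run(["python", script], check=True)
    with open(output_file, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return digest == expected_sha256

# Example (hypothetical names; the hash is deliberately left elided):
# verify_rerun("analysis.py", "results/table1.csv", "ab34...")
```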