Reproducibility in subsurface geoscience

Reproducibility, the extent to which consistent results are obtained when an experiment or study is repeated, sits at the foundation of science. The aim of the scientific process is to produce robust findings and knowledge, with reproducibility serving as the screening tool that benchmarks how well we are implementing the scientific method. However, the re-examination of results from many disciplines has raised significant concern about the reproducibility of published findings. This concern is well founded: our ability to independently reproduce results builds trust within the scientific community, between scientists and the politicians charged with translating research findings into public policy, and with the general public. Within geoscience, discussions and practical frameworks for reproducibility are in their infancy, particularly in subsurface geoscience, an area where there are commonly significant uncertainties related to data (e.g. geographical coverage). Given the vital role of subsurface geoscience in sustainable development pathways and in achieving Net Zero, such as for carbon capture and storage, mining, and natural hazard assessment, there is likely to be increased scrutiny of the reproducibility of geoscience results. We surveyed 347 Earth scientists from a broad cross-section of academia, government, and industry to understand their experience and knowledge of reproducibility in the subsurface. More than 85% of respondents recognised that there is a reproducibility problem in subsurface geoscience, and more than 90% viewed conceptual biases as having a major impact on the robustness of their findings and the overall quality of their work. Access to data, undocumented methodologies, and confidentiality issues (e.g. use of proprietary data and methods) were identified as major barriers to reproducing published results. Overall, the survey results suggest a need for funding bodies, data providers, research groups, and publishers to build a framework and set of minimum standards for increasing the reproducibility of, and political and public trust in, the results of subsurface studies.

Immediate Feedback for Students to Solve Notebook Reproducibility Problems in the Classroom

Jupyter notebooks have gained popularity in educational settings. In France, they are among the tools used by teachers in post-secondary classes to teach programming. When students complete their assignments, they send their notebooks to the teacher for feedback or grading. However, the teacher may not be able to reproduce the results contained in the notebooks. Indeed, students rely on the non-linearity of notebooks to write and execute code cells in an arbitrary order, whereas teachers, unaware of this implicit execution order, expect to reproduce the results by running the cells linearly from top to bottom. These two modes of usage conflict, making it difficult for teachers to evaluate their students' work. This article investigates the use of immediate visual feedback to alleviate the non-reproducibility of students' notebooks. We implemented a Jupyter plug-in called Notebook Reproducibility Monitor (NoRM) that pinpoints the non-reproducible cells of a notebook as it is modified. To evaluate the benefits of this approach, we conducted a controlled study with 37 students on a programming assignment, followed by a focus group. Our results show that the plug-in significantly improves the reproducibility of notebooks without sacrificing the productivity of students.
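
The underlying check can be illustrated outside the Jupyter interface. The sketch below is not the NoRM plug-in itself, only a minimal stand-in, assuming the nbformat and nbclient libraries: it re-executes a notebook linearly from top to bottom and flags code cells whose stored outputs differ from the freshly computed ones. The text-based output comparison is a deliberately crude assumption; a real tool also has to normalise volatile output such as timestamps or object ids.

```python
# Minimal sketch (not NoRM): flag cells whose stored results a teacher could not
# reproduce by running the notebook top to bottom.
import copy
import nbformat
from nbclient import NotebookClient

def output_text(cell):
    """Flatten a code cell's stored outputs into comparable text."""
    chunks = []
    for out in cell.get("outputs", []):
        if out.get("output_type") == "stream":
            chunks.append(out.get("text", ""))
        else:
            chunks.append(out.get("data", {}).get("text/plain", ""))
    return "".join(chunks).strip()

def non_reproducible_cells(path):
    """Return indices of code cells whose stored outputs differ from a linear re-run."""
    original = nbformat.read(path, as_version=4)
    rerun = copy.deepcopy(original)
    NotebookClient(rerun, timeout=120).execute()        # top-to-bottom execution
    flagged = []
    for idx, (old, new) in enumerate(zip(original.cells, rerun.cells)):
        if old.cell_type == "code" and output_text(old) != output_text(new):
            flagged.append(idx)                         # stored result != linear re-run
    return flagged

# Hypothetical usage on a student's submission:
# print(non_reproducible_cells("assignment.ipynb"))
```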

Capturing and semantically describing provenance to tell the story of R scripts

Reproducibility is a topic that has received significant attention in recent years. Although it is considered a fundamental part of the scientific process, recent surveys have shown how difficult it is to reproduce already published work, which impacts scientists' ability to verify, validate, and reuse research findings. Recording provenance data is one approach that can help mitigate the challenges involved in the reproducibility process. When semantically well defined, provenance can describe the entire process involved in producing a given result, and the use of semantic web technologies can make the provenance data machine-actionable. With a focus on computational experiments, this work presents a package for collecting provenance data from R scripts and describing it with the REPRODUCE-ME ontology, capturing the path taken to produce the results. We describe the package implementation and demonstrate how it can help tell the story of experiments defined as R scripts to support reproducibility.
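
As a rough illustration of what semantically described provenance looks like, the sketch below records a single script execution as RDF. It is a hypothetical example, not the package presented in the paper: it is written in Python with rdflib rather than R, it uses only generic W3C PROV-O terms, and the example namespace, file names, and timestamp are invented. The REPRODUCE-ME ontology refines such generic terms with experiment-specific classes whose IRIs are not given in the abstract, so they are omitted here.

```python
# Minimal, hypothetical provenance record for one script run, using PROV-O via rdflib.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import PROV, RDF, XSD

EX = Namespace("http://example.org/run/")          # hypothetical namespace for this run

g = Graph()
g.bind("prov", PROV)
g.bind("ex", EX)

run = EX["execution-001"]                          # the script execution (a prov:Activity)
script = EX["analysis.R"]                          # the R script that was used (invented name)
result = EX["figure1.png"]                         # an output generated by the run (invented name)

g.add((run, RDF.type, PROV.Activity))
g.add((script, RDF.type, PROV.Entity))
g.add((result, RDF.type, PROV.Entity))
g.add((run, PROV.used, script))
g.add((result, PROV.wasGeneratedBy, run))
g.add((run, PROV.startedAtTime,
       Literal("2024-01-01T10:00:00", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))                # Turtle that other tools can query
```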

Improving rigor and reproducibility in nonhuman primate research

Nonhuman primates (NHPs) are a critical component of translational/preclinical biomedical research due to the strong similarities between NHP and human physiology and disease pathology. In some cases, NHPs represent the most appropriate, or even the only, animal model for complex metabolic, neurological, and infectious diseases. The increased demand for and limited availability of these valuable research subjects requires that rigor and reproducibility be a prime consideration to ensure the maximal utility of this scarce resource. Here, we discuss a number of approaches that collectively can contribute to enhanced rigor and reproducibility in NHP research.

Reproducibility Study: Comparing Rewinding and Fine-tuning in Neural Network Pruning

Scope of reproducibility: We reproduce "Comparing Rewinding and Fine-tuning in Neural Network Pruning" (arXiv:2003.02389). In this work the authors compare three approaches to retraining neural networks after pruning: 1) fine-tuning, 2) rewinding weights as in arXiv:1803.03635, and 3) a new, original method, learning rate rewinding, building upon the Lottery Ticket Hypothesis. We reproduce the results of all three approaches but focus on verifying learning rate rewinding, since it is newly proposed and is described as a universal alternative to the other methods. We used CIFAR10 for most reproductions, along with additional experiments on the larger CIFAR100, which extend the results originally provided by the authors. We also extended the list of tested network architectures to include Wide ResNets. The new experiments led us to discover a limitation of learning rate rewinding: it can worsen pruning results on large architectures. Results: We were able to reproduce the exact results reported by the authors in all originally reported scenarios. However, extended results on larger Wide Residual Networks demonstrated the limitations of the newly proposed learning rate rewinding: we observed a previously unreported accuracy degradation for low sparsity ranges. Nevertheless, the general conclusion of the paper still holds and was indeed reproduced.
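
To make the three retraining schemes concrete, the toy sketch below shows how they differ only in which weights and which learning-rate schedule are used after pruning. It is a minimal PyTorch illustration under assumed names (`magnitude_mask`, `train`, a toy linear model and random data), not the configuration used in either the original paper or this reproduction.

```python
# Toy comparison of fine-tuning, weight rewinding, and learning rate rewinding.
import copy
import torch

def magnitude_mask(model, sparsity=0.5):
    """One global binary mask that keeps the largest-magnitude parameters."""
    flat = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    threshold = torch.quantile(flat, sparsity)
    return {name: (p.detach().abs() > threshold).float()
            for name, p in model.named_parameters()}

def train(model, mask, lr_schedule, data):
    """Toy training loop: one learning rate per epoch, mask re-applied after each step."""
    for lr in lr_schedule:
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for x, y in data:
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            opt.step()
            with torch.no_grad():                    # keep pruned weights at zero
                for name, p in model.named_parameters():
                    p.mul_(mask[name])

# Toy model, data, and schedule (placeholders, not the original experimental setup).
model = torch.nn.Linear(8, 1)
data = [(torch.randn(32, 8), torch.randn(32, 1)) for _ in range(4)]
lr_schedule = [0.1, 0.1, 0.01, 0.001]                # the "original" decaying schedule
early_weights = copy.deepcopy(model.state_dict())    # stands in for an epoch-k checkpoint
dense_mask = {n: torch.ones_like(p) for n, p in model.named_parameters()}
train(model, dense_mask, lr_schedule, data)          # dense training before pruning

mask = magnitude_mask(model, sparsity=0.5)

# 1) Fine-tuning: keep the trained weights, retrain at the final (small) learning rate.
finetuned = copy.deepcopy(model)
train(finetuned, mask, [lr_schedule[-1]] * len(lr_schedule), data)

# 2) Weight rewinding: restore the epoch-k weights, then replay the schedule.
rewound = copy.deepcopy(model)
rewound.load_state_dict(early_weights)
train(rewound, mask, lr_schedule, data)

# 3) Learning rate rewinding: keep the trained weights, but replay the schedule.
lr_rewound = copy.deepcopy(model)
train(lr_rewound, mask, lr_schedule, data)
```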

The ReproGen Shared Task on Reproducibility of Human Evaluations in NLG: Overview and Results

The NLP field has recently seen a substantial increase in work related to reproducibility of results, and more generally in recognition of the importance of having shared definitions and practices relating to evaluation. Much of the work on reproducibility has so far focused on metric scores, with reproducibility of human evaluation results receiving far less attention. As part of a research programme designed to develop theory and practice of reproducibility assessment in NLP, we organised the first shared task on reproducibility of human evaluations, ReproGen 2021. This paper describes the shared task in detail, summarises results from each of the reproduction studies submitted, and provides further comparative analysis of the results. Out of nine initial team registrations, we received submissions from four teams. Meta-analysis of the four reproduction studies revealed varying degrees of reproducibility, and allowed very tentative first conclusions about what types of evaluation tend to have better reproducibility.