Using Provenance for Generating Automatic Citations

When computational experiments include only datasets, they can be shared through Uniform Resource Identifiers (URIs) or Digital Object Identifiers (DOIs) that point to these resources. However, experiments seldom include only datasets; most often they also include software, execution results, provenance, and other associated documentation. The Research Object has recently emerged as a comprehensive and systematic method for aggregating and identifying the diverse elements of computational experiments. While an entire Research Object may be citable using a URI or a DOI, it is often desirable to cite specific sub-components of a research object to help identify, authorize, date, and retrieve them once published. In this paper, we present an approach to automatically generate citations for sub-components of research objects by using the object's recorded provenance traces. The generated citations can be used "as is" or taken as suggestions that can be grouped and combined to produce higher-level citations.
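
To make the idea concrete, the following is a minimal R sketch (ours, not drawn from the paper) of how a citation string might be assembled from a simple provenance-derived record describing one sub-component of a research object; the record fields, names, and URI are hypothetical.

    # Hypothetical sketch: derive a citation for one sub-component of a
    # Research Object from a minimal provenance-derived record.
    # The field names (creator, date, title, uri) are illustrative only,
    # not a standard provenance schema.
    cite_from_provenance <- function(prov) {
      sprintf("%s (%s). %s. Retrieved from %s",
              prov$creator, format(prov$date, "%Y"), prov$title, prov$uri)
    }

    prov <- list(
      creator = "Doe, J.",
      date    = as.Date("2016-03-01"),
      title   = "Derived data set (workflow output)",
      uri     = "https://example.org/ro/123#outputs/derived-dataset"
    )
    cat(cite_from_provenance(prov))

In practice, such records would be extracted from the research object's recorded provenance traces rather than written by hand, and several generated citations could then be grouped and combined into a single higher-level citation, as the abstract describes.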

Variable Bibliographic Database Access Could Limit Reproducibility

Bibliographic databases provide access to scientific literature through targeted queries. The most common uses of these services, aside from accessing scientific literature for personal use, are to find relevant citations for formal surveys of the scientific literature, such as systematic reviews or meta-analyses, or to estimate the number of publications on a certain topic as a measure of sampling effort. Bibliographic search tools vary in the level of access to the scientific literature they allow. For instance, Google Scholar is a bibliographic search engine that allows users to find (but not necessarily access) scientific literature at no charge, whereas other services, such as Web of Science, are subscription-based, allowing access to the full texts of academic works at costs that can exceed $100,000 annually for large universities (Goodman 2005). One of the most commonly used bibliographic databases, the Clarivate Analytics–produced Web of Science, offers tailored subscriptions to its citation indexing service. This flexibility allows subscriptions, and the resulting access, to be tailored to the needs of researchers at each institution (Goodwin 2014). However, this differential access creates issues, which we discuss further below.

Software Reproducibility: How to put it into practice?

On 24 May 2018, Maria Cruz, Shalini Kurapati, and Yasemin Türkyilmaz-van der Velden led a workshop titled “Software Reproducibility: How to put it into practice?”, as part of the event “Towards cultural change in data management - data stewardship in practice” held at TU Delft, the Netherlands. There were 17 workshop participants, including researchers, data stewards, and research software engineers. Here we describe the rationale for the workshop, what happened on the day, the key discussions and insights, and suggested next steps.

DataPackageR: Reproducible data preprocessing, standardization and sharing using R/Bioconductor for collaborative data analysis

A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software packages have been released to facilitate such work-flows, and scientific journals have increasingly demanded that code and primary data be made available with publications. However, there has been little practical advice on implementing reproducible research work-flows for the large 'omics' or systems biology data sets used by teams of analysts working in collaboration. In such settings it is important to ensure that all analysts use the same version of a data set for their analyses. Yet instantiating relational databases and standard operating procedures can be unwieldy, with high "startup" costs and poor adherence to procedures when they deviate substantially from an analyst's usual work-flow. Ideally, a reproducible research work-flow should fit naturally into an individual's existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open-source tools, including Bioconductor, Rmarkdown, git version control, and R, and specifically R's package system combined with a new tool, DataPackageR, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium-sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data is processed into analysis-ready data sets. The software ensures that packaged data objects are properly documented, performs checksum verification of these objects along with basic package version management, and, importantly, leaves a record of the data processing code in the form of package vignettes. Our group has used this work-flow for the past three years to manage, analyze, and report on pre-clinical immunological trial data from multi-center, multi-assay studies.
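
As a minimal sketch of the intended usage (following the work-flow described above: create a data-package skeleton from a processing script, then build the package), the following R code illustrates the two main steps; the package, file, and object names are illustrative, and the exact function arguments should be treated as assumptions that may differ across DataPackageR versions.

    library(DataPackageR)

    # Create a data-package skeleton whose processing code is an Rmd file and
    # whose documented output is a named analysis-ready R object.
    # (Package name, file name, and object name are illustrative.)
    datapackage_skeleton(
      name           = "MyStudyData",
      path           = tempdir(),
      code_files     = "process_raw_assay_data.Rmd",
      r_object_names = "analysis_ready_dataset"
    )

    # Run the processing code, store and document the data object, record
    # checksums of the packaged objects, and keep the rendered processing
    # report as a package vignette.
    package_build(file.path(tempdir(), "MyStudyData"))

Analysts then install the built package and load the versioned, checksummed data object like any other packaged R data set, which is what keeps every collaborator working from the same version of the data.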

How to Read a Research Compendium

Researchers spend a great deal of time reading research papers. Keshav (2012) provides researchers with a three-pass method to improve their reading skills. This article extends Keshav's method to the reading of a research compendium. Research compendia are an increasingly used form of publication, which package not only the research paper's text and figures, but also all data and software, for better reproducibility. We introduce the existing conventions for research compendia and suggest how to utilise their shared properties in a structured reading process. Unlike the original, this article is not built upon a long history but intends to provide guidance at the outset of an emerging practice.

Teaching computational reproducibility for neuroimaging

We describe a project-based introduction to reproducible and collaborative neuroimaging analysis. Traditional teaching on neuroimaging usually consists of a series of lectures that emphasize the big picture rather than the foundations on which the techniques are based. The lectures are often paired with practical workshops in which students run imaging analyses using the graphical interface of specific neuroimaging software packages. Our experience suggests that this combination leaves the student with a superficial understanding of the underlying ideas, and an informal, inefficient, and inaccurate approach to analysis. To address these problems, we based our course around a substantial open-ended group project. This allowed us to teach: (a) computational tools to ensure computationally reproducible work, such as the Unix command line, structured code, version control, automated testing, and code review; and (b) a clear understanding of the statistical techniques used for a basic analysis of a single run in an MRI scanner. The emphasis we put on the group project showed the importance of standard computational tools for accuracy, efficiency, and collaboration. The projects were broadly successful in engaging students in working reproducibly on real scientific questions. We propose that a course on this model should be the foundation for future programs in neuroimaging. We believe it will also serve as a model for teaching efficient and reproducible research in other fields of computational science.
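
As an illustration of the kind of basic single-run analysis the abstract refers to (the example is ours, not taken from the course materials), the following R sketch fits a general linear model to one simulated voxel time series, with a boxcar task regressor and a linear drift term, and reads off the estimated task effect.

    # Hypothetical sketch: GLM analysis of a single simulated voxel time series.
    set.seed(1)

    n_vols <- 100                                            # number of volumes
    task   <- rep(c(0, 1), each = 10, length.out = n_vols)   # boxcar task regressor
    drift  <- seq_len(n_vols) / n_vols                       # slow scanner drift (nuisance)

    # Simulated voxel signal: baseline + task effect + drift + noise
    voxel  <- 100 + 2.5 * task + 5 * drift + rnorm(n_vols, sd = 3)

    fit <- lm(voxel ~ task + drift)                          # general linear model
    summary(fit)$coefficients["task", ]                      # estimate, SE, t value, p value

In a real analysis the same model is fit at every voxel and the resulting statistics are assembled into a statistical map; the course described here pairs this statistical understanding with the computational tools listed above (version control, testing, code review) so that the whole analysis remains reproducible.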