Supporting accessibility and reproducibility in language research in the Alveo virtual laboratory

Reproducibility is an important part of scientific research and studies published in speech and language research usually make some attempt at ensuring that the work reported could be reproduced by other researchers. This paper looks at the current practice in the field relating to the citation and availability of both data and software methods. It is common to use widely available shared datasets in this field which helps to ensure that studies can be reproduced; however a brief survey of recent papers shows a wide range of styles of citation of data only some of which clearly identify the exact data used in the study. Similarly, practices in describing and sharing software artefacts vary considerably from detailed descriptions of algorithms to linked repositories. The Alveo Virtual Laboratory is a web based platform to support research based on collections of text, speech and video. Alveo provides a central repository for language data and provides a set of services for discovery and analysis of data. We argue that some of the features of the Alveo platform may make it easier for researchers to share their data more precisely and cite the exact software tools used to develop published results. Alveo makes use of ideas developed in other areas of science and we discuss these and how they can be applied to speech and language research.

Challenges of archiving and preserving born-digital news applications

Born-digital news content is increasingly becoming the format of the first draft of history. Archiving and preserving this history is of paramount importance to the future of scholarly research, but many technical, legal, financial, and logistical challenges stand in the way of these efforts. This is especially true for news applications, or custom-built websites that comprise some of the most sophisticated journalism stories today, such as the “Dollars for Docs” project by ProPublica. Many news applications are standalone pieces of software that query a database, and this significant subset of apps cannot be archived in the same way as text-based news stories, or fully captured by web archiving tools such as Archive-It. As such, they are currently disappearing. This paper will outline the various challenges facing the archiving and preservation of born-digital news applications, as well as outline suggestions for how to approach this important work.

Facilitating reproducible research by investigating computational metadata

Computational workflows consist of a series of steps in which data is generated, manipulated, analysed and transformed. Researchers use tools and techniques to capture the provenance associated with the data to aid reproducibility. The metadata collected not only helps in reproducing the computation but also aids in comparing the original and reproduced computations. In this paper, we present an approach, "Why-Diff", to analyse the difference between two related computations by changing the artifacts and how the existing tools "YesWorkflow" and "NoWorkflow" record the changed artifacts.

Using Docker Containers to Extend Reproducibility Architecture for the NASA Earth Exchange (NEX)

NASA Earth Exchange (NEX) is a data, supercomputing and knowledge collaboratory that houses NASA satellite, climate and ancillary data where a focused community can come together to address large-scale challenges in Earth sciences. As NEX has been growing into a petabyte-size platform for analysis, experiments and data production, it has been increasingly important to enable users to easily retrace their steps, identify what datasets were produced by which process chains, and give them ability to readily reproduce their results. This can be a tedious and difficult task even for a small project, but is almost impossible on large processing pipelines. We have developed an initial reproducibility and knowledge capture solution for the NEX, however, if users want to move the code to another system, whether it is their home institution cluster, laptop or the cloud, they have to find, build and install all the required dependencies that would run their code. This can be a very tedious and tricky process and is a big impediment to moving code to data and reproducibility outside the original system. The NEX team has tried to assist users who wanted to move their code into OpenNEX on Amazon cloud by creating custom virtual machines with all the software and dependencies installed, but this, while solving some of the issues, creates a new bottleneck that requires the NEX team to be involved with any new request, updates to virtual machines and general maintenance support. In this presentation, we will describe a solution that integrates NEX and Docker to bridge the gap in code-to-data migration. The core of the solution is saemi-automatic conversion of science codes, tools and services that are already tracked and described in the NEX provenance system, to Docker - an open-source Linux container software. Docker is available on most computer platforms, easy to install and capable of seamlessly creating and/or executing any application packaged in the appropriate format. We believe this is an important step towards seamless process deployment in heterogeneous environments that will enhance community access to NASA data and tools in a scalable way, promote software reuse, and improve reproducibility of scientific results.

Computational Analysis of Lifespan Experiment Reproducibility

Independent reproducibility is essential to the generation of scientific knowledge. Optimizing experimental protocols to ensure reproducibility is an important aspect of scientific work. Genetic or pharmacological lifespan extensions are generally small compared to the inherent variability in mean lifespan even in isogenic populations housed under identical conditions. This variability makes reproducible detection of small but real effects experimentally challenging. In this study, we aimed to determine the reproducibility of C. elegans lifespan measurements under ideal conditions, in the absence of methodological errors or environmental or genetic background influences. To accomplish this, we generated a parametric model of C. elegans lifespan based on data collected from 5,026 wild-type N2 animals. We use this model to predict how different experimental practices, effect sizes, number of animals, and how different ‘shapes’ of survival curves affect the ability to reproduce real longevity effects. We find that the chances of reproducing real but small effects are exceedingly low and would require substantially more animals than are commonly used. Our results indicate that many lifespan studies are underpowered to detect reported changes and that, as a consequence, stochastic variation alone can account for many failures to reproduce longevity results. As a remedy, we provide power of detection tables that can be used as guidelines to plan experiments with statistical power to reliably detect real changes in lifespan and limit spurious false positive results. These considerations will improve best-practices in designing lifespan experiment to increase reproducibility.