This data article introduces a reproducibility dataset with the aim of allowing the exact replication of all experiments, results and data tables introduced in our companion paper (Lastra-Díaz et al., 2019), which introduces the largest experimental survey on ontology-based semantic similarity methods and Word Embeddings (WE) for word similarity reported in the literature. The implementation of all our experiments, as well as the gathering of all raw data derived from them, was based on the software implementation and evaluation of all methods in HESML library (Lastra-Díaz et al., 2017), and their subsequent recording with Reprozip (Chirigati et al., 2016). Raw data is made up by a collection of data files gathering the raw word-similarity values returned by each method for each word pair evaluated in any benchmark. Raw data files was processed by running a R-language script with the aim of computing all evaluation metrics reported in (Lastra-Díaz et al., 2019), such as Pearson and Spearman correlation, harmonic score and statistical significance p-values, as well as to generate automatically all data tables shown in our companion paper. Our dataset provides all input data files, resources and complementary software tools to reproduce from scratch all our experimental data, statistical analysis and reported data. Finally, our reproducibility dataset provides a self-contained experimentation platform which allows to run new word similarity benchmarks by setting up new experiments including other unconsidered methods or word similarity benchmarks.
Reproducibility in the computational sciences has been stymied because of the complex and rapidly changing computational environments in which modern research takes place. While many will espouse reproducibility as a value, the challenge of making it happen (both for themselves and testing the reproducibility of others' work) often outweigh the benefits. There have been a few reproducibility solutions designed and implemented by the community. In particular, the authors are contributors to ReproZip, a tool to enable computational reproducibility by tracing and bundling together research in the environment in which it takes place (e.g. one's computer or server). In this white paper, we introduce a tool for unpacking ReproZip bundles in the cloud, ReproServer. ReproServer takes an uploaded ReproZip bundle (.rpz file) or a link to a ReproZip bundle, and users can then unpack them in the cloud via their browser, allowing them to reproduce colleagues' work without having to install anything locally. This will help lower the barrier to reproducing others' work, which will aid reviewers in verifying the claims made in papers and reusing previously published research.
Achieving research reproducibility is challenging in many ways: there are social and cultural obstacles as well as a constantly changing technical landscape that makes replicating and reproducing research difficult. Users face challenges in reproducing research across different operating systems, in using different versions of software across long projects and among collaborations, and in using publicly available work. The dependencies required to reproduce the computational environments in which research happens can be exceptionally hard to track – in many cases, these dependencies are hidden or nested too deeply to discover, and thus impossible to install on a new machine, which means adoption remains low. In this paper, we present ReproZip, an open source tool to help overcome the technical difficulties involved in preserving and replicating research, applications, databases, software, and more. We examine the current use cases of ReproZip, ranging from digital humanities to machine learning. We also explore potential library use cases for ReproZip, particularly in digital libraries and archives, liaison librarianship, and other library services. We believe that libraries and archives can leverage ReproZip to deliver more robust reproducibility services, repository services, as well as enhanced discoverability and preservation of research materials, applications, software, and computational environments.
Over the past few years, research reproducibility has been increasingly highlighted as a multifaceted challenge across many disciplines. There are socio-cultural obstacles as well as a constantly changing technical landscape that make replicating and reproducing research extremely difficult. Researchers face challenges in reproducing research across different operating systems and different versions of software, to name just a few of the many technical barriers. The prioritization of citation counts and journal prestige has undermined incentives to make research reproducible. While libraries have been building support around research data management and digital scholarship, reproducibility is an emerging area that has yet to be systematically addressed. To respond to this, New York University (NYU) created the position of Librarian for Research Data Management and Reproducibility (RDM & R), a dual appointment between the Center for Data Science (CDS) and the Division of Libraries. This report will outline the role of the RDM & R librarian, paying close attention to the collaboration between the CDS and Libraries to bring reproducible research practices into the norm.
A preview of ReproZip by Vicky Steeves at the PresQT Workshop May 1, 2017 at the University of Notre Dame.
Introduce the python environment wrapper and packing tools; virtualenv & pip. Show you how you can stay up to date by using in requires.io egg security and update checking. Cover Fabric a python deployment tool and wider systems and workflow replication with Vagrant and Reprozip.If time allowing touch upon test driven development and adding Travis to your project.