We replicate and analyze the topic model which was commissioned to King’s College and Digital Science for the Research Evaluation Framework (REF 2014) in the United Kingdom: 6,638 case descriptions of societal impact were submitted by 154 higher-education institutes. We compare the Latent Dirichlet Allocation (LDA) model with Principal Component Analysis (PCA) of document-term matrices using the same data. Since topic models are almost by definition applied to text corpora which are too large to read, validation of the results of these models is hardly possible; furthermore the models are irreproducible for a number of reasons. However, removing a small fraction of the documents from the sample—a test for reliability—has on average a larger impact in terms of decay on LDA than on PCA-based models. The semantic coherence of LDA models outperforms PCA-based models. In our opinion, results of the topic models are statistical and should not be used for grant selections and micro decision-making about research without follow-up using domain-specific semantic maps.
The pressures of a scientific career can end up incentivising an all‐or‐nothing approach to cross the finish line first. While competition can be healthy and drives innovation, the current system fails to encourage scientists to work reproducibility. This sometimes leaves those individuals who come second to correct mistakes in published research without being rewarded. Instead, we need a culture that rewards reproducibility and holds it as important as the novelty of the result. Here, I draw on my own journey in the oestrogen receptor research field to highlight this and suggest ways for the 'first past the post' culture to be challenged.
Although many works in the database community use open data in their experimental evaluation, repeating the empirical results of previous works remains a challenge. This holds true even if the source code or binaries of the tested algorithms are available. In this paper, we argue that providing access to the raw, original datasets is not enough. Real-world datasets are rarely processed without modification. Instead, the data is adapted to the needs of the experimental evaluation in the data preparation process. We showcase that the details of the data preparation process matter and subtle differences during data conversion can have a large impact on the outcome of runtime results. We introduce a data reproducibility model, identify three levels of data reproducibility, report about our own experience, and exemplify our best practices.
Reproducibility plays a crucial role in experimentation. However, the modern research ecosystem and the underlying frameworks are constantly evolving and thereby making it extremely difficult to reliably reproduce scientific artifacts such as data, algorithms, trained models and visual-izations. We therefore aim to design a novel system for assisting data scientists with rigorous end-to-end documentation of data-oriented experiments. Capturing data lineage, metadata, andother artifacts helps reproducing and sharing experimental results. We summarize this challenge as automated documentation of data science experiments. We aim at reducing manualoverhead for experimenting researchers, and intend to create a novel approach in dataflow and metadata tracking based on the analysis of the experiment source code. The envisioned system will accelerate the research process in general, andenable capturing fine-grained meta information by deriving a declarative representation of data science experiments.
Lee et al. (2019) make several practical recommendations for replicable, useful cognitive modeling. They also point out that the ultimate test of the usefulness of a cognitive model is its ability to solve practical problems. In this commentary, we argue that for cognitive modeling to reach applied domains, there is a pressing need to improve the standards of transparency and reproducibility in cognitive modelling research. Solution-oriented modeling requires engaging practitioners who understand the relevant domain. We discuss mechanisms by which reproducible research can foster engagement with applied practitioners. Notably, reproducible materials provide a start point for practitioners to experiment with cognitive models and determine whether those models might be suitable for their domain of expertise. This is essential because solving complex problems requires exploring a range of modeling approaches, and there may not time to implement each possible approach from the ground up. We also note the broader benefits to reproducibility within the field.
Preparation of data sets for analysis is a critical component of research in many disciplines. Recording the steps taken to clean data sets is equally crucial if such research is to be transparent and results reproducible. OpenRefine is a tool for interactively cleaning data sets via a spreadsheet-like interface and for recording the sequence of operations carried out by the user. OpenRefine uses its operation history to provide an undo/redo capability that enables a user to revisit the state of the data set at any point in the data cleaning process. OpenRefine additionally allows the user to export sequences of recorded operations as recipes that can be applied later to different data sets. Although OpenRefine internally records details about every change made to a data set following data import, exported recipes do not include the initial data import step. Details related to parsing the original data files are not included. Moreover, exported recipes do not include any edits made manually to individual cells. Consequently, neither a single recipe, nor a set of recipes exported by OpenRefine, can in general represent an entire, end-to-end data preparation workflow. Here we report early results from an investigation into how the operation history recorded by OpenRefine can be used to (1) facilitate reproduction of complete, real-world data cleaning workflows; and (2) support queries and visualizations of the provenance of cleaned data sets for easy review.