Whenever high-performance computing applications meet data-intensive scalable systems, an attractive approach is the use of Apache Spark for the management of scientific workflows. Spark provides several advantages such as being widely supported and granting efficient in-memory data management for large-scale applications. However, Spark still lacks support for data tracking and workflow provenance. Additionally, Spark’s memory management requires accessing all data movements between the workflow activities. Therefore, the running of legacy programs on Spark is interpreted as a "black-box" activity, which prevents the capture and analysis of implicit data movements. Here, we present SAMbA, an Apache Spark extension for the gathering of prospective and retrospective provenance and domain data within distributed scientific workflows. Our approach relies on enveloping both RDD structure and data contents at runtime so that (i) RDD-enclosure consumed and produced data are captured and registered by SAMbA in a structured way, and (ii) provenance data can be queried during and after the execution of scientific workflows. By following the W3C PROV representation, we model the roles of RDD regarding prospective and retrospective provenance data. Our solution provides mechanisms for the capture and storage of provenance data without jeopardizing Spark’s performance. The provenance retrieval capabilities of our proposal are evaluated in a practical case study, in which data analytics are provided by several SAMbA parameterizations.
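The enveloping idea in points (i) and (ii) can be illustrated with a minimal sketch. It is not SAMbA's API and does not use Spark: `TrackedDataset` is a hypothetical plain-Python stand-in for an RDD, and `ProvenanceStore` is a hypothetical in-memory log whose `used`/`generated` fields loosely follow W3C PROV vocabulary. Each transformation returns a new wrapped dataset and registers which entity it consumed and which it produced, so lineage can be queried afterwards.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class ProvenanceStore:
    """Hypothetical in-memory log of retrospective provenance records."""
    activities: list = field(default_factory=list)

    def record(self, activity, used, generated):
        # One record per executed transformation: which entity it consumed
        # and which entity it produced.
        self.activities.append(
            {"activity": activity, "used": used, "generated": generated}
        )

    def lineage(self, entity_id):
        """Return the activity names that led to entity_id, oldest first."""
        by_output = {a["generated"]: a for a in self.activities}
        chain = []
        while entity_id in by_output:
            step = by_output[entity_id]
            chain.append(step["activity"])
            entity_id = step["used"]
        return chain[::-1]


class TrackedDataset:
    """Hypothetical stand-in for an RDD that envelops its data contents."""
    _ids = itertools.count()

    def __init__(self, items, store):
        self.items = list(items)
        self.store = store
        self.entity_id = f"dataset-{next(TrackedDataset._ids)}"

    def _derive(self, activity, items):
        # Wrap the output and register the consumed/produced pair.
        out = TrackedDataset(items, self.store)
        self.store.record(activity, used=self.entity_id, generated=out.entity_id)
        return out

    def map(self, fn, name="map"):
        return self._derive(name, (fn(x) for x in self.items))

    def filter(self, pred, name="filter"):
        return self._derive(name, (x for x in self.items if pred(x)))


if __name__ == "__main__":
    store = ProvenanceStore()
    raw = TrackedDataset([1, 2, 3, 4], store)
    result = raw.map(lambda x: x * x, name="square").filter(
        lambda x: x > 4, name="keep_large"
    )
    print(result.items, store.lineage(result.entity_id))
```

Because the store is updated as each transformation runs, a query such as `store.lineage(...)` could in principle be issued while a workflow is still executing as well as after it finishes.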
There has been a recent major upsurge in concerns about reproducibility in many areas of science. Within the neuroimaging domain, one approach to promoting reproducibility is to target the re-executability of the publication. The information supporting such re-executability can enable the detailed examination of how an initial finding generalizes across changes in the processing approach and the sampled population, in a controlled scientific fashion. ReproNim: A Center for Reproducible Neuroimaging Computation is a recently funded initiative that seeks to facilitate the ‘last mile’ implementations of core re-executability tools in order to reduce the accessibility barrier and increase adoption of standards and best practices at the neuroimaging research laboratory level. In this report, we summarize the overall approach and tools we have developed in this domain.
Increasingly complex statistical models are being used for the analysis of biological data. Recent commentary has focused on the ability to compute the same outcome for a given dataset (reproducibility). We argue that a reproducible statistical analysis is not necessarily valid because of unique patterns of nonindependence in every biological dataset. We advocate that analyses should be evaluated with known-truth simulations that capture biological reality, a process we call “analysis validation.” We review the process of validation and suggest criteria that a validation project should meet. We find that different fields of science have historically failed to meet all criteria, and we suggest ways to implement meaningful validation in training and practice.
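A known-truth simulation of the kind advocated above can be sketched as follows. This is an illustrative example, not the authors' procedure: the clustered design, the parameter values, and all function names are assumptions. Data are generated with a true effect of zero but with observations nested in clusters (one simple form of nonindependence), and two analyses are scored by how often they wrongly report an effect. The critical values are approximations (1.96 from the normal distribution, 2.306 from a t distribution with 8 degrees of freedom).

```python
import random
import statistics


def simulate_dataset(rng, n_clusters=10, per_cluster=20,
                     cluster_sd=1.0, noise_sd=1.0):
    """Known truth: both arms share the same mean (true effect = 0).

    Observations within a cluster share a random shift, so they are
    not independent of one another.
    """
    arms = {0: [], 1: []}
    for c in range(n_clusters):
        shift = rng.gauss(0.0, cluster_sd)
        arms[c % 2].append(
            [shift + rng.gauss(0.0, noise_sd) for _ in range(per_cluster)]
        )
    return arms


def welch_t(a, b):
    """Welch's t statistic for two samples."""
    se = (statistics.variance(a) / len(a)
          + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def naive_analysis(arms, crit=1.96):
    """Treats every observation as independent (ignores clustering)."""
    a = [x for cluster in arms[0] for x in cluster]
    b = [x for cluster in arms[1] for x in cluster]
    return abs(welch_t(a, b)) > crit


def cluster_mean_analysis(arms, crit=2.306):
    """Compares cluster means, so the unit of analysis is the cluster."""
    a = [statistics.fmean(cluster) for cluster in arms[0]]
    b = [statistics.fmean(cluster) for cluster in arms[1]]
    return abs(welch_t(a, b)) > crit


def false_positive_rates(n_sims=300, seed=1):
    """Fraction of simulated null datasets in which each analysis 'finds' an effect."""
    rng = random.Random(seed)
    hits = {"naive": 0, "cluster_means": 0}
    for _ in range(n_sims):
        arms = simulate_dataset(rng)
        hits["naive"] += naive_analysis(arms)
        hits["cluster_means"] += cluster_mean_analysis(arms)
    return {name: count / n_sims for name, count in hits.items()}


if __name__ == "__main__":
    print(false_positive_rates())
```

Both analyses are fully reproducible given the seed, yet under these assumptions only the cluster-level analysis should keep its false-positive rate near the nominal 5%, which is the distinction between reproducibility and validity drawn above.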
WINGS enables researchers to submit complete semantic workflows as challenge submissions. By submitting entries as workflows, it then becomes possible to compare not just the results and performance of a challenger, but also the methodology employed. This is particularly important when dozens of challenge entries may use nearly identical tools, but with only subtle changes in parameters (and radical differences in results). WINGS uses a component-driven workflow design and offers intelligent parameter and data selection by reasoning about data characteristics.
Presentation from the Singapore Meeting on Research Integrity. Reproducibility: Research integrity but much, much more.
Mechanobiology, the study of how mechanical forces affect cellular behavior, is an emerging field of study that has garnered broad and significant interest. Researchers are currently seeking to better understand how mechanical signals are transmitted, detected, and integrated at a subcellular level. One tool for addressing these questions is a Förster resonance energy transfer (FRET)‐based tension sensor, which enables the measurement of molecular‐scale forces across proteins based on changes in emitted light. However, the reliability and reproducibility of measurements made with these sensors have not been thoroughly examined. To address these concerns, we developed numerical methods that improve the accuracy of measurements made using sensitized emission‐based imaging. To establish that FRET‐based tension sensors are versatile tools that provide consistent measurements, we used these methods and demonstrated that a vinculin tension sensor is unperturbed by cell fixation, permeabilization, and immunolabeling. This suggests that FRET‐based tension sensors could be coupled with a variety of immuno‐fluorescent labeling techniques. Additionally, as tension sensors are frequently employed in complex biological samples where large numbers of experimental repeats may be challenging, we examined how sample size affects the uncertainty of FRET measurements. In total, this work establishes guidelines to improve FRET‐based tension sensor measurements, validate novel implementations of these sensors, and ensure that results are precise and reproducible.
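The dependence of measurement uncertainty on sample size can be illustrated with a short sketch. It is not the authors' numerical method: the bootstrap approach, the Gaussian model, and the values `true_mean=0.25` and `sd=0.05` for a FRET efficiency are all hypothetical choices made for illustration. The sketch draws synthetic samples of increasing size and reports the width of a 95% bootstrap confidence interval for the mean.

```python
import random
import statistics


def bootstrap_ci_width(values, rng, n_boot=500, alpha=0.05):
    """Width of a percentile bootstrap confidence interval for the mean."""
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return upper - lower


def ci_width_by_sample_size(sample_sizes, true_mean=0.25, sd=0.05, seed=0):
    """Simulate one synthetic experiment per sample size and return CI widths.

    true_mean and sd are hypothetical FRET efficiency parameters.
    """
    rng = random.Random(seed)
    widths = {}
    for n in sample_sizes:
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        widths[n] = bootstrap_ci_width(sample, rng)
    return widths


if __name__ == "__main__":
    for n, width in ci_width_by_sample_size([10, 100, 1000]).items():
        print(f"n={n:5d}  95% CI width = {width:.4f}")
```

For independent measurements the interval width shrinks roughly in proportion to 1/sqrt(n), so a tenfold increase in sample size narrows the interval by only about a factor of three. That trade-off matters when experimental repeats are expensive.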