SciPipe: A workflow library for agile development of complex and dynamic bioinformatics pipelines

Background: The complex nature of biological data has driven the development of specialized software tools. Scientific workflow management systems simplify the assembly of such tools into pipelines, assist with job automation, and aid reproducibility of analyses. Many contemporary workflow tools are specialized or not designed for highly complex workflows, such as with nested loops, dynamic scheduling, and parametrization, which is common in, e.g., machine learning. Findings: SciPipe is a workflow programming library implemented in the programming language Go, for managing complex and dynamic pipelines in bioinformatics, cheminformatics, and other fields. SciPipe helps in particular with workflow constructs common in machine learning, such as extensive branching, parameter sweeps, and dynamic scheduling and parametrization of downstream tasks. SciPipe builds on flow-based programming principles to support agile development of workflows based on a library of self-contained, reusable components. It supports running subsets of workflows for improved iterative development and provides a data-centric audit logging feature that saves a full audit trace for every output file of a workflow, which can be converted to other formats such as HTML, TeX, and PDF on demand. The utility of SciPipe is demonstrated with a machine learning pipeline, a genomics, and a transcriptomics pipeline. Conclusions: SciPipe provides a solution for agile development of complex and dynamic pipelines, especially in machine learning, through a flexible application programming interface suitable for scientists used to programming or scripting.

Survey on Scientific Shared Resource Rigor and Reproducibility

Shared scientific resources, also known as core facilities, support a significant portion of the research conducted at biomolecular research institutions. The Association of Biomolecular Resource Facilities (ABRF) established the Committee on Core Rigor and Reproducibility (CCoRRe) to further its mission of integrating advanced technologies, education, and communication in the operations of shared scientific resources in support of reproducible research. In order to first assess the needs of the scientific shared resource community, the CCoRRe solicited feedback from ABRF members via a survey. The purpose of the survey was to gain information on how U.S. National Institutes of Health (NIH) initiatives on advancing scientific rigor and reproducibility influenced current services and new technology development. In addition, the survey aimed to identify the challenges and opportunities related to implementation of new reporting requirements and to identify new practices and resources needed to ensure rigorous research. The results revealed a surprising unfamiliarity with the NIH guidelines. Many of the perceived challenges to the effective implementation of best practices (i.e., those designed to ensure rigor and reproducibility) were similarly noted as a challenge to effective provision of support services in a core setting. Further, most cores routinely use best practices and offer services that support rigor and reproducibility. These services include access to well-maintained instrumentation and training on experimental design and data analysis as well as data management. Feedback from this survey will enable the ABRF to build better educational resources and share critical best-practice guidelines. These resources will become important tools to the core community and the researchers they serve to impact rigor and transparency across the range of science and technology.

Foregrounding data curation to foster reproducibility of workflows and scientific data reuse

Scientific data reuse requires careful curation and annotation of the data. Late stage curation activities foster FAIR principles which include metadata standards for making data findable, accessible, interoperable and reusable. However, in scientific domains such as biomolecular nuclear magnetic resonance spectroscopy, there is a considerable time lag (usually more than a year) between data creation and data deposition. It is simply not feasible to backfill the required metadata so long after the data has been created (anything not carefully recorded is forgotten) – curation activities must begin closer to (if not at the point of) data creation. The need for foregrounding data curation activities is well known. However, scientific disciplines which rely on complex experimental design, sophisticated instrumentation, and intricate processing workflows, require extra care. The knowledge gap investigated by this research proposal is to identify classes of important metadata which are hidden within the tacit knowledge of a scientist when constructing an experiment, hidden within the operational specifications of the scientific instrumentation, and hidden within the design / execution of processing workflows. Once these classes of hidden knowledge have been identified, it will be possible to explore mechanisms for preventing the loss of key metadata, either through automated conversion from existing metadata or through curation activities at the time of data creation. The first step of the research plan is to survey artifacts of scientific data creation. That is, (i) existing data files with accompanying metadata, (ii) workflows and scripts for data processing, and (iii) documentation for software and scientific instrumentation. The second step is to group, categorize, and classify the types of "hidden" knowledge discovered. 
For example, one class of hidden knowledge already uncovered is the implicit recording of data as its reciprocal rather than the value itself, as in magnetogyric versus gyromagnetic ratios. The third step is to design/propose classes of solutions for these classes of problems. For instance, reciprocals are often helped by being explicit with units of measurement. Careful design of metadata display and curation widgets can help expose and document tacit knowledge which would otherwise be lost.

Lack of Reproducibility in Addiction Medicine

Background and aims: Credible research emphasizes transparency, openness, and reproducibility. These characteristics are fundamental to promoting and maintaining research integrity. The aim of this study was to evaluate the current state of transparency and reproducibility in the field of addiction science. Design: Cross-sectional design. Measurements: This study used the National Library of Medicine catalog to search for all journals using the subject terms tag: Substance-Related Disorders [ST]. Journals were then searched via PubMed in the timeframe of January 1, 2014 to December 31, 2018 and 300 publications were randomly selected. A pilot-tested Google form containing reproducibility/transparency characteristics was used for data extraction by two investigators who performed this process in a duplicate and blinded fashion. Findings: Slightly more than half of the publications were open access (152/293, 50.7%). Few publications had pre-registration (7/244, 2.87%), material availability (2/237, 1.23%), protocol availability (3/244, 0.80%), data availability (28/244, 11.48%), and analysis script availability (2/244, 0.82%). Most publications provided a conflict of interest statement (221/293, 75.42%) and funding sources (268/293, 91.47%). One replication study was reported (1/244, 0.04%). Few publications were cited (64/238, 26.89%) and 0 were excluded from meta-analyses and/or systematic reviews. Conclusion: Our study found that current practices that promote transparency and reproducibility are lacking, and thus, there is much room for improvement. First, investigators should preregister studies prior to commencement. Researchers should also make the materials, data, and analysis scripts publicly available. To foster reproducibility, individuals should remain transparent about funding sources for the project and financial conflicts of interest. Research stakeholders should work together toward improved solutions on these matters.
With these protections in place, the field of addiction medicine can lead in dissemination of information necessary to treat patients.

On the Index of Repeatability: Estimation and Sample Size Requirements

Background: Repeatability is a statement on the magnitude of measurement error. When biomarkers are used for disease diagnoses, they should be measured accurately. Objectives: We derive an index of repeatability based on the ratio of two variance components. Estimation of the index is derived from the one-way Analysis of Variance table based on the one-way random effects model. We estimate the large sample variance of the estimator and assess its adequacy using bootstrap methods. An important requirement for valid estimation of repeatability is the availability of multiple observations on each subject taken by the same rater and under the same conditions. Methods: We use the delta method to derive the large sample variance of the estimate of the repeatability index. The question of the number of required repeats per subject is answered by two methods. In the first method we estimate the number of repeats that minimizes the variance of the estimated repeatability index, and the second determines the number of repeats needed under cost constraints. Results and Novel Contribution: The situation when the measurements do not follow a Gaussian distribution will be dealt with. It is shown that the required sample size is quite sensitive to the relative cost. We illustrate the methodologies on the Serum Alanine-aminotransferase (ALT) available from hospital registry data for samples of males and females. Repeatability is higher among females in comparison to males.
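
The abstract above does not state the formula for its index, so the following is only a minimal sketch under an assumption: that the index takes the familiar intraclass-correlation form, between-subject variance divided by total variance, with both components estimated from the mean squares of a balanced one-way ANOVA table. The function name `repeatability_index` and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean


def repeatability_index(data):
    """Estimate sigma_b^2 / (sigma_b^2 + sigma_w^2) from balanced repeats.

    ASSUMPTION: the index is the one-way random effects intraclass
    correlation; the source abstract does not give its exact definition.
    `data` is a list of per-subject lists, each holding k repeats.
    """
    n = len(data)
    k = len(data[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in data):
        raise ValueError("need >= 2 subjects, each with the same k >= 2 repeats")

    grand_mean = mean(x for row in data for x in row)
    subject_means = [mean(row) for row in data]

    # Between-subject mean square (n - 1 degrees of freedom).
    msb = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subject mean square (n * (k - 1) degrees of freedom).
    msw = sum(
        (x - m) ** 2 for row, m in zip(data, subject_means) for x in row
    ) / (n * (k - 1))

    # Method-of-moments variance components from the ANOVA table.
    sigma_b2 = (msb - msw) / k
    total = sigma_b2 + msw
    if total == 0:
        raise ValueError("no variation in the data; index is undefined")
    return sigma_b2 / total


if __name__ == "__main__":
    # Made-up measurements: 3 subjects, 2 repeats each.
    toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(repeatability_index(toy))
```

Because the between-subject component is obtained by subtraction, this estimator can come out negative when the within-subject mean square exceeds the between-subject one. The sketch returns the raw value rather than truncating it at zero.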

Reproducible research into human semiochemical cues and pheromones: learning from psychology’s renaissance

As with other mammals, smell in the form of semiochemicals is likely to influence the behaviour of humans, as olfactory cues to emotions, health, and mate choice. A subset of semiochemicals, pheromones, chemical signals within a species, have been identified in many mammal species. As mammals, we may have pheromones too. Sadly, the story of molecules claimed to be ‘putative human pheromones’ is a classic example of bad science carried out by good scientists. Much of human semiochemicals research including work on ‘human pheromones’ and olfactory cues comes within the field of psychology. Thus, the research is highly likely to be affected by the ‘reproducibility crisis’ in psychology and other life sciences. Psychology researchers have responded with proposals to enable better, more reliable science, with an emphasis on enhancing reproducibility. A key change is the adoption of study pre-registration which will also reduce publication bias. Human semiochemicals research would benefit from adopting these proposals.