Reproducibility of Data Collection and Analysis – Modern Technologies in Genome Technology: Potentials and Pitfalls

@jxtx / #methodsmatter Analysis Reproducibility https://speakerdeck.com/jxtx

Data pipeline, inspired by Leek and Peng, Nature 2015
Stages: Idea, Experiment, Raw data, Tidy data, Summarized data, Results
Steps between stages: Experimental design, Data collection, Data cleaning, Data analysis, Inference
Figure annotations: "The part we are considering here" and "The part that ends up in the publication"

Questions one might ask about a published analysis
Is the analysis as described correct?
Was the analysis performed as described?

What is reproducibility? (for computational analyses)
Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced
Reproducibility is not provenance, reusability/generalizability, or correctness
A minimum standard for evaluating analyses

A minimum standard for evaluating analyses
Yet most published analyses are not reproducible
Ioannidis et al. 2009: 6/18 microarray experiments reproducible
Nekrutenko and Taylor 2012: 7/50 re-sequencing experiments reproducible
Vasilevsky et al. 2014: 6/41 cancer biology experiments reproducible*
Missing software, versions, parameters, data…

Vasilevsky, Nicole; Kavanagh, David J; Deusen, Amy Van; Haendel, Melissa; Iorns, Elizabeth (2014): Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130
32/127 tools, 6/41 papers

[Figure: frequency fluctuation for site 8992 (y-axis, about 0.480 to 0.510) across software versions 5.2 through 6.1 (x-axis), with default settings and with -n 3 -q 15. Nekrutenko and Taylor, Nature Reviews Genetics, 2012]

Core reproducibility tasks
1. Capture the precise description of the experiment (either as it is being carried out, or after the fact)
2. Assemble all of the necessary data and software dependencies needed by the described experiment
3. Combine the above to verify the analysis

Is reproducibility achievable?

A spectrum of solutions
Analysis environments (Galaxy, GenePattern, Mobyle, …)
Workflow systems (Taverna, Pegasus, VisTrails, …)
Notebook style (iPython notebook, …)
Literate programming style (Sweave/knitr, …)
System level provenance capture (ReproZip, …)
Complete environment capture (VMs, containers, …)

Analyses can now easily be packaged with whatever software is needed to run them.
“It only works on my system” is NO LONGER AN ACCEPTABLE EXCUSE

Even partial reproducibility is better than nothing
Striving for reproducibility makes methods more transparent and understandable, leading to better science

Recommendations for performing reproducible computational research
Nekrutenko and Taylor, Nature Reviews Genetics, 2012
Sandve, Nekrutenko, Taylor and Hovig, PLoS Computational Biology 2013

1. Accept that computation is an integral component of biomedical research. Familiarize yourself with best practices of scientific computing, and implement good computational practices in your group.

2. Always provide access to raw primary data

3. Record versions of all auxiliary datasets used in analysis. Many analyses require data such as genome annotations from external databases that change regularly; either record versions or store a copy of the specific data used.
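
One way to act on this recommendation is to store a checksum manifest next to the analysis. The following is a minimal sketch, not part of the talk: the function names, the manifest layout, and the "release" label are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths, release_labels=None):
    """Map each dataset file name to its size, checksum, and release label.

    release_labels is a hypothetical dict of file name -> version string
    (for example the release name of an annotation database).
    """
    release_labels = release_labels or {}
    manifest = {}
    for path in map(Path, paths):
        manifest[path.name] = {
            "bytes": path.stat().st_size,
            "sha256": sha256_of_file(path),
            "release": release_labels.get(path.name, "unknown"),
        }
    return manifest
```

Re-running build_manifest later and comparing the checksums shows whether an auxiliary file has changed since the analysis was performed.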

4. Store the exact versions of all software used. Ideally archive the software to ensure it can be recovered later.
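
For analyses written in Python, part of this record can be captured automatically. The sketch below is illustrative only: capture_environment is a hypothetical helper, and it covers the interpreter and installed Python packages, not external command-line tools or archiving of the software itself.

```python
import platform
import sys
from importlib import metadata


def capture_environment(package_names):
    """Record interpreter, platform, and installed versions of named packages."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record explicitly that the package was not installed.
            versions[name] = None
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }
```

Saving the returned dictionary alongside the results gives a reader the version information that the slide notes is often missing from published analyses.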

5. Record all parameters, even if default values are used. Default settings can change over time and determining what those settings were later can sometimes be difficult.
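
A script can make this automatic by logging the fully resolved parameter set rather than only the flags the user typed. This is a minimal sketch with a hypothetical analysis step; the option names and default values are invented for illustration and do not correspond to any real tool.

```python
import argparse


def build_parser():
    """Command-line interface of a hypothetical analysis step."""
    parser = argparse.ArgumentParser(description="Hypothetical analysis step")
    parser.add_argument("--min-quality", type=int, default=15)
    parser.add_argument("--max-mismatches", type=int, default=3)
    parser.add_argument("--reference", default="reference.fa")
    return parser


def resolved_parameters(argv):
    """Parse argv and return every parameter, including untouched defaults."""
    args = build_parser().parse_args(argv)
    return dict(sorted(vars(args).items()))
```

Calling resolved_parameters(["--min-quality", "20"]) returns all three settings, so the defaults in force at run time are written down even though only one option was given.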

6. Record and provide exact versions of any custom scripts used.

7. Do not reinvent the wheel: use existing software and pipelines when appropriate, to contribute to the development of best practices.

Reproducibility is possible, so why is it not the norm?
Slightly more difficult than not doing it right
Analysts don’t know how to do it right
Fear of being critiqued: “my code is too ugly”
“Why hold myself to a higher standard?”

Tools can only fix so much of the problem
Need to create an expectation of reproducibility
Require authors to make their work reproducible as part of the peer review process
Need to educate analysts

[Slide shows the first page of: Jeffrey T. Leek and Roger D. Peng (Johns Hopkins University, Baltimore, MD), Opinion: Reproducible research can still be wrong: Adopting a prevention approach. PNAS, February 10, 2015. www.pnas.org/cgi/doi/10.1073/pnas.1421412111]

Reproducibility (the ability to recompute results) and replicability (the chances other experimenters will achieve a consistent result) are two foundational characteristics of successful scientific research. Consistent findings from independent investigators are the primary means by which scientific evidence accumulates for or against a hypothesis. Yet, of late, there has been a crisis of confidence among researchers worried about the rate at which studies are either reproducible or replicable. To maintain the integrity of science research and the public’s trust in science, the scientific community must ensure reproducibility and replicability by engaging in a more preventative approach that greatly expands data analysis education and routinely uses software tools.

We define reproducibility as the ability to recompute data analytic results given an observed dataset and knowledge of the data analysis pipeline. The replicability of a study is the chance that an independent experiment targeting the same scientific question will produce a consistent result (1). Concerns among scientists about both have gained significant traction recently due in part to a statistical argument that suggested most published scientific results may be false positives (2). At the same time, there have been some very public failings of reproducibility across a range of disciplines from cancer genomics (3) to economics (4), and the data for many publications have not been made publicly available, raising doubts about the quality of data analyses. Popular press articles have raised questions about the reproducibility of all scientific research (5), and the US Congress has convened hearings focused on the transparency of scientific research (6). The result is that much of the scientific enterprise has been called into question, putting funding and hard won scientific truths at risk.

From a computational perspective, there are three major components to a reproducible and replicable study: (i) the raw data from the experiment are available, (ii) the statistical code and documentation to reproduce the analysis are available, and (iii) a correct data analysis must be performed. Recent cultural shifts in genomics and other areas have had a positive impact on data and code availability. Journals are starting to require data availability as a condition for publication (7), and centralized databases such as the National Center for Biotechnology Information’s Gene Expression Omnibus are being created for depositing data generated by publicly funded scientific experiments. New computational tools such as knitr, iPython notebook, LONI, and Galaxy (8) have simplified the process of distributing reproducible data analyses.

Unfortunately, the mere reproducibility of computational results is insufficient to address the replication crisis because even a reproducible analysis can suffer from many problems (confounding from omitted variables, poor study design, missing data) that threaten the validity and useful interpretation of the results. Although improving the reproducibility of research may increase the rate at which flawed analyses are uncovered, as recent high-profile examples have demonstrated (4), it does not change the fact that problematic research is conducted in the first place. The key question we want to answer when seeing the results of any scientific study is “Can I trust this data analysis?” If we think of problematic data analysis as a disease, reproducibility speeds diagnosis and treatment in the form of screening and rejection of poor data analyses by referees, editors, and other scientists in the community (Fig. 1).

Fig. 1. Peer review and editor evaluation help treat poor data analysis. Education and evidence-based data analysis can be thought of as preventative measures.

Reproducibility is only one part of research integrity
Need widespread education on how to conduct computational analyses that are correct and transparent
Research should be subject to continuous, constructive, and open peer review
Mistakes will be made! Need to create an environment where researchers are willing to be open and transparent enough that these mistakes are found

What about correctness?
What is the right way to analyze data?
How do we establish best practices?

Which variant calling pipeline should I use?
57.5% concordance (O’Rawe et al. 2013)
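
The slide does not say how the concordance figure was computed. As a rough illustration only, one common way to express agreement between pipelines is the fraction of all distinct variant calls that every pipeline reports. The function name and the tuple representation of a variant below are assumptions, and this is not presented as the method used by O’Rawe et al.

```python
def concordance(call_sets):
    """Fraction of all distinct variants that every call set reports.

    Each call set is an iterable of hashable variant records, here assumed
    to be (chromosome, position, reference allele, alternate allele) tuples.
    """
    call_sets = [set(calls) for calls in call_sets]
    if not call_sets:
        raise ValueError("need at least one call set")
    union = set().union(*call_sets)
    if not union:
        return 0.0  # no pipeline called any variant
    shared = set.intersection(*call_sets)
    return len(shared) / len(union)


# Two hypothetical pipelines that agree on two of four distinct calls.
pipeline_a = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
pipeline_b = {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A"), ("chr3", 10, "T", "C")}
print(concordance([pipeline_a, pipeline_b]))  # 0.5
```

A comparison like this is only meaningful if each pipeline can be rerun with recorded versions and parameters, which is the link back to reproducibility.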

Open challenges, or “bake-offs”, are one excellent way to compare and improve different approaches
Being able to run a pipeline reproducibly is essential for fair comparisons

Goal: 1) Estimate the number of subclones within a population, 2) assign mutations to subclasses and reconstruct phylogeny

Automated evaluation: contestants submit workflows
Contestant Evaluation System
Kyle Ellrott, UCSC

For example, MuTect…
The entire MuTect Dockerfile
The workflow describes how to call the program and what additional reference files are needed
Kyle Ellrott, UCSC

ACK
Jeff Leek, Roger Peng, and the rest of the JHU Data Science group
Anton Nekrutenko, Jeremy Goecks, and the rest of the Galaxy Team
Geir Kjetil Sandve and Eivind Hovig
Kyle Ellrott and everyone involved in the ICGC-TCGA DREAM SMC-Het challenge


James Taylor
