Slide 1

Slide 1 text

Teaching genomic data analysis with @jxtx / #usegalaxy https://speakerdeck.com/jxtx

Slide 2

Slide 2 text

0. Background (about me) 1. Rigor and Reproducibility 2. Galaxy 3. Teaching and Training with Galaxy 4. Deployment options

Slide 3

Slide 3 text

0. Background

Slide 4

Slide 4 text

I completed my PhD (CS) at Penn State in 2006, working on methods for understanding gene regulation using comparative genomic data. While there, I started Galaxy with Anton Nekrutenko as a way to facilitate better collaborations between computational and experimental researchers. Since then, I’ve continued (with Anton) to lead the Galaxy project in my groups at Emory and Johns Hopkins.

Slide 5

Slide 5 text

I teach various classes of different types. Fundamentals of Genome Informatics: a one-semester undergraduate class on the core technologies and algorithms of genomics; no programming, uses Galaxy for assignments. Quantitative Biology Bootcamp: a one-week intensive bootcamp for entering Biology PhD students at Hopkins; hands-on learning to work at the UNIX command line and basic Python programming using genomic data examples, followed by a year-long weekly lab.

Slide 6

Slide 6 text

CSHL Computational Genomics: a two-week course that has been taught at CSHL for decades; covers a wide variety of topics in comparative and computational genomics, has used Galaxy since 2010, also covers some UNIX and R/RStudio, project based. Genomic Data Science: I teach one course of nine in the Genomic Data Science Coursera Specialization, a popular MOOC sequence covering many aspects of genomic data analysis. Penguin Coast: …

Slide 7

Slide 7 text

1. Rigor and Reproducibility

Slide 8

Slide 8 text

What happens to traditional research outputs when an area of science rapidly becomes data intensive?

Slide 9

Slide 9 text

Data pipeline, inspired by Leek and Peng, Nature 2015: Idea → Experiment → Raw data → Tidy data → Summarized data → Results, with the corresponding stages Experimental design → Data collection → Data cleaning → Data analysis → Inference. The figure marks "the part we are considering here" versus "the part that ends up in the publication."

Slide 10

Slide 10 text

Questions one might ask about a published analysis: Is the analysis as described correct? Was the analysis performed as described? Can the analysis be re-created exactly?

Slide 11

Slide 11 text

What is reproducibility? (for computational analyses) Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced. Reproducibility is not provenance, reusability/generalizability, or correctness. A minimum standard for evaluating analyses.

Slide 12

Slide 12 text

Microarray Experiment Reproducibility • 18 Nat. Genetics microarray gene expression experiments • Less than 50% reproducible • Problems • missing data (38%) • missing software, hardware details (50%) • missing method, processing details (66%) Ioannidis, J.P.A. et al. Repeatability of published microarray gene expression analyses. Nat Genet 41, 149-155 (2009)

Slide 13

Slide 13 text

http://dx.doi.org/10.1038/nrg3305

Slide 14

Slide 14 text

NGS Re-sequencing Experiment Reproducibility

Slide 15

Slide 15 text

NGS Re-sequencing Experiment Reproducibility

Slide 16

Slide 16 text

• Consider a sample of 50 papers from 2011 that used bwa for mapping (of ~380 published): • 36 did not provide primary data (all but 2 provided it upon request) • 31 provided neither parameters nor versions • 19 provided settings only and 8 listed versions only • Only 7 provided all details

Slide 17

Slide 17 text

A core challenge of reproducibility is identifiability. Given a methods section, can we identify the resources, data, software, … that were actually used?

Slide 18

Slide 18 text

Submitted 2 June 2013. On the reproducibility of science: unique identification of research resources in the biomedical literature. Nicole A. Vasilevsky1, Matthew H. Brush1, Holly Paddock2, Laura Ponting3, Shreejoy J. Tripathy4, Gregory M. LaRocca4 and Melissa A. Haendel1. 1 Ontology Development Group, Library, Oregon Health & Science University, Portland, OR, USA; 2 Zebrafish Information Framework, University of Oregon, Eugene, OR, USA; 3 FlyBase, Department of Genetics, University of Cambridge, Cambridge, UK; 4 Department of Biological Sciences and Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA, USA. ABSTRACT: Scientific reproducibility has been at the forefront of many news stories and there exist numerous initiatives to help address this problem. We posit that a contributor is simply a lack of specificity that is required to enable adequate research reproducibility. In particular, the inability to uniquely identify research resources, such as antibodies and model organisms, makes it difficult or impossible to reproduce experiments even where the science is otherwise sound. In order to better understand the magnitude of this problem, we designed an experiment to ascertain the “identifiability” of research resources in the biomedical literature. We evaluated recent journal articles in the fields of Neuroscience, Developmental Biology, Immunology, Cell and Molecular Biology and General Biology, selected randomly based on a diversity of impact factors for the journals, publishers, and experimental method reporting guidelines. We attempted to uniquely identify model organisms (mouse, rat, zebrafish, worm, fly and yeast), antibodies, knockdown reagents (morpholinos or RNAi), constructs, and cell lines. Specific criteria were developed to determine if a resource was uniquely identifiable, and included examining relevant repositories …

Slide 19

Slide 19 text

Vasilevsky, Nicole; Kavanagh, David J; Deusen, Amy Van; Haendel, Melissa; Iorns, Elizabeth (2014): Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130 32/127 tools 6/41 papers

Slide 20

Slide 20 text

#METHODSMATTER — Figure 1: frequency fluctuation for site 8992 across bwa versions, under default and non-default (-n 3, -q 15) settings (Nekrutenko and Taylor, Nature Reviews Genetics, 2012)

Slide 21

Slide 21 text

Core reproducibility tasks 1. Capture the precise description of the experiment (either as it is being carried out, or after the fact) 2. Assemble all of the necessary data and software dependencies needed by the described experiment 3. Combine the above to verify the analysis

Slide 22

Slide 22 text

Recommendations for performing reproducible computational research Nekrutenko and Taylor, Nature Reviews Genetics, 2012 Sandve, Nekrutenko, Taylor and Hovig, PLoS Computational Biology 2013

Slide 23

Slide 23 text

1. Accept that computation is an integral component of biomedical research. Familiarize yourself with best practices of scientific computing, and implement good computational practices in your group

Slide 24

Slide 24 text

2. Always provide access to raw primary data

Slide 25

Slide 25 text

3. Record versions of all auxiliary datasets used in analysis. Many analyses require data such as genome annotations from external databases that change regularly; either record versions or store a copy of the specific data used.
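
As a minimal sketch of this idea (in Python, with hypothetical file names and URLs), one might record a checksum, source, and retrieval date for each auxiliary dataset alongside the analysis:

```python
import datetime
import hashlib
import json

def record_dataset(path, source_url, release):
    """Record checksum and version information for one auxiliary dataset."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            sha256.update(chunk)
    return {
        "path": path,
        "source_url": source_url,   # hypothetical source
        "release": release,
        "sha256": sha256.hexdigest(),
        "retrieved": datetime.date.today().isoformat(),
    }

# Hypothetical example: a gene annotation file used in the analysis
record = record_dataset("annotation.gtf", "https://example.org/annotation.gtf",
                        "release 25")
with open("auxiliary_data_versions.json", "w") as out:
    json.dump([record], out, indent=2)
```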

Slide 26

Slide 26 text

4. Store the exact versions of all software used. Ideally archive the software to ensure it can be recovered later.

Slide 27

Slide 27 text

5. Record all parameters, even if default values are used. Default settings can change over time, and determining later what those settings were can be difficult.
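
For instance, a small sketch of writing the complete parameter set, defaults included, next to the results (the tool and values are illustrative, not a recommendation):

```python
import json

# Full parameter set for a mapping step, defaults included.
params = {
    "tool": "bwa aln",
    "version": "0.5.9",
    "max_edit_distance_n": 0.04,  # default
    "quality_trimming_q": 0,      # default
    "seed_length_l": 32,          # default
    "threads_t": 4,               # non-default, still recorded
}

with open("mapping_parameters.json", "w") as out:
    json.dump(params, out, indent=2, sort_keys=True)
```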

Slide 28

Slide 28 text

6. Record and provide exact versions of any custom scripts used.
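
One way to capture this, assuming the custom scripts live in a git repository, is to record the exact commit (and flag any uncommitted edits) with the analysis outputs; a minimal sketch:

```python
import subprocess

def script_version(repo_path="."):
    """Return the exact commit of the analysis scripts, flagging local edits."""
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_path, text=True).strip()
    dirty = subprocess.run(["git", "diff", "--quiet"], cwd=repo_path).returncode != 0
    return commit + ("-dirty" if dirty else "")

print("custom scripts at commit:", script_version())
```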

Slide 29

Slide 29 text

7. Do not reinvent the wheel; use existing software and pipelines when appropriate, and contribute to the development of best practices.

Slide 30

Slide 30 text

Is reproducibility achievable?

Slide 31

Slide 31 text

A spectrum of solutions: Analysis environments (Galaxy, GenePattern, …); Workflow systems (Taverna, Pegasus, VisTrails, …); Notebook style (Jupyter notebook, …); Literate programming style (Sweave/knitr, …); System-level provenance capture (ReproZip, …); Complete environment capture (VMs, containers, …)

Slide 32

Slide 32 text

2. Galaxy

Slide 33

Slide 33 text

Goals Accessibility: Eliminate barriers for researchers wanting to use complex methods, make these methods available to everyone Transparency: Facilitate communication of analyses and results in ways that are easy to understand while providing all details Reproducibility: Ensure that analysis performed in the system can be reproduced precisely and practically

Slide 34

Slide 34 text

Galaxy: accessible analysis system

Slide 35

Slide 35 text

A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage Open source software that makes integrating your own tools and data and customizing for your own site simple An open extensible platform for sharing tools, datatypes, workflows, ...

Slide 36

Slide 36 text

Integrating existing tools into a uniform framework • Defined in terms of an abstract interface (inputs and outputs) • In practice, mostly command line tools, a declarative XML description of the interface, how to generate a command line • Designed to be as easy as possible for tool authors, while still allowing rigorous reasoning

Slide 37

Slide 37 text

Galaxy analysis interface • Consistent tool user interfaces automatically generated • History system facilitates and tracks multistep analyses

Slide 38

Slide 38 text

Automatically and transparently tracks every step of every analysis

Slide 39

Slide 39 text

As well as user-generated metadata and annotation...

Slide 40

Slide 40 text

Galaxy workflow system • Workflows can be constructed from scratch or extracted from existing analysis histories • Facilitate reuse, as well as providing precise reproducibility of a complex analysis
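
As a rough illustration of how such a workflow can be re-run programmatically, the sketch below uses BioBlend, the Python client for the Galaxy API; the server URL, API key, and workflow/dataset IDs are placeholders, not values from this talk:

```python
from bioblend.galaxy import GalaxyInstance

# Placeholders: point at your own Galaxy server, API key, and object IDs.
gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

# Create a fresh history to hold the re-run analysis.
history = gi.histories.create_history(name="reproduced-analysis")

# Map each workflow input (by step index) to an existing dataset in Galaxy.
inputs = {"0": {"src": "hda", "id": "DATASET_ID"}}

invocation = gi.workflows.invoke_workflow(
    workflow_id="WORKFLOW_ID",
    inputs=inputs,
    history_id=history["id"],
)
print("workflow invocation:", invocation["id"])
```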

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

Example: Workflow for differential expression analysis of RNA-seq using TopHat/Cufflinks tools

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

Galaxy Pages for publishing analysis

Slide 45

Slide 45 text

Actual histories and datasets directly accessible from the text

Slide 46

Slide 46 text

Histories can be imported and the exact parameters inspected

Slide 47

Slide 47 text

Describe analysis tool behavior abstractly Analysis environment automatically and transparently tracks details Workflow system for complex analysis, constructed explicitly or automatically Pervasive sharing, and publication of documents with integrated analysis

Slide 48

Slide 48 text

Visualization and visual analytics

Slide 49

Slide 49 text

3. Teaching and Training

Slide 50

Slide 50 text

Bioinformatics Learning Curve It’s a long hard climb when you are new at it. (Dave Clements)

Slide 51

Slide 51 text

(Dave Clements) But some things can be avoided when you start: ● Linux install ● Linux Admin ● Command line ● Tool installs The Galaxy Goal: Focus on the questions and techniques, rather than on the compute infrastructure.

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

Galaxy Training Network https://galaxyproject.org/teach/gtn/

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

Separation between content and formatting (Bérénice Batut)

Slide 64

Slide 64 text

Materials organized by topic; topics for different targeted users (Bérénice Batut)

Slide 65

Slide 65 text

Similar structure, content and formats (Bérénice Batut)

Slide 66

Slide 66 text

Similar structure, content and formats (Bérénice Batut)

Slide 67

Slide 67 text

1. 2. 3. 4.

Slide 68

Slide 68 text

Similar structure, content and formats (Bérénice Batut)

Slide 69

Slide 69 text

No content

Slide 70

Slide 70 text

Docker: builds on Linux kernel features enabling complete isolation from the kernel level up. Containers — lightweight environments with isolation enforced at the OS level, complete control over all software. Adds a complete ecosystem for sharing, versioning, and managing containers — Docker Hub.

Slide 71

Slide 71 text

Galaxy + Docker: run every analysis in a clean container — analyses are isolated and the environment is the same every time. Archive that container — containers are lightweight thanks to layers — and the analysis can always be recreated.
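
To make the idea concrete, here is a minimal sketch of running one analysis step in a clean, pinned container using the Docker SDK for Python; the image name, command, and paths are hypothetical, and this is not how Galaxy itself dispatches jobs:

```python
import docker

client = docker.from_env()

# Hypothetical pinned image and command: run one step in an isolated
# environment, mounting only the working directory.
logs = client.containers.run(
    image="example.org/pinned-bwa:0.7.17",
    command="bwa index /data/reference.fa",
    volumes={"/path/to/workdir": {"bind": "/data", "mode": "rw"}},
    remove=True,  # the archived image layers, not this container, carry the environment
)
print(logs.decode())
```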

Slide 72

Slide 72 text

No content

Slide 73

Slide 73 text

Similar structure, content and formats (Bérénice Batut)

Slide 74

Slide 74 text

Work-in-progress: Training materials include example workflows and test data; we now have tools to automatically test these workflows and verify the outputs are correct
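
A simple version of such a check is to compare workflow outputs against the bundled test data; a minimal sketch with hypothetical file names (in practice the Galaxy ecosystem provides dedicated tooling for workflow testing):

```python
import hashlib
from pathlib import Path

def sha256(path):
    """Checksum a file so outputs can be compared to expected test data."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Hypothetical pairing of produced outputs with bundled expected files.
checks = [
    ("results/deseq2_table.tsv", "test-data/expected_deseq2_table.tsv"),
    ("results/counts.tsv", "test-data/expected_counts.tsv"),
]

for produced, expected in checks:
    status = "OK" if sha256(produced) == sha256(expected) else "MISMATCH"
    print(f"{produced}: {status}")
```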

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

No content

Slide 78

Slide 78 text

No content

Slide 79

Slide 79 text

4. Deployment Options

Slide 80

Slide 80 text

Options for Galaxy training include using existing instances (of which there are now many), deploying your own instance at an institution or course level, or having students deploy their own instances. All have strengths and weaknesses depending on the training goals.

Slide 81

Slide 81 text

Galaxy main (http://usegalaxy.org): Leveraging National Cyberinfrastructure. Dedicated resources — TACC, Austin: Galaxy Cluster (Rodeo), 256 cores, 2 TB memory; Corral/Stockyard, 20 PB disk. Shared XSEDE resources — Stampede: 462,462 cores, 205 TB memory; Blacklight and Bridges (PSC, Pittsburgh); PTI, IU Bloomington. Funded by the National Science Foundation Award #ACI-1445604

Slide 82

Slide 82 text

NSF Cloud for research and education, supporting the “long tail of science” where traditional HPC resources have not served well. Jetstream will enable archival of volumes and virtual machines in the IU Scholarworks Archive with DOIs attached, so the exact analysis environment, data, and provenance becomes an archived, citable entity. Funded by the National Science Foundation Award #ACI-1445604

Slide 83

Slide 83 text

90+ Other public Galaxy Servers bit.ly/gxyServers

Slide 84

Slide 84 text

Galaxy main is always under significant load, which can lead to relatively long wait times, especially for larger jobs (sequence mapping, et cetera). This can be fine for homework assignments and exercises completed offline, but difficult for in-person training.

Slide 85

Slide 85 text

Galaxy’s workflow system is robust, flexible, and integrates with nearly any environment. Install locally with many compute environments, or deploy on a cloud using CloudMan or Atmosphere.

Slide 86

Slide 86 text

Galaxy Cloud. CloudMan: general-purpose deployment manager for any cloud; cluster and service management, auto-scaling. CloudBridge: new abstraction library for working with multiple cloud APIs. Genomics Virtual Lab: CloudMan + Galaxy + many other common bioinformatics tools and frameworks.

Slide 87

Slide 87 text

Dedicated Galaxy instances on various Cloud environments

Slide 88

Slide 88 text

Share a snapshot of this instance: complete instances can be archived and shared with others. Reproducibility assistance from CloudMan.

Slide 89

Slide 89 text

No content

Slide 90

Slide 90 text

No content

Slide 91

Slide 91 text

No content

Slide 92

Slide 92 text

All CloudMan and Galaxy features available; nearly unlimited scalability; costs money, but education grants are available. All CloudMan and Galaxy features available; fewer node types and available resources; educational allocations available through XSEDE. (Also available on Google Compute and Azure, but some features like autoscaling are still in development.) Funded by the National Science Foundation Award #ACI-1445604

Slide 93

Slide 93 text

Preparing Cloud Instances for Training (WIP)

Slide 94

Slide 94 text

Similar structure, content and formats (Bérénice Batut)

Slide 95

Slide 95 text

No content

Slide 96

Slide 96 text

Acknowledgements
Team: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Carl Eberhard, Jeremy Goecks, Björn Grüning, Sam Guerler, Mo Heydarian, Jennifer Hillman-Jackson, Anton Nekrutenko, Eric Rasche, Nicola Soranzo, Marius van den Beek
Our lab: Enis Afgan, Dannon Baker, Boris Brenerman, Min Hyung Cho, Dave Clements, Peter DeFord, Sam Guerler, Nathan Roach, Michael E. G. Sauria, German Uritskiy
Collaborators: Craig Stewart and the group; Ross Hardison and the VISION group; Victor Corces (Emory), Karen Reddy (JHU); Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology); Battle, Goff, Langmead, Leek, Schatz, Timp labs (JHU Genomics)
Funding: NHGRI (HG005133, HG004909, HG005542, HG005573, HG006620), NIDDK (DK065806), NSF (DBI 0543285, DBI 0850103); funded by the National Science Foundation Award #ACI-1445604