Teaching genomic data analysis with Galaxy

Teaching genomic data analysis with @jxtx / #usegalaxy https://speakerdeck.com/jxtx

0. Background (about me)
1. Rigor and Reproducibility
2. Galaxy
3. Teaching and Training with Galaxy
4. Deployment options

0. Background

I completed my PhD (CS) at Penn State in 2006, working on methods for understanding gene regulation using comparative genomic data.
While there, I started Galaxy with Anton Nekrutenko as a way to facilitate better collaborations between computational and experimental researchers.
Since then, I’ve continued (with Anton) to lead the Galaxy project in my groups at Emory and Johns Hopkins.

I teach various classes of different types:
Fundamentals of Genome Informatics: a one-semester undergraduate class on the core technologies and algorithms of genomics; no programming, uses Galaxy for assignments.
Quantitative Biology Bootcamp: a one-week intensive bootcamp for entering Biology PhD students at Hopkins; hands-on learning to work at the UNIX command line and basic Python programming, using genomic data examples, followed by a year-long weekly lab.

CSHL Computational Genomics: a two-week course that has been taught at CSHL for decades; covers a wide variety of topics in comparative and computational genomics, has used Galaxy since 2010, also covers some UNIX and R/RStudio, and is project based.
Genomic Data Science: I teach one course of nine in the Genomic Data Science Coursera Specialization, a popular MOOC sequence covering many aspects of genomic data analysis.
Penguin Coast: …

1. Rigor and Reproducibility

What happens to traditional research outputs when an area of science rapidly becomes data intensive?

Data pipeline, inspired by Leek and Peng, Nature 2015
Stages: Idea, Experiment, Raw data, Tidy data, Summarized data, Results
Steps between stages: Experimental design, Data collection, Data cleaning, Data analysis, Inference
Figure annotations: "The part we are considering here" and "The part that ends up in the publication"

Questions one might ask about a published analysis:
Is the analysis as described correct?
Was the analysis performed as described?
Can the analysis be re-created exactly?

What is reproducibility? (for computational analyses)
Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced.
Reproducibility is not provenance, reusability/generalizability, or correctness.
It is a minimum standard for evaluating analyses.

Microarray Experiment Reproducibility
• 18 Nat. Genetics microarray gene expression experiments
• Less than 50% reproducible
• Problems
  • missing data (38%)
  • missing software, hardware details (50%)
  • missing method, processing details (66%)
Ioannidis, J.P.A. et al. Repeatability of published microarray gene expression analyses. Nat Genet 41, 149-155 (2009)

http://dx.doi.org/10.1038/nrg3305

NGS Re-sequencing Experiment Reproducibility

• Consider a sample of 50 papers from 2011 that used bwa for mapping (of ~380 published):
  • 36 did not provide primary data (all but 2 provided it upon request)
  • 31 provide neither parameters nor versions
  • 19 provide settings only and 8 list versions only
  • Only 7 provide all details

A core challenge of reproducibility is identifiability.
Given a methods section, can we actually identify the resources, data, software … that were actually used?

Submitted 2 June 2013
On the reproducibility of science: unique identification of research resources in the biomedical literature
Nicole A. Vasilevsky (1), Matthew H. Brush (1), Holly Paddock (2), Laura Ponting (3), Shreejoy J. Tripathy (4), Gregory M. LaRocca (4) and Melissa A. Haendel (1)
(1) Ontology Development Group, Library, Oregon Health & Science University, Portland, OR, USA
(2) Zebrafish Information Framework, University of Oregon, Eugene, OR, USA
(3) FlyBase, Department of Genetics, University of Cambridge, Cambridge, UK
(4) Department of Biological Sciences and Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA, USA
ABSTRACT: Scientific reproducibility has been at the forefront of many news stories and there exist numerous initiatives to help address this problem. We posit that a contributor is simply a lack of specificity that is required to enable adequate research reproducibility. In particular, the inability to uniquely identify research resources, such as antibodies and model organisms, makes it difficult or impossible to reproduce experiments even where the science is otherwise sound. In order to better understand the magnitude of this problem, we designed an experiment to ascertain the “identifiability” of research resources in the biomedical literature. We evaluated recent journal articles in the fields of Neuroscience, Developmental Biology, Immunology, Cell and Molecular Biology and General Biology, selected randomly based on a diversity of impact factors for the journals, publishers, and experimental method reporting guidelines. We attempted to uniquely identify model organisms (mouse, rat, zebrafish, worm, fly and yeast), antibodies, knockdown reagents (morpholinos or RNAi), constructs, and cell lines. Specific criteria were developed to determine if a resource was uniquely identifiable, and included examining relevant repositories

Vasilevsky, Nicole; Kavanagh, David J; Deusen, Amy Van; Haendel, Melissa;
Iorns, Elizabeth (2014): Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130 32/127 tools 6/41 papers

#METHODSMATTER
Figure 1: Frequency fluctuation for site 8992. Vertical axis: frequency (0.480 to 0.510); horizontal axis: 5.2 through 6.1; series: Default and -n 3 -q 15. (Nekrutenko and Taylor, Nature Reviews Genetics, 2012)

Core reproducibility tasks
1. Capture the precise description of the experiment (either as it is being carried out, or after the fact)
2. Assemble all of the necessary data and software dependencies needed by the described experiment
3. Combine the above to verify the analysis

Recommendations for performing reproducible computational research
Nekrutenko and Taylor, Nature Reviews Genetics, 2012
Sandve, Nekrutenko, Taylor and Hovig, PLoS Computational Biology, 2013

1. Accept that computation is an integral component of biomedical
research. Familiarize yourself with best practices of scientific computing, and implement good computational practices in your group

2. Always provide access to raw primary data

3. Record versions of all auxiliary datasets used in analysis.
Many analyses require data such as genome annotations from external databases that change regularly; either record versions or store a copy of the specific data used.

4. Store the exact versions of all software used. Ideally
archive the software to ensure it can be recovered later.

5. Record all parameters, even if default values are used.
Default settings can change over time and determining what those settings were later can sometimes be difficult.

6. Record and provide exact versions of any custom scripts
used.

7. Do not reinvent the wheel: use existing software and pipelines when appropriate to contribute to the development of best practices.
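Recommendations 3 to 5 can be illustrated with a small provenance record. The following is only a sketch under stated assumptions: the tool name `example_mapper`, its version string, and the parameters in `DEFAULT_PARAMS` are hypothetical and do not describe any real program. It records a checksum for each auxiliary dataset, the software version, and every parameter, including those left at their defaults.

```python
import hashlib
import json
import platform
import tempfile
from pathlib import Path

# Hypothetical defaults for an imaginary tool; not taken from any real program.
DEFAULT_PARAMS = {"max_mismatches": 2, "min_quality": 0}


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_record(tool, version, user_params, dataset_paths):
    """Assemble a provenance record covering recommendations 3 to 5."""
    unknown = set(user_params) - set(DEFAULT_PARAMS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    # Merge so that defaults are written out explicitly, not left implicit.
    params = {**DEFAULT_PARAMS, **user_params}
    return {
        "tool": tool,
        "tool_version": version,
        "python_version": platform.python_version(),
        "parameters": params,
        "datasets": {Path(p).name: sha256_of(p) for p in dataset_paths},
    }


# Usage: a throwaway annotation file stands in for a real auxiliary dataset.
with tempfile.TemporaryDirectory() as tmp:
    annotation = Path(tmp) / "annotation.gtf"
    annotation.write_text("chr1\texample\tgene\t1\t100\n")
    record = build_record(
        "example_mapper", "1.0.0", {"min_quality": 15}, [annotation]
    )
    print(json.dumps(record, indent=2, sort_keys=True))
```

Writing the merged parameters, rather than only the ones the user changed, means the record stays interpretable even if a later release of the tool changes its defaults.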

Is reproducibility achievable?

A spectrum of solutions
Analysis environments (Galaxy, GenePattern, …)
Workflow systems (Taverna, Pegasus, VisTrails, …)
Notebook style (Jupyter notebook, …)
Literate programming style (Sweave/knitr, …)
System level provenance capture (ReproZip, …)
Complete environment capture (VMs, containers, …)

Goals
Accessibility: Eliminate barriers for researchers wanting to use complex methods, make these methods available to everyone
Transparency: Facilitate communication of analyses and results in ways that are easy to understand while providing all details
Reproducibility: Ensure that analysis performed in the system can be reproduced precisely and practically

Galaxy: accessible analysis system

A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
Open source software that makes integrating your own tools and data and customizing for your own site simple
An open extensible platform for sharing tools, datatypes, workflows, ...

Integrating existing tools into a uniform framework
• Defined in terms of an abstract interface (inputs and outputs)
• In practice, mostly command line tools, a declarative XML description of the interface, how to generate a command line
• Designed to be as easy as possible for tool authors, while still allowing rigorous reasoning
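The idea of a declarative interface plus a command line template can be sketched in a few lines. This is an illustration only: the dict layout, the tool `example_filter`, and its flags are hypothetical, and none of it is Galaxy's actual XML schema. It shows how one description can drive both validation and command generation, and also supply what a generated form would need.

```python
import shlex

# Hypothetical tool description: a plain dict standing in for a declarative
# interface definition. This is not Galaxy's actual XML schema.
TOOL = {
    "name": "example_filter",
    "inputs": [
        {"name": "input", "type": "data"},
        {"name": "min_score", "type": "integer", "default": 10},
    ],
    "outputs": [{"name": "output", "type": "data"}],
    "command": "example_filter --min-score {min_score} {input} -o {output}",
}


def build_command(tool, values):
    """Validate values against the declared interface and fill the template."""
    resolved = {}
    for spec in tool["inputs"] + tool["outputs"]:
        name = spec["name"]
        if name in values:
            value = values[name]
        elif "default" in spec:
            value = spec["default"]
        else:
            raise ValueError(f"missing required value: {name}")
        if spec["type"] == "integer":
            value = int(value)
        # Quote every value so file names with spaces stay a single argument.
        resolved[name] = shlex.quote(str(value))
    return tool["command"].format(**resolved)


def describe_form(tool):
    """List the fields a generated user interface would need to show."""
    return [(spec["name"], spec["type"]) for spec in tool["inputs"]]


command = build_command(TOOL, {"input": "reads 1.tab", "output": "out.tab"})
print(command)
print(describe_form(TOOL))
```

Because the interface is data rather than code, the same description can be inspected, recorded alongside a result, or used to rebuild the exact command later.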

Galaxy analysis interface
• Consistent tool user interfaces automatically generated
• History system facilitates and tracks multistep analyses

Automatically and transparently tracks every step of every analysis

As well as user-generated metadata and annotation...

Galaxy workflow system
• Workflows can be constructed from scratch or extracted from existing analysis histories
• Facilitate reuse, as well as providing precise reproducibility of a complex analysis
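Extracting a workflow from a history amounts to turning a log of executed jobs into an ordered graph of steps. The sketch below assumes a simplified, hypothetical history format (a list of dicts with tool names, parameters, and dataset ids); it is not Galaxy's internal representation. It uses the standard library's `graphlib` (Python 3.9 or later) to order the steps.

```python
from graphlib import TopologicalSorter

# Hypothetical history: each entry records the tool that ran, its parameters,
# and which datasets it read and wrote. Dataset 1 was uploaded, not computed.
HISTORY = [
    {"tool": "trim", "params": {"quality": 20}, "inputs": [1], "outputs": [2]},
    {"tool": "map", "params": {"seed": 32}, "inputs": [2], "outputs": [3]},
    {"tool": "stats", "params": {}, "inputs": [2], "outputs": [4]},
    {"tool": "report", "params": {}, "inputs": [3, 4], "outputs": [5]},
]


def extract_workflow(history):
    """Turn recorded jobs into ordered steps with explicit connections."""
    producer = {}  # dataset id -> index of the job that created it
    for index, job in enumerate(history):
        for dataset in job["outputs"]:
            producer[dataset] = index

    # A job depends on whichever jobs produced its inputs.
    depends_on = {
        index: {producer[d] for d in job["inputs"] if d in producer}
        for index, job in enumerate(history)
    }
    # Datasets that no job in the history produced become workflow inputs.
    workflow_inputs = sorted(
        {d for job in history for d in job["inputs"] if d not in producer}
    )
    order = list(TopologicalSorter(depends_on).static_order())
    steps = [
        {
            "tool": history[i]["tool"],
            "params": history[i]["params"],
            "after": sorted(history[j]["tool"] for j in depends_on[i]),
        }
        for i in order
    ]
    return {"inputs": workflow_inputs, "steps": steps}


workflow = extract_workflow(HISTORY)
for step in workflow["steps"]:
    print(step["tool"], step["params"], "after", step["after"])
```

The extracted structure keeps every parameter from the original run, which is what allows the same analysis to be repeated on new input data.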

Example: Workflow for differential expression analysis of RNA-seq using TopHat/Cufflinks tools

Galaxy Pages for publishing analysis

Actual histories and datasets directly accessible from the text

Histories can be imported and the exact parameters inspected

Describe analysis tool behavior abstractly
Analysis environment automatically and transparently tracks details
Workflow system for complex analysis, constructed explicitly or automatically
Pervasive sharing, and publication of documents with integrated analysis

Visualization and visual analytics

3. Teaching and Training

Bioinformatics Learning Curve It’s a long hard climb when you
are new at it. (Dave Clements)

But some things can be avoided when you start:
• Linux install
• Linux admin
• Command line
• Tool installs
The Galaxy Goal: Focus on the questions and techniques, rather than on the compute infrastructure. (Dave Clements)

Galaxy Training Network https://galaxyproject.org/teach/gtn/

Separation between content and formatting (Bérénice Batut)

Materials organized by topic: topics for different targeted users (Bérénice Batut)

Similar structure, content and formats (Bérénice Batut)


Docker
Builds on Linux kernel features enabling complete isolation from the kernel level up
Containers: lightweight environments with isolation enforced at the OS level, complete control over all software
Adds a complete ecosystem for sharing, versioning, and managing containers: Docker Hub

Galaxy + Docker
Run every analysis in a clean container: analyses are isolated and the environment is the same every time
Archive that container (containers are lightweight thanks to layers) and the analysis can always be recreated
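Running each analysis step in a fresh container can be sketched as building a `docker run` invocation. This is an assumption-laden illustration, not how Galaxy itself launches containers: the image `example/mapper:1.0.0`, the tool arguments, and the job directory are hypothetical. Only standard `docker run` options are used (`--rm`, `-v`, `-w`), and the block builds the command without executing it.

```python
import shlex


def docker_command(image, tool_args, data_dir, workdir="/data"):
    """Build an argument list for running one analysis step in a container."""
    name = image.rsplit("/", 1)[-1]
    if ":" not in name and "@" not in name:
        # Without a tag or digest the image could change between runs.
        raise ValueError("pin the image with a tag or digest")
    mount = f"{data_dir}:{workdir}"
    return [
        "docker", "run",
        "--rm",        # discard the container afterwards; each run starts clean
        "-v", mount,   # expose only this job's data directory
        "-w", workdir, # run the tool inside the mounted directory
        image,
        *tool_args,
    ]


# Hypothetical image and tool; substitute whatever the analysis really uses.
cmd = docker_command(
    "example/mapper:1.0.0",
    ["example_mapper", "--min-quality", "15", "reads.fastq"],
    "/tmp/job42",
)
print(shlex.join(cmd))
```

On a machine with Docker installed, the list could be passed to `subprocess.run(cmd, check=True)`. Requiring a pinned tag or digest is the part that ties the run to one specific, archivable image.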


Work-in-progress: Training materials include example workflows and test data; we
now have tools to automatically test these workflows and verify the outputs are correct
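The general shape of such a check can be sketched as comparing each expected output file with what a workflow run produced. This is a minimal illustration, not the project's actual testing tool: the directory layout, the file names, and the `allowed_changes` tolerance are all assumptions made for the example.

```python
import difflib
import tempfile
from pathlib import Path


def count_changed_lines(expected_text, actual_text):
    """Count added or removed lines between two text outputs."""
    diff = difflib.unified_diff(
        expected_text.splitlines(), actual_text.splitlines(), lineterm="", n=0
    )
    # The first two lines of a non-empty unified diff are the file headers.
    body = list(diff)[2:]
    return sum(1 for line in body if line.startswith(("+", "-")))


def check_outputs(expected_dir, actual_dir, allowed_changes=0):
    """Compare every expected file against the file a workflow run produced."""
    failures = {}
    for expected in sorted(Path(expected_dir).iterdir()):
        actual = Path(actual_dir) / expected.name
        if not actual.exists():
            failures[expected.name] = "missing output"
            continue
        changed = count_changed_lines(expected.read_text(), actual.read_text())
        if changed > allowed_changes:
            failures[expected.name] = f"{changed} changed lines"
    return failures


# Usage with throwaway files standing in for test data and workflow outputs.
with tempfile.TemporaryDirectory() as tmp:
    expected_dir = Path(tmp, "expected")
    actual_dir = Path(tmp, "actual")
    expected_dir.mkdir()
    actual_dir.mkdir()
    (expected_dir / "counts.tsv").write_text("geneA\t10\ngeneB\t7\n")
    (actual_dir / "counts.tsv").write_text("geneA\t10\ngeneB\t8\n")
    (expected_dir / "summary.txt").write_text("2 genes\n")
    strict = check_outputs(expected_dir, actual_dir)
    print(strict)
```

A small tolerance can be useful when outputs embed timestamps or version banners, but a tolerance of zero is the stricter default for training data that is meant to be deterministic.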

4. Deployment Options

Options for Galaxy Training include using existing instances (of which there are now many), deploying your own instance at an institution or course level, or having students deploy their own instances. All have strengths and weaknesses depending on the training goals.

Galaxy main (http://usegalaxy.org): Leveraging National Cyberinfrastructure
Diagram of dedicated resources and shared XSEDE resources across PSC (Pittsburgh), TACC (Austin), and PTI (IU Bloomington):
• Stampede: 462,462 cores, 205 TB memory
• Blacklight
• Bridges
• Galaxy Cluster (Rodeo): 256 cores, 2 TB memory
• Corral/Stockyard: 20 PB disk
Funded by the National Science Foundation Award #ACI-1445604

NSF Cloud for research and education
Support the “long tail of science” where traditional HPC resources have not served well
Jetstream will enable archival of volumes and virtual machines in the IU ScholarWorks Archive with DOIs attached
Exact analysis environment, data, and provenance become an archived, citable entity

90+ Other public Galaxy Servers bit.ly/gxyServers

Galaxy main is always under significant load, which can lead to relatively long wait times, especially for larger jobs (sequence mapping, et cetera). This can be fine for homework assignments and exercises completed offline, but is difficult for in-person training.

Galaxy’s workflow system is robust, flexible, and integrates with nearly any environment
Install locally with many compute environments
Deploy on a cloud using CloudMan
Atmosphere

CloudMan: General purpose deployment manager for any cloud. Cluster and service management, auto-scaling
CloudBridge: New abstraction library for working with multiple cloud APIs
Genomics Virtual Lab: CloudMan + Galaxy + many other common bioinformatics tools and frameworks
Galaxy Cloud

Dedicated Galaxy instances on various Cloud environments

Share a snapshot of this instance
Complete instances can be archived and shared with others
Reproducibility assistance from CloudMan

Two cloud options compared:
• First: all CloudMan and Galaxy features available; nearly unlimited scalability; costs money, but education grants are available
• Second: all CloudMan and Galaxy features available; fewer node types and available resources; educational allocations available through XSEDE
(Also available on Google Compute and Azure, but some features like autoscaling are still in development)

Preparing Cloud Instances for Training (WIP)


Acknowledgements
Team: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Carl Eberhard, Jeremy Goecks, Björn Grüning, Sam Guerler, Mo Heydarian, Jennifer Hillman-Jackson, Anton Nekrutenko, Eric Rasche, Nicola Soranzo, Marius van den Beek
Our lab: Enis Afgan, Dannon Baker, Boris Brenerman, Min Hyung Cho, Dave Clements, Peter DeFord, Sam Guerler, Nathan Roach, Michael E. G. Sauria, German Uritskiy
Collaborators:
Craig Stewart and the group
Ross Hardison and the VISION group
Victor Corces (Emory), Karen Reddy (JHU)
Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology)
Battle, Goff, Langmead, Leek, Schatz, Timp labs (JHU Genomics)
NHGRI (HG005133, HG004909, HG005542, HG005573, HG006620), NIDDK (DK065806) and NSF (DBI 0543285, DBI 0850103); funded by the National Science Foundation Award #ACI-1445604

More Decks by James Taylor
