Teaching genomic data analysis with Galaxy

A little bit on Galaxy and the Training Network for "Big Genomic Data Skills Training for Professors" at Jackson Labs.

James Taylor

May 23, 2018

Transcript

  1. Teaching genomic data analysis with Galaxy. @jxtx / #usegalaxy https://speakerdeck.com/jxtx

  2. 0. Background (about me)
    1. Rigor and Reproducibility
    2. Galaxy
    3. Teaching and Training with Galaxy
    4. Deployment options
  3. 0. Background

  4. I completed my PhD (CS) at Penn State in 2006, working on methods for understanding gene regulation using comparative genomic data.
    While there, I started Galaxy with Anton Nekrutenko as a way to facilitate better collaborations between computational and experimental researchers.
    Since then, I’ve continued (with Anton) to lead the Galaxy project in my groups at Emory and Johns Hopkins.
  5. I teach various classes of different types
    Fundamentals of Genome Informatics: one-semester undergraduate class on the core technologies and algorithms of genomics; no programming, uses Galaxy for assignments.
    Quantitative Biology Bootcamp: a one-week intensive bootcamp for entering Biology PhD students at Hopkins; hands-on learning to work at the UNIX command line and basic Python programming, using genomic data examples, followed by a year-long weekly lab.
  6. CSHL Computational Genomics: two-week course that has been taught at CSHL for decades; covers a wide variety of topics in comparative and computational genomics, has used Galaxy since 2010, also covers some UNIX and R/RStudio; project based.
    Genomic Data Science: I teach one course of nine in the Genomic Data Science Coursera Specialization, a popular MOOC sequence covering many aspects of genomic data analysis.
    Penguin Coast: …
  7. 1. Rigor and Reproducibility

  8. What happens to traditional research outputs when an area of science rapidly becomes data intensive?
  9. Data pipeline, inspired by Leek and Peng, Nature 2015
    Stages: Idea, Experiment, Raw data, Tidy data, Summarized data, Results
    Steps: Experimental design, Data collection, Data cleaning, Data analysis, Inference
    Annotations: "The part we are considering here"; "The part that ends up in the Publication"
  10. Questions one might ask about a published analysis
    • Is the analysis as described correct?
    • Was the analysis performed as described?
    • Can the analysis be re-created exactly?
  11. What is reproducibility? (for computational analyses)
    • Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced
    • Reproducibility is not provenance, reusability/generalizability, or correctness
    • A minimum standard for evaluating analyses
  12. Microarray Experiment Reproducibility
    • 18 Nat. Genetics microarray gene expression experiments
    • Less than 50% reproducible
    • Problems
      • missing data (38%)
      • missing software, hardware details (50%)
      • missing method, processing details (66%)
    Ioannidis, J.P.A. et al. Repeatability of published microarray gene expression analyses. Nat Genet 41, 149-155 (2009)
  13. http://dx.doi.org/10.1038/nrg3305

  14. NGS Re-sequencing Experiment Reproducibility

  16. Consider a sample of 50 papers from 2011 that used bwa for mapping (of ~380 published):
    • 36 did not provide primary data (all but 2 provided it upon request)
    • 31 provide neither parameters nor versions
    • 19 provide settings only and 8 list versions only
    • Only 7 provide all details
  17. A core challenge of reproducibility is identifiability
    Given a methods section, can we actually identify the resources, data, software … that were actually used?
  18. On the reproducibility of science: unique identification of research resources in the biomedical literature (submitted 2 June 2013)
    Nicole A. Vasilevsky (1), Matthew H. Brush (1), Holly Paddock (2), Laura Ponting (3), Shreejoy J. Tripathy (4), Gregory M. LaRocca (4) and Melissa A. Haendel (1)
    (1) Ontology Development Group, Library, Oregon Health & Science University, Portland, OR, USA
    (2) Zebrafish Information Framework, University of Oregon, Eugene, OR, USA
    (3) FlyBase, Department of Genetics, University of Cambridge, Cambridge, UK
    (4) Department of Biological Sciences and Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA, USA
    ABSTRACT: Scientific reproducibility has been at the forefront of many news stories and there exist numerous initiatives to help address this problem. We posit that a contributor is simply a lack of specificity that is required to enable adequate research reproducibility. In particular, the inability to uniquely identify research resources, such as antibodies and model organisms, makes it difficult or impossible to reproduce experiments even where the science is otherwise sound. In order to better understand the magnitude of this problem, we designed an experiment to ascertain the “identifiability” of research resources in the biomedical literature. We evaluated recent journal articles in the fields of Neuroscience, Developmental Biology, Immunology, Cell and Molecular Biology and General Biology, selected randomly based on a diversity of impact factors for the journals, publishers, and experimental method reporting guidelines. We attempted to uniquely identify model organisms (mouse, rat, zebrafish, worm, fly and yeast), antibodies, knockdown reagents (morpholinos or RNAi), constructs, and cell lines. Specific criteria were developed to determine if a resource was uniquely identifiable, and included examining relevant repositories
  19. Vasilevsky, Nicole; Kavanagh, David J; Deusen, Amy Van; Haendel, Melissa; Iorns, Elizabeth (2014): Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130
    32/127 tools
    6/41 papers
  20. #METHODSMATTER
    [Figure 1: Frequency fluctuation for site 8992. Frequency axis from 0.480 to 0.510; horizontal axis ticks from 5.2 to 6.1; series labeled "Default" and "-n 3 -q 15".]
    (Nekrutenko and Taylor, Nature Reviews Genetics, 2012)
  21. Core reproducibility tasks
    1. Capture the precise description of the experiment (either as it is being carried out, or after the fact)
    2. Assemble all of the necessary data and software dependencies needed by the described experiment
    3. Combine the above to verify the analysis
  22. Recommendations for performing reproducible computational research
    Nekrutenko and Taylor, Nature Reviews Genetics, 2012
    Sandve, Nekrutenko, Taylor and Hovig, PLoS Computational Biology 2013
  23. 1. Accept that computation is an integral component of biomedical research. Familiarize yourself with best practices of scientific computing, and implement good computational practices in your group.
  24. 2. Always provide access to raw primary data

  25. 3. Record versions of all auxiliary datasets used in analysis. Many analyses require data, such as genome annotations, from external databases that change regularly; either record versions or store a copy of the specific data used.
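One way to act on this recommendation is to store a checksum next to the source and version label of each auxiliary file. The following is a minimal sketch, not something shown in the talk: the function name `describe_dataset`, the record fields, and the demo file are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def describe_dataset(path, source, version):
    """Return a small record identifying the exact file that was used."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large annotation files need not fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "file": Path(path).name,
        "source": source,    # where the file was obtained (free text)
        "version": version,  # release label reported by the provider
        "sha256": digest.hexdigest(),
    }


if __name__ == "__main__":
    # Hypothetical stand-in for a downloaded annotation file.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "annotation.gtf"
        demo.write_text("chr1\texample\tgene\t1\t100\n")
        print(describe_dataset(demo, source="example provider", version="release-1"))
```

The checksum lets a later reader confirm that a file with the same name and version label really has the same content.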
  26. 4. Store the exact versions of all software used. Ideally archive the software to ensure it can be recovered later.
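For analyses written in Python, part of this can be automated. The sketch below records the interpreter, the platform, and the installed versions of a list of packages; the function name `capture_versions` and the package names in the demo are illustrative assumptions, and command-line tools outside Python would need their own version capture.

```python
import json
import platform
import sys
from importlib import metadata


def capture_versions(packages):
    """Record interpreter, platform, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Package names here are only examples.
    print(json.dumps(capture_versions(["pip", "numpy"]), indent=2))
```

Saving this record with the results documents the versions; archiving the software itself is a separate step.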
  27. 5. Record all parameters, even if default values are used. Default settings can change over time and determining what those settings were later can sometimes be difficult.
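A simple way to follow this is to resolve every parameter explicitly before running a tool and to store the result, noting which values came from defaults. This is a sketch under stated assumptions: the parameter names and default values below belong to an imaginary tool and are not taken from the talk or from any real program.

```python
import json

# Hypothetical defaults for an imaginary mapping tool; real tools differ.
DEFAULTS = {"max_mismatches": 2, "min_quality": 20, "threads": 1}


def resolve_parameters(user_supplied, defaults=DEFAULTS):
    """Merge user settings over defaults and note where each value came from."""
    unknown = set(user_supplied) - set(defaults)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    resolved = {}
    for name, default in defaults.items():
        if name in user_supplied:
            resolved[name] = {"value": user_supplied[name], "origin": "user"}
        else:
            resolved[name] = {"value": default, "origin": "default"}
    return resolved


if __name__ == "__main__":
    print(json.dumps(resolve_parameters({"min_quality": 15}), indent=2))
```

Because the defaults are written into the record, the analysis stays interpretable even if a later release of the tool changes them.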
  28. 6. Record and provide exact versions of any custom scripts used.
  29. 7. Do not reinvent the wheel: use existing software and pipelines when appropriate, to contribute to the development of best practices.
  30. Is reproducibility achievable?

  31. A spectrum of solutions
    • Analysis environments (Galaxy, GenePattern, …)
    • Workflow systems (Taverna, Pegasus, VisTrails, …)
    • Notebook style (Jupyter notebook, …)
    • Literate programming style (Sweave/knitr, …)
    • System level provenance capture (ReproZip, …)
    • Complete environment capture (VMs, containers, …)
  32. 2. Galaxy

  33. Goals
    • Accessibility: Eliminate barriers for researchers wanting to use complex methods, make these methods available to everyone
    • Transparency: Facilitate communication of analyses and results in ways that are easy to understand while providing all details
    • Reproducibility: Ensure that analyses performed in the system can be reproduced precisely and practically
  34. Galaxy: accessible analysis system

  35. • A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
    • Open source software that makes integrating your own tools and data and customizing for your own site simple
    • An open extensible platform for sharing tools, datatypes, workflows, ...
  36. Integrating existing tools into a uniform framework
    • Defined in terms of an abstract interface (inputs and outputs)
    • In practice, mostly command-line tools, each with a declarative XML description of the interface and of how to generate a command line
    • Designed to be as easy as possible for tool authors, while still allowing rigorous reasoning
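The idea of a declarative description that yields a command line can be sketched in a few lines. The XML below is a deliberately simplified, hypothetical format invented for this illustration; it is not Galaxy's actual tool schema, and `build_command` is not a Galaxy function.

```python
import shlex
import xml.etree.ElementTree as ET

# A deliberately simplified, hypothetical description; not Galaxy's real schema.
TOOL_XML = """
<tool id="count_lines" version="0.1">
  <command>wc -l {input} &gt; {output}</command>
  <inputs>
    <param name="input" type="data"/>
  </inputs>
  <outputs>
    <data name="output"/>
  </outputs>
</tool>
"""


def build_command(tool_xml, values):
    """Fill the command template using only the declared inputs and outputs."""
    root = ET.fromstring(tool_xml)
    declared = [p.get("name") for p in root.findall("./inputs/param")]
    declared += [d.get("name") for d in root.findall("./outputs/data")]
    missing = [name for name in declared if name not in values]
    if missing:
        raise KeyError(f"missing values for: {missing}")
    template = root.findtext("command").strip()
    # Quote each value so file names with spaces stay a single shell word.
    quoted = {name: shlex.quote(str(values[name])) for name in declared}
    return template.format(**quoted)


if __name__ == "__main__":
    print(build_command(TOOL_XML, {"input": "reads.txt", "output": "counts.txt"}))
```

Because the interface is data rather than code, the same description could in principle also drive a generated user interface and a record of exactly what was run.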
  37. Galaxy analysis interface
    • Consistent tool user interfaces automatically generated
    • History system facilitates and tracks multistep analyses
  38. Automatically and transparently tracks every step of every analysis

  39. As well as user-generated metadata and annotation...

  40. Galaxy workflow system
    • Workflows can be constructed from scratch or extracted from existing analysis histories
    • Facilitate reuse, as well as providing precise reproducibility of a complex analysis
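Extracting a workflow from a history can be pictured as walking backwards from a result through the recorded steps. The toy model below is only an illustration of that idea; the `Step` and `History` classes, the tool names, and the parameters are hypothetical and do not reflect Galaxy's internal data model.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str
    version: str
    params: dict
    inputs: list   # names of datasets consumed
    output: str    # name of the dataset produced


@dataclass
class History:
    steps: list = field(default_factory=list)

    def run(self, tool, version, params, inputs, output):
        """Record a step; a real system would also execute the tool here."""
        self.steps.append(Step(tool, version, dict(params), list(inputs), output))
        return output

    def extract_workflow(self, target):
        """Return the steps needed to produce `target`, in execution order."""
        producers = {step.output: step for step in self.steps}
        ordered, seen = [], set()

        def visit(name):
            step = producers.get(name)
            if step is None or name in seen:
                return  # an uploaded dataset, or a step already handled
            seen.add(name)
            for parent in step.inputs:
                visit(parent)
            ordered.append(step)

        visit(target)
        return ordered


if __name__ == "__main__":
    h = History()
    h.run("trim", "1.0", {"min_len": 20}, ["reads"], "trimmed")
    h.run("stats", "0.3", {}, ["reads"], "report")  # side branch, not needed below
    h.run("map", "2.1", {"seed": 7}, ["trimmed", "genome"], "alignments")
    print([s.tool for s in h.extract_workflow("alignments")])  # ['trim', 'map']
```

The extracted list keeps tool versions and parameters, which is what allows the same analysis to be rerun on new inputs.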
  42. Example: Workflow for differential expression analysis of RNA-seq using Tophat/Cufflinks tools
  44. Galaxy Pages for publishing analysis

  45. Actual histories and datasets directly accessible from the text

  46. Histories can be imported and the exact parameters inspected

  47. • Describe analysis tool behavior abstractly
    • Analysis environment automatically and transparently tracks details
    • Workflow system for complex analysis, constructed explicitly or automatically
    • Pervasive sharing, and publication of documents with integrated analysis
  48. Visualization and visual analytics

  49. 3. Teaching and Training

  50. Bioinformatics Learning Curve
    It’s a long hard climb when you are new at it. (Dave Clements)
  51. But some things can be avoided when you start:
    • Linux install
    • Linux Admin
    • Command line
    • Tool installs
    The Galaxy Goal: Focus on the questions and techniques, rather than on the compute infrastructure. (Dave Clements)
  53. Galaxy Training Network https://galaxyproject.org/teach/gtn/

  63. Separation between content and formatting (Bérénice Batut)
  64. Materials organized by topic
    Topics for different targeted users (Bérénice Batut)
  65. Similar structure, content and formats (Bérénice Batut)
  70. Docker
    • Builds on Linux kernel features enabling complete isolation from the kernel level up
    • Containers: lightweight environments with isolation enforced at the OS level, complete control over all software
    • Adds a complete ecosystem for sharing, versioning, and managing containers: Docker Hub
  71. Galaxy + Docker
    • Run every analysis in a clean container: analyses are isolated and the environment is the same every time
    • Archive that container (containers are lightweight thanks to layers) and the analysis can always be recreated
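Running each step in a fresh container amounts to wrapping the tool's command in a `docker run` call. The sketch below only builds the argument list and does not execute it; the helper name `docker_command`, the image name `example/aligner:1.0`, and the `/data` mount point are assumptions for illustration and say nothing about how Galaxy itself launches containers.

```python
import shlex
from pathlib import Path


def docker_command(image, tool_argv, workdir):
    """Build (but do not execute) a `docker run` invocation for one analysis step."""
    host_dir = Path(workdir).resolve()
    return [
        "docker", "run",
        "--rm",                     # discard the container when the step ends
        "-v", f"{host_dir}:/data",  # expose only this step's working directory
        "-w", "/data",              # run the tool inside that directory
        image,                      # ideally a pinned tag or digest
        *tool_argv,
    ]


if __name__ == "__main__":
    # Hypothetical image and tool names.
    cmd = docker_command("example/aligner:1.0", ["aligner", "--in", "reads.fq"], ".")
    print(shlex.join(cmd))
```

Recording the image reference together with the command is what makes the step repeatable later.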
  74. Work-in-progress: Training materials include example workflows and test data; we now have tools to automatically test these workflows and verify the outputs are correct
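The general pattern of such a test is to run a workflow on bundled test data and compare the result with a stored expectation. The sketch below shows that pattern with a toy stand-in for a workflow; `toy_workflow`, `run_test`, and the test data are hypothetical and are not the Galaxy Training Network's actual testing tools.

```python
def toy_workflow(lines):
    """Stand-in for a real workflow: drop comments, then count records per chromosome."""
    counts = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom = line.split("\t")[0]
        counts[chrom] = counts.get(chrom, 0) + 1
    return counts


def run_test(workflow, test_input, expected):
    """Run the workflow on test data; return {key: (actual, expected)} for mismatches."""
    actual = workflow(test_input)
    keys = sorted(set(actual) | set(expected))
    return {
        key: (actual.get(key), expected.get(key))
        for key in keys
        if actual.get(key) != expected.get(key)
    }


if __name__ == "__main__":
    test_data = ["#header", "chr1\t10\t20", "chr1\t30\t40", "chr2\t5\t9"]
    failures = run_test(toy_workflow, test_data, {"chr1": 2, "chr2": 1})
    print("PASS" if not failures else f"FAIL: {failures}")
```

An empty result means every output matched; any entry names the output that drifted and shows both values.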
  79. 4. Deployment Options

  80. Options for Galaxy Training include using existing instances (of which there are now many), deploying your own instance at an institution or course level, or having students deploy their own instances. All have strengths and weaknesses depending on the training goals.
  81. Galaxy main (http://usegalaxy.org): Leveraging National Cyberinfrastructure
    Sites: PSC, Pittsburgh; TACC, Austin; PTI, IU Bloomington
    Shared XSEDE resources:
    • Stampede: 462,462 cores, 205 TB memory
    • Blacklight
    • Bridges
    Dedicated resources:
    • Galaxy Cluster (Rodeo): 256 cores, 2 TB memory
    • Corral/Stockyard: 20 PB disk
    funded by the National Science Foundation Award #ACI-1445604
  82. NSF Cloud for research and education
    • Support the “long tail of science” where traditional HPC resources have not served well
    • Jetstream will enable archival of volumes and virtual machines in the IU Scholarworks Archive with DOIs attached
    • Exact analysis environment, data, and provenance become an archived, citable entity
    funded by the National Science Foundation Award #ACI-1445604
  83. 90+ Other public Galaxy Servers bit.ly/gxyServers

  84. Galaxy main is always under significant load, which can lead to relatively long wait times, especially for larger jobs (sequence mapping, et cetera). Can be fine for homework assignments and exercises completed offline, but difficult for in-person training.
  85. Galaxy’s workflow system is robust, flexible, and integrates with nearly any environment
    • Install locally with many compute environments
    • Deploy on a cloud using CloudMan
    • Atmosphere
  86. Galaxy Cloud
    • CloudMan: General purpose deployment manager for any cloud. Cluster and service management, auto-scaling
    • CloudBridge: New abstraction library for working with multiple cloud APIs
    • Genomics Virtual Lab: CloudMan + Galaxy + many other common bioinformatics tools and frameworks
  87. Dedicated Galaxy instances on various Cloud environments

  88. Reproducibility assistance from CloudMan
    • Share a snapshot of this instance
    • Complete instances can be archived and shared with others
  92. • All CloudMan and Galaxy features available
    • Nearly unlimited scalability
    • Costs money, but education grants are available
    funded by the National Science Foundation Award #ACI-1445604
    • All CloudMan and Galaxy features available
    • Fewer node types and available resources
    • Educational allocations available through XSEDE
    (Also available on Google Compute and Azure, but some features like autoscaling are still in development)
  93. Preparing Cloud Instances for Training (WIP)

  96. Acknowledgements
    Team: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Carl Eberhard, Jeremy Goecks, Björn Grüning, Sam Guerler, Mo Heydarian, Jennifer Hillman-Jackson, Anton Nekrutenko, Eric Rasche, Nicola Soranzo, Marius van den Beek
    Our lab: Enis Afgan, Dannon Baker, Boris Brenerman, Min Hyung Cho, Dave Clements, Peter DeFord, Sam Guerler, Nathan Roach, Michael E. G. Sauria, German Uritskiy
    Collaborators:
    • Craig Stewart and the group
    • Ross Hardison and the VISION group
    • Victor Corces (Emory), Karen Reddy (JHU)
    • Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology)
    • Battle, Goff, Langmead, Leek, Schatz, Timp labs (JHU Genomics)
    NHGRI (HG005133, HG004909, HG005542, HG005573, HG006620), NIDDK (DK065806) and NSF (DBI 0543285, DBI 0850103)
    funded by the National Science Foundation Award #ACI-1445604