
Teaching genomic data analysis with Galaxy


A little bit on Galaxy and the Training Network for "Big Genomic Data Skills Training for Professors" at Jackson Labs.

James Taylor

May 23, 2018


Transcript

  1. Teaching genomic data analysis with Galaxy
    @jxtx / #usegalaxy
    https://speakerdeck.com/jxtx


  2. 0. Background (about me)
    1. Rigor and Reproducibility
    2. Galaxy
    3. Teaching and Training with Galaxy
    4. Deployment options


  3. 0. Background


  4. I completed my PhD (CS) at Penn State in 2006
    working on methods for understanding gene
    regulation using comparative genomic data
    While there, I started Galaxy with Anton
    Nekrutenko as a way to facilitate better
    collaborations between computational and
    experimental researchers
    Since then, I’ve continued (with Anton) to lead
    the Galaxy project in my groups at Emory and
    Johns Hopkins


  5. I teach several classes of different types
    Fundamentals of Genome Informatics: one
    semester undergraduate class on the core
    technologies and algorithms of genomics: no
    programming, uses Galaxy for assignments.
    Quantitative Biology Bootcamp: a one week
    intensive bootcamp for entering Biology PhD
    students at Hopkins: hands on learning to
    work at the UNIX command line, basic Python
    programming, using genomic data examples,
    followed by a year long weekly lab.


  6. CSHL Computational Genomics: Two week
    course that has been taught at CSHL for decades,
    covers a wide variety of topics in comparative
    and computational genomics, has used Galaxy
    since 2010, also covers some UNIX, R/RStudio,
    project based.
    Genomic Data Science: I teach one course of
    nine in the Genomic Data Science Coursera
    Specialization, a popular MOOC sequence
    covering many aspects of genomic data analysis.
    Penguin Coast: …


  7. 1. Rigor and Reproducibility


  8. What happens to traditional research outputs
    when an area of science rapidly becomes data
    intensive?


  9. Data Pipeline, inspired by Leek and Peng, Nature 2015:
    Idea → Experiment → Raw Data → Tidy Data → Summarized Data → Results,
    connected by experimental design, data collection, data cleaning,
    data analysis, and inference. Only a portion of this pipeline is
    considered here, and only a portion ends up in the publication.


  10. Questions one might ask about a
    published analysis
    Is the analysis as described correct?
    Was the analysis performed as described?
    Can the analysis be re-created exactly?


  11. What is reproducibility?
    (for computational analyses)
    Reproducibility means that an analysis is
    described/captured in sufficient detail that it can
    be precisely reproduced
    Reproducibility is not provenance, reusability/
    generalizability, or correctness
    A minimum standard for evaluating analyses


  12. Microarray Experiment Reproducibility
    • 18 Nat. Genetics microarray gene expression
    experiments
    • Less than 50% reproducible
    • Problems
    • missing data (38%)
    • missing software, hardware details (50%)
    • missing method, processing details (66%)
    Ioannidis, J.P.A. et al. Repeatability of published microarray gene expression analyses. Nat Genet 41, 149-155
    (2009)


  13. http://dx.doi.org/10.1038/nrg3305


  14. NGS Re-sequencing Experiment Reproducibility


  15. NGS Re-sequencing Experiment Reproducibility


  16. • Consider a sample of 50 papers from 2011 that
    used bwa for mapping (of ~380 published):
    • 36 did not provide primary data (all but 2
    provided it upon request)
    • 31 provided neither parameters nor
    versions
    • 19 provided settings only and 8 listed versions
    only
    • Only 7 provided all details


  17. A core challenge of reproducibility is
    identifiability
    Given a methods section, can we identify
    the resources, data, software …
    that were actually used?


  18. Submitted 2 June 2013
    On the reproducibility of science: unique
    identification of research resources in the
    biomedical literature
    Nicole A. Vasilevsky1, Matthew H. Brush1, Holly Paddock2,
    Laura Ponting3, Shreejoy J. Tripathy4, Gregory M. LaRocca4 and
    Melissa A. Haendel1
    1 Ontology Development Group, Library, Oregon Health & Science University, Portland,
    OR, USA
    2 Zebrafish Information Framework, University of Oregon, Eugene, OR, USA
    3 FlyBase, Department of Genetics, University of Cambridge, Cambridge, UK
    4 Department of Biological Sciences and Center for the Neural Basis of Cognition,
    Carnegie Mellon University, Pittsburgh, PA, USA
    ABSTRACT
    Scientific reproducibility has been at the forefront of many news stories and there
    exist numerous initiatives to help address this problem. We posit that a contributor
    is simply a lack of specificity that is required to enable adequate research
    reproducibility. In particular, the inability to uniquely identify research resources, such
    as antibodies and model organisms, makes it difficult or impossible to reproduce
    experiments even where the science is otherwise sound. In order to better understand
    the magnitude of this problem, we designed an experiment to ascertain the “iden-
    tifiability” of research resources in the biomedical literature. We evaluated recent
    journal articles in the fields of Neuroscience, Developmental Biology, Immunology,
    Cell and Molecular Biology and General Biology, selected randomly based on a
    diversity of impact factors for the journals, publishers, and experimental method
    reporting guidelines. We attempted to uniquely identify model organisms (mouse,
    rat, zebrafish, worm, fly and yeast), antibodies, knockdown reagents (morpholinos
    or RNAi), constructs, and cell lines. Specific criteria were developed to determine if
    a resource was uniquely identifiable, and included examining relevant repositories


  19. Vasilevsky, Nicole; Kavanagh, David J; Deusen, Amy Van; Haendel, Melissa; Iorns, Elizabeth (2014):
    Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare.
    http://dx.doi.org/10.6084/m9.figshare.987130
    32/127 tools
    6/41 papers


  20. #METHODSMATTER
    Figure 1: frequency fluctuation for site 8992 across bwa versions
    (5.2 through 6.1), comparing default parameters with -n 3 -q 15.
    (Nekrutenko and Taylor, Nature Reviews Genetics, 2012)


  21. Core reproducibility tasks
    1. Capture the precise description of the experiment
    (either as it is being carried out, or after the fact)
    2. Assemble all of the necessary data and software
    dependencies needed by the described experiment
    3. Combine the above to verify the analysis


  22. Recommendations for performing
    reproducible computational research
    Nekrutenko and Taylor, Nature Reviews Genetics, 2012
    Sandve, Nekrutenko, Taylor and Hovig, PLoS Computational Biology 2013


  23. 1. Accept that computation is an integral
    component of biomedical research. Familiarize
    yourself with best practices of scientific computing,
    and implement good computational practices in
    your group


  24. 2. Always provide access to raw primary data


  25. 3. Record versions of all auxiliary datasets used in
    analysis. Many analyses require data, such as
    genome annotations, from external databases that
    change regularly; either record versions or store a
    copy of the specific data used.


  26. 4. Store the exact versions of all software used.
    Ideally archive the software to ensure it can be
    recovered later.


  27. 5. Record all parameters, even if default values are
    used. Default settings can change over time and
    determining what those settings were later can
    sometimes be difficult.
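
    As a concrete illustration of recommendations 4 and 5, here is a minimal sketch (not from
    the talk) of one way to write tool versions and every parameter setting into a JSON manifest
    kept next to the results; the file names and the helper function are hypothetical.

      import json
      import subprocess
      from datetime import datetime, timezone

      def record_run(tool, params, manifest="analysis_manifest.json"):
          """Append the tool name, reported version, and every parameter
          (including defaults) to a JSON manifest stored with the outputs."""
          # Many, but not all, command-line tools report a version; some print
          # it to stderr rather than stdout, so capture both.
          try:
              proc = subprocess.run([tool, "--version"], capture_output=True, text=True)
              out = (proc.stdout or proc.stderr).strip()
              version = out.splitlines()[0] if out else "unknown"
          except FileNotFoundError:
              version = "tool not on PATH"
          entry = {
              "tool": tool,
              "version": version,
              "parameters": params,
              "timestamp": datetime.now(timezone.utc).isoformat(),
          }
          try:
              with open(manifest) as fh:
                  runs = json.load(fh)
          except FileNotFoundError:
              runs = []
          runs.append(entry)
          with open(manifest, "w") as fh:
              json.dump(runs, fh, indent=2)
          return entry

      # Hypothetical example: record a mapping step with its explicit settings.
      record_run("bwa", {"subcommand": "mem", "threads": 4,
                         "reference": "hg38.fa", "reads": "sample_R1.fastq.gz"})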


  28. 6. Record and provide exact versions of any custom
    scripts used.


  29. 7. Do not reinvent the wheel; use existing software
    and pipelines when appropriate to contribute to the
    development of best practices.


  30. Is reproducibility achievable?


  31. A spectrum of solutions
    Analysis environments (Galaxy, GenePattern, …)
    Workflow systems (Taverna, Pegasus, VisTrails, …)
    Notebook style (Jupyter notebook, …)
    Literate programming style (Sweave/knitR, …)
    System level provenance capture (ReproZip, …)
    Complete environment capture (VMs, containers, …)


  32. 2. Galaxy


  33. Goals
    Accessibility: Eliminate barriers for researchers
    wanting to use complex methods, make these
    methods available to everyone
    Transparency: Facilitate communication of analyses
    and results in ways that are easy to understand while
    providing all details
    Reproducibility: Ensure that analysis performed in
    the system can be reproduced precisely and
    practically


  34. Galaxy: accessible analysis system


  35. A free (for everyone) web service integrating a
    wealth of tools, compute resources, terabytes of
    reference data and permanent storage
    Open source software that makes integrating your
    own tools and data and customizing for your own
    site simple
    An open extensible platform for sharing tools,
    datatypes, workflows, ...


  36. Integrating existing tools into a uniform framework
    • Defined in terms of an abstract
    interface (inputs and outputs)
    • In practice, mostly
    command line tools, a
    declarative XML description
    of the interface, how to
    generate a command line
    • Designed to be as easy as
    possible for tool authors, while
    still allowing rigorous
    reasoning
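
    Purely as an illustration of that idea (Galaxy's real tool descriptions are XML files,
    and all names below are invented), a toy sketch of a declarative interface from which
    a command line is generated:

      from string import Template

      # Toy declarative tool description: inputs, outputs, and a command template.
      # Galaxy's actual format is XML; this only illustrates the concept.
      sort_tool = {
          "id": "toy_sort",
          "inputs": {"input1": "tabular", "column": "integer"},
          "outputs": {"out_file1": "tabular"},
          "command": "sort -n -k $column $input1 > $out_file1",
      }

      def render_command(tool, values):
          """Turn the abstract description plus concrete values into a command line."""
          return Template(tool["command"]).substitute(values)

      print(render_command(sort_tool, {"input1": "genes.bed", "column": 5,
                                       "out_file1": "genes_sorted.bed"}))
      # sort -n -k 5 genes.bed > genes_sorted.bed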


  37. Galaxy analysis interface
    • Consistent tool user
    interfaces
    automatically
    generated
    • History system
    facilitates and tracks
    multistep analyses


  38. Automatically and transparently tracks
    every step of every analysis
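
    That recorded provenance is also available programmatically through Galaxy's API; for
    example, a hedged sketch using the BioBlend Python client, where the server URL, API
    key, and history contents are placeholders:

      from bioblend.galaxy import GalaxyInstance

      # Connect to a Galaxy server; the URL and API key are placeholders.
      gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

      # Take the most recently updated history and list its datasets.
      history = gi.histories.get_histories()[0]
      datasets = gi.histories.show_history(history["id"], contents=True)

      # For each dataset, ask Galaxy what produced it: the tool id and the exact
      # parameter values were recorded automatically when the job ran.
      for ds in datasets:
          prov = gi.histories.show_dataset_provenance(history["id"], ds["id"])
          print(ds["name"], prov.get("tool_id"), prov.get("parameters"))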


  39. As well as user-generated
    metadata and annotation...


  40. Galaxy workflow system
    • Workflows can be
    constructed from scratch
    or extracted from existing
    analysis histories
    • Facilitate reuse, as well as
    providing precise
    reproducibility of a
    complex analysis
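
    Workflows extracted this way can also be re-run through the API; a sketch with BioBlend
    in which the workflow file, input dataset, and server details are all placeholders:

      from bioblend.galaxy import GalaxyInstance

      gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

      # Import a workflow previously extracted from a history and exported
      # as a .ga file (the path is a placeholder).
      wf = gi.workflows.import_workflow_from_local_path("rnaseq_workflow.ga")

      # Upload an input dataset into a fresh history and invoke the workflow.
      hist = gi.histories.create_history(name="workflow-rerun")
      upload = gi.tools.upload_file("reads.fastqsanger.gz", hist["id"])
      dataset_id = upload["outputs"][0]["id"]

      invocation = gi.workflows.invoke_workflow(
          wf["id"],
          inputs={"0": {"src": "hda", "id": dataset_id}},  # assumes step 0 is the input step
          history_id=hist["id"],
      )
      print("invocation id:", invocation["id"])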


  42. Example: Workflow for differential expression analysis of RNA-seq using Tophat/
    Cufflinks tools


  44. Galaxy Pages for publishing analysis


  45. Actual histories and datasets directly accessible from the text


  46. Histories can be imported and the exact parameters inspected


  47. Describe analysis tool
    behavior abstractly
    Analysis environment automatically
    and transparently tracks details
    Workflow system for complex analysis,
    constructed explicitly or automatically
    Pervasive sharing, and publication
    of documents with integrated analysis


  48. Visualization and visual analytics


  49. 3. Teaching and Training


  50. Bioinformatics
    Learning Curve
    It’s a long hard climb
    when you are new at it.
    (Dave Clements)


  51. (Dave Clements)
    But some things can be
    avoided when you start:
    ● Linux install
    ● Linux Admin
    ● Command line
    ● Tool installs
    The Galaxy Goal:
    Focus on the questions and
    techniques, rather than on the
    compute infrastructure.


  53. Galaxy Training Network
    https://galaxyproject.org/teach/gtn/


  63. Separation between content and formatting
    (Bérénice Batut)


  64. Materials organized by topic
    Topics for different targeted users
    (Bérénice Batut)


  65. Similar structure, content and formats
    (Bérénice Batut)


  66. Similar structure, content and formats
    (Bérénice Batut)



  68. Similar structure, content and formats
    (Bérénice Batut)


  70. Docker
    Builds on Linux kernel features enabling complete
    isolation from the kernel level up
    Containers — lightweight environments with
    isolation enforced at the OS level, complete control
    over all software
    Adds a complete ecosystem for sharing, versioning,
    managing containers — Docker hub


  71. Galaxy + Docker
    Run every analysis in a clean container — analyses
    are isolated and the environment is the same every time
    Archive that container — containers are lightweight
    thanks to layers — and the analysis can always be
    recreated
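
    A minimal sketch of that pattern with the Docker SDK for Python; the container image,
    paths, and command are placeholders rather than what Galaxy actually launches:

      import docker

      client = docker.from_env()

      # Run one analysis step inside a clean, throwaway container. The image tag
      # is a placeholder; in practice it would pin an exact tool version.
      logs = client.containers.run(
          image="quay.io/biocontainers/samtools:1.9--h8571acd_11",
          command="samtools --version",
          volumes={"/data/run1": {"bind": "/data", "mode": "ro"}},  # inputs mounted read-only
          remove=True,   # discard the container afterwards; the image layers remain reusable
      )
      print(logs.decode())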


  73. Similar structure, content and formats
    (Bérénice Batut)


  74. Work-in-progress: Training materials include
    example workflows and test data; we now have
    tools to automatically test these workflows and
    verify the outputs are correct
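
    One such tool in the Galaxy ecosystem is Planemo; a hedged sketch of driving it from
    Python, where the workflow file name and its companion test definition are assumptions
    about how a training repository is laid out:

      import subprocess

      # Run Planemo's automated tests for a training workflow. By default
      # "planemo test" stands up a temporary Galaxy, runs the workflow on the
      # test data declared alongside it, and checks the declared outputs.
      result = subprocess.run(
          ["planemo", "test", "tutorial_workflow.ga"],
          capture_output=True, text=True,
      )
      print(result.stdout)
      print("outputs verified" if result.returncode == 0 else "test failures detected")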


  79. 4. Deployment Options


  80. Options for Galaxy training include using existing
    instances (of which there are now many), deploying
    your own instance at an institution or course level,
    or having students deploy their own instances.
    All have strengths and weaknesses depending on
    the training goals.


  81. Galaxy main (http://usegalaxy.org): Leveraging National Cyberinfrastructure
    Dedicated resources at TACC (Austin): Galaxy Cluster (Rodeo), 256 cores and
    2 TB memory; Corral/Stockyard, 20 PB disk
    Shared XSEDE resources: Stampede (462,462 cores, 205 TB memory),
    Blacklight and Bridges (PSC, Pittsburgh)
    PTI, IU Bloomington
    funded by the National Science Foundation, Award #ACI-1445604


  82. NSF Cloud for research and education
    Support the “long tail of science” where traditional
    HPC resources have not served well
    Jetstream will enable archival of volumes and virtual
    machines in the IU Scholarworks Archive with DOIs
    attached
    The exact analysis environment, data, and provenance
    become an archived, citable entity
    funded by the National Science Foundation
    Award #ACI-1445604


  83. 90+ Other public
    Galaxy Servers
    bit.ly/gxyServers


  84. Galaxy main is always under significant load, which
    can lead to relatively long wait times, especially for
    larger jobs (sequence mapping, et cetera).
    Can be fine for homework assignments and
    exercises completed offline, but difficult for
    in-person training


  85. Galaxy’s workflow system is robust, flexible,
    and integrates with nearly any environment
    Install locally with many compute environments
    Deploy on a cloud using Cloudman
    Atmosphere
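
    For small course-level deployments, one widely used option (not named on this slide) is
    the community-maintained Galaxy Docker image; a hedged sketch of launching it with the
    Docker SDK for Python, with the port and storage path as assumptions:

      import docker

      client = docker.from_env()

      # Launch a self-contained Galaxy for a class using the community-maintained
      # bgruening/galaxy-stable image; data written under /export persists on the host.
      galaxy = client.containers.run(
          image="bgruening/galaxy-stable",
          ports={"80/tcp": 8080},   # Galaxy web UI at http://localhost:8080
          volumes={"/srv/galaxy_storage": {"bind": "/export", "mode": "rw"}},
          detach=True,
          name="galaxy-training",
      )
      print("Galaxy starting in container", galaxy.short_id)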


  86. CloudMan: General purpose deployment
    manager for any cloud. Cluster and service
    management, auto-scaling
    Cloudbridge: New abstraction library for
    working with multiple cloud APIs
    Genomics Virtual Lab: CloudMan + Galaxy +
    many other common bioinformatics tools
    and frameworks
    Galaxy Cloud


  87. Dedicated Galaxy instances on various Cloud environments


  88. Share a snapshot of this instance
    Complete instances can be archived and shared
    with others
    Reproducibility assistance from CloudMan


  92. Two cloud options:
    • All CloudMan and Galaxy features available; nearly unlimited
    scalability; costs money, but education grants are available
    • All CloudMan and Galaxy features available; fewer node types and
    available resources; educational allocations available through XSEDE
    (Also available on Google Compute and Azure, but some
    features like autoscaling are still in development)
    funded by the National Science Foundation Award #ACI-1445604


  93. Preparing Cloud Instances for Training (WIP)


  94. Similar structure, content and formats
    (Bérénice Batut)


  96. Acknowledgements
    Team: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave
    Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Carl
    Eberhard, Jeremy Goecks, Björn Grüning, Sam Guerler,

    Mo Heydarian, Jennifer Hillman-Jackson, Anton Nekrutenko, Eric Rasche,
    Nicola Soranzo, Marius van den Beek
    Our lab: Enis Afgan, Dannon Baker, Boris Brenerman, Min Hyung Cho, Dave
    Clements, Peter DeFord, Sam Guerler, Nathan Roach, Michael E. G. Sauria,
    German Uritskiy
    Collaborators:

    Craig Stewart and the group

    Ross Hardison and the VISION group

    Victor Corces (Emory), Karen Reddy (JHU)

    Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology)

    Battle, Goff, Langmead, Leek, Schatz, Timp labs (JHU Genomics)
    NHGRI (HG005133, HG004909, HG005542, HG005573, HG006620)

    NIDDK (DK065806) and NSF (DBI 0543285, DBI 0850103)
    funded by the National Science Foundation
    Award #ACI-1445604
