GIGA2: Galaxy – a collaborative platform for accessible, transparent, and reproducible Genomics

Presented at the second workshop of the Global Invertebrate Genomics Alliance.

A remix of my early 2015 talks with a focus on how Galaxy supports collaboration (within and between Galaxy instances), and how it could be used to support a community of researchers collaborating on a set of genome projects – while sharing tools and workflows reproducibly.


James Taylor

March 23, 2015


  1. Galaxy – a collaborative platform for accessible, transparent, and reproducible Genomics

    @jxtx / #usegalaxy https://speakerdeck.com/jxtx
  2. A bit of what our lab is interested in...

    Genomics and Gene Regulation: ▪ How is control of gene expression encoded in the genome? ▪ How can we detect the elements involved? ▪ How do they act in a coordinated way in the cell? ▪ How do they evolve? Data intensive science: ▪ How can we support increasingly data intensive and quantitatively complex science? ▪ How can we improve the efficiency of scientific discovery? ▪ How can we improve the quality of the resulting science?
  3. A continuing crisis in genomics research: reproducibility

  4. What is reproducibility? (for computational analyses)

    Reproducibility is not provenance, reusability/generalizability, or correctness. Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced. Yet most published analyses are not reproducible (see e.g. Ioannidis et al. 2009: 6/18 microarray experiments reproducible; Nekrutenko and Taylor 2012: 7/50 resequencing experiments reproducible). Missing software, versions, parameters, data…
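The four items this slide lists as commonly missing (software, versions, parameters, data) can be written down as one small record per analysis step. The following is a minimal sketch of that idea, not Galaxy's own data model; the function names, the tool name, and all values are hypothetical.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Return a hex SHA-256 digest that pins down the exact input data."""
    return hashlib.sha256(data).hexdigest()


def describe_step(tool: str, version: str, parameters: dict, input_data: bytes) -> dict:
    """Record software, version, parameters, and a fingerprint of the data."""
    return {
        "tool": tool,
        "version": version,
        # Sort parameters so two equal settings always serialize identically.
        "parameters": dict(sorted(parameters.items())),
        "input_sha256": sha256_of(input_data),
    }


# Hypothetical example values, not taken from any published analysis.
record = describe_step(
    tool="example-aligner",
    version="0.0.1",
    parameters={"q": 15, "n": 3},
    input_data=b"ACGT\n",
)
print(json.dumps(record, indent=2))
```

If any one of these fields is left out of a methods section, a reader can no longer tell which of several plausible runs produced the reported result.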
  5. Reproducibility Project: Cancer Biology

    Independently replicating 50 “high-impact” cancer studies from 2010-2012 (https://osf.io/e81xl/wiki/home/)
  6. Vasilevsky, Nicole; Kavanagh, David J; Deusen, Amy Van; Haendel, Melissa; Iorns, Elizabeth (2014): Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130

    32/127 tools; 6/41 papers
  7. #METHODSMATTER

    [Figure 1: Frequency fluctuation for site 8992 across software versions (5.2 to 6.1), comparing default settings with -n 3 -q 15; y-axis range 0.480 to 0.510] (Nekrutenko and Taylor, Nature Reviews Genetics, 2012)
  8. Galaxy’s motivating questions

    How best can data intensive methods be accessible to scientists? How best to facilitate transparent communication of computational analyses? How best to ensure that analyses are reproducible?
  9. Galaxy: accessible analysis system

  10. A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage

    Open source software that makes integrating your own tools and data and customizing for your own site simple. An open extensible platform for sharing tools, datatypes, workflows, ...
  11. Integrating existing tools into a uniform framework

    • Defined in terms of an abstract interface (inputs and outputs) • In practice, mostly command line tools, with a declarative XML description of the interface and of how to generate a command line • Designed to be as easy as possible for tool authors, while still allowing rigorous reasoning
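To make the idea of a declarative tool description concrete, here is a minimal sketch that reads a simplified XML description and generates a command line from it. The XML is a cut-down illustration loosely shaped like a Galaxy tool file, not the full schema; `example_filter` is a hypothetical tool, and Python's `string.Template` stands in for the templating Galaxy itself uses.

```python
import shlex
import xml.etree.ElementTree as ET
from string import Template

# Hypothetical, simplified tool description: declared inputs and outputs
# plus a template saying how to turn them into a command line.
TOOL_XML = """
<tool id="example_filter" name="Example filter" version="0.1">
  <command>example_filter --min-length $min_length --input $input --output $output</command>
  <inputs>
    <param name="input" type="data"/>
    <param name="min_length" type="integer" value="50"/>
  </inputs>
  <outputs>
    <data name="output"/>
  </outputs>
</tool>
"""


def build_command(tool_xml: str, values: dict) -> list:
    """Return the argv list for one tool run, using only declared names."""
    root = ET.fromstring(tool_xml.strip())
    params = root.findall("./inputs/param")
    declared = [p.get("name") for p in params]
    declared += [d.get("name") for d in root.findall("./outputs/data")]

    unknown = set(values) - set(declared)
    if unknown:
        raise ValueError(f"undeclared parameters: {sorted(unknown)}")

    # Start from declared defaults, then apply the caller's values.
    settings = {p.get("name"): p.get("value") for p in params if p.get("value") is not None}
    settings.update(values)
    missing = [name for name in declared if name not in settings]
    if missing:
        raise ValueError(f"missing parameters: {missing}")

    # Quote every value so paths with spaces stay a single argument.
    quoted = {name: shlex.quote(str(value)) for name, value in settings.items()}
    template = Template(root.findtext("command").strip())
    return shlex.split(template.substitute(quoted))


command = build_command(TOOL_XML, {"input": "reads 1.fasta", "output": "out.fasta"})
print(command)
```

Because the interface is data rather than code, the same description can also drive a generated web form and a record of the exact parameters used.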
  12. Galaxy analysis interface

    • Consistent tool user interfaces automatically generated • History system facilitates and tracks multistep analyses
  13. Automatically and transparently tracks  every step of every analysis

  14. As well as user-generated  metadata and annotation...

  15. Galaxy workflow system

    • Workflows can be constructed from scratch or extracted from existing analysis histories • Facilitate reuse, as well as providing precise reproducibility of a complex analysis
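Extracting a workflow from a history can be pictured as keeping the tools and parameters while replacing the concrete uploaded datasets with open inputs. The sketch below is a toy model under that assumption; the history layout, the tool registry, and both toy tools are hypothetical and much simpler than Galaxy's real structures.

```python
def filter_min_length(sequences, min_length):
    """Toy tool: keep sequences at least min_length long."""
    return [s for s in sequences if len(s) >= min_length]


def count(sequences):
    """Toy tool: count the sequences."""
    return len(sequences)


# Hypothetical tool registry standing in for installed tools.
TOOLS = {"filter_min_length": filter_min_length, "count": count}


def extract_workflow(history):
    """Turn a recorded history into a reusable workflow description."""
    # Items without a producing tool were uploaded: they become workflow inputs.
    inputs = [item["id"] for item in history if item["tool"] is None]
    steps = [
        {
            "tool": item["tool"],
            "parameters": dict(item["parameters"]),
            "inputs": list(item["inputs"]),
            "output": item["id"],
        }
        for item in history
        if item["tool"] is not None
    ]
    return {"inputs": inputs, "steps": steps}


def run_workflow(workflow, datasets):
    """Replay the recorded steps, in order, on new input datasets."""
    missing = [i for i in workflow["inputs"] if i not in datasets]
    if missing:
        raise ValueError(f"missing workflow inputs: {missing}")
    values = dict(datasets)
    for step in workflow["steps"]:
        args = [values[i] for i in step["inputs"]]
        values[step["output"]] = TOOLS[step["tool"]](*args, **step["parameters"])
    return values


history = [
    {"id": 1, "tool": None, "parameters": {}, "inputs": []},
    {"id": 2, "tool": "filter_min_length", "parameters": {"min_length": 3}, "inputs": [1]},
    {"id": 3, "tool": "count", "parameters": {}, "inputs": [2]},
]
workflow = extract_workflow(history)
result = run_workflow(workflow, {1: ["ACGT", "AC", "ACG"]})
print(result[3])
```

The same `workflow` value can be replayed on different data, which is the reuse the slide refers to, while the stored parameters give the precise record of what was run.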
  16. Example: Workflow for differential expression analysis of RNA-seq using Tophat/Cufflinks tools
  17. Collaboration within a Galaxy instance

  18. Everything in Galaxy can be shared or published

  20. Data Libraries for organizing shared data

  21. …with role-based access controls

  22. Galaxy Pages for publishing analysis

  23. Actual histories and datasets directly accessible from the text

  24. Histories can be imported and the exact parameters inspected

  25. Workflows and other entities can also be embedded

  26. And imported for inspection, verification, and reuse

  27. The Galaxy ecosystem

  28. More than 70 known public Galaxy servers

    15+ general servers. Domain specific servers including: Ballaxy for structure based computational biology, Cistrome for regulatory sequence analysis, Genomic Hyperbrowser: statistical integration of genomic data, GigaGalaxy: integrating workflows published in GigaScience, Pathogen Portal: comparative analysis of host response to pathogens, ... Dozens of large scale private Galaxy instances
  38. Ways to use Galaxy

    • The public web service at http://usegalaxy.org • Install locally with many compute environments • Deploy on a cloud using Cloudman • Atmosphere
  39. Galaxy can scale: for example Galaxy main (Nate Coraor)

    TACC, Austin: Galaxy Cluster (256 cores, 2 TB memory); Rodeo (128 cores, 1 TB memory); Corral/Stockyard (20 PB disk); Stampede (462,462 cores, 205 TB memory). PSC, Pittsburgh: Blacklight (4,096 cores, 32 TB memory). SDSC, San Diego: Trestles (10,368 cores, 20.7 TB memory). Diagram legend: dedicated resources; shared resources.
  40. Pulsar: Galaxy job runner that can run almost anywhere (John Chilton)

    No shared filesystem, stages all necessary Galaxy components. [Diagram labels: Galaxy Server VMs (TACC) with Galaxy Server Processes and a Messaging Server; Pulsar on Stampede (TACC); Pulsar on Blacklight (PSC); job control (AMQP); data transfer (HTTPS)]
  41. Bringing it all together: automate all the things!

    Unified Ansible playbook for Galaxy main, cloud, and local deployments
  42. Collaboration among Galaxy instances

  43. Galaxy Tool Shed (Greg von Kuster, Dave Bouvier)

    [Diagram labels: Galaxy Tool Shed; Private Tool Sheds; http://usegalaxy.org; http://usegalaxy.org/community; Galaxies on private clouds; Galaxies on public clouds; private Galaxy installations; 1, 2, 3, ... ∞]
  45. Repositories are owned by the contributor, can contain tools, workflows, etc.

    • Backed by version control, a complete version history is retained for everything that passes through the toolshed • Galaxy instance admins can install tools directly from the toolshed using only a web UI • Support for recipes for installing the underlying software that tools depend on (also versioned)
  53. Ensuring tool quality in the toolshed

  54. Correctness of all configurations, dependencies automatically verified

    Contributed tools include functional tests which are run in a controlled environment. [Screenshot: "Tool functional tests that passed", an example of the information uploaded to a repository by the install_and_test_tool_shed_repositories test framework for tool functional tests that passed successfully.]
  55. “Intergalactic Utilities Commission”

    Rotating committee that reviews tools for quality and provides sign-offs and ratings. [Screenshot: Reviewed repositories]
  56. “devteam” tools

  57. “iuc” tools

  58. Galaxy toolshed summary

    • Allow users to share tools, datatypes, workflows, sample data, and automated installation scripts for tool dependencies • Version controlled • Community annotation, rating, comments, review • Dependency resolution • Integration with Galaxy instances to automate tool installation and updates
  59. Visualization and analytics Jeremy Goecks

  60. Why integrate visualization and analysis? (Jeremy Goecks)

    • Individual researchers are now producing their own reference genomes and functional genomic datasets: they need to be able to create custom browsers easily • Working with these datasets involves complex, parameter dependent analyses; interactive visualizations can aid in the analysis process • Galaxy already provides a very sound model for abstracting interfaces to analysis tools • Existing tool framework can be leveraged for visual analytics
  61. Trackster (Jeremy Goecks)

    • Entirely web standards based to support sharing, communicating, and collaborating around visualizations • Dynamic and responsive • Open source and extremely extensible
  62. Dynamic filtering on element properties (here, FPKM for putative transcripts)

  63. Modifying Cufflinks parameters and locally reassembling

  64. Exploring parameter spaces efficiently Jeremy Goecks

  67. Easily build browsers and perform visual analysis for novel genomes

    [Screenshot labels: Custom build; Visualize in Trackster]
  68. Toward a pluggable framework for interactive web-based visualizations...

  69. PhyloViz from Google Summer of Code student Tomithy Too

  70. Circster, interactive circos-style plots

  71. Visualization framework: Charts plugin

  74. Some current directions

  75. Supporting tool developers

    • Planemo: command line tools to support Galaxy tool development tasks • Support for GitHub centric workflows in the toolshed to support collaborative tool development • New approaches for dependency management and installation • Citation, credit, and incentivization
  76. User interfaces for large-scale data analysis

  77. Galaxy’s user interface is designed to be simple and intuitive for users without informatics expertise

    Can we scale this user interface to the analysis of hundreds of samples while maintaining interface idioms and usability?
  78. Users typically use many histories when working with many samples

    New multiple history view makes working with 100s of histories easy (Carl Eberhard)
  79. A not-so-new feature: mapping over multiple datasets

    However, this breaks down for complex combinations of datasets (e.g. many sets of paired end reads, in replicates)
  80. Dataset Collections: organize user data (John Chilton and Carl Eberhard)

    [Screenshot labels: Individual Datasets; Collection; Collection Contents]
  81. Operations over collections

    • For “list” collections, existing tools can automatically be mapped across the entire collection • Existing tools that support multiple inputs and one output act as reducers • Many existing tools just work, but “structured” collections like “paired” need explicit support in tools
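The two operations on this slide can be sketched in a few lines: a single-input tool is applied to every element of a "list" collection, and a multi-input tool reduces the collection to one output. This is an illustration with hypothetical names and in-memory data, not Galaxy's implementation.

```python
def map_over(collection, tool, **parameters):
    """Run a single-input tool once per element, keeping element identifiers."""
    return {name: tool(dataset, **parameters) for name, dataset in collection.items()}


def reduce_collection(collection, tool):
    """Hand all elements to one multi-input tool, producing a single output."""
    return tool(list(collection.values()))


def count_reads(reads, min_length=0):
    """Toy single-input tool: count reads at least min_length long."""
    return sum(1 for read in reads if len(read) >= min_length)


# Hypothetical "list" collection: one dataset of reads per sample.
samples = {
    "sample_a": ["ACGT", "AC"],
    "sample_b": ["ACGTA"],
}

per_sample = map_over(samples, count_reads, min_length=3)
total = reduce_collection(per_sample, sum)
print(per_sample, total)
```

Neither `count_reads` nor `sum` knows anything about collections, which mirrors the point that many existing tools work unchanged.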
  82. Map/reduce in workflows: more powerful workflows (John Chilton)

    Arbitrary # of inputs (... paired). Run applications in parallel (one per input). Merged output for subsequent processing.
  83. Enhanced Tuxedo Suite Workflow

    RNA-Seq workflow based on the Tuxedo suite. John Chilton
  84. Dataset Collections

    • Extremely flexible for grouping collections of complex datasets, can be nested to arbitrary depth, structure is preserved through mapping • More complex reductions, other collection operations in progress • Towards 10,000 samples: workflow scheduling improvements (backgrounding, decision points, streaming)
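"Structure is preserved through mapping" can be illustrated with a recursive map over a nested collection: the output has the same shape as the input, with each dataset replaced by the tool's result. A minimal sketch, assuming collections are modeled as nested dictionaries and datasets as plain strings (both hypothetical simplifications):

```python
def map_nested(collection, tool):
    """Apply tool to every leaf dataset, keeping the nesting unchanged."""
    if isinstance(collection, dict):
        return {name: map_nested(child, tool) for name, child in collection.items()}
    return tool(collection)


# Hypothetical list of paired datasets: replicates, each with two mates.
replicates = {
    "replicate_1": {"forward": "ACGT", "reverse": "TTGCA"},
    "replicate_2": {"forward": "GG", "reverse": "CCC"},
}

lengths = map_nested(replicates, len)
print(lengths)
```

The same function handles any nesting depth, since it only distinguishes containers from leaves.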
  85. Interactive programming environments Björn Grüning, Eric Rasche, John Chilton

  86. For researchers without informatics expertise, the web UI and existing tools are often sufficient

    For informaticians, Galaxy provides an extensive API and wrappers (e.g. BioBlend). But many users can do some programming and would like the benefits of Galaxy with the flexibility to do some scripting
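As a rough sketch of what scripting against the API can look like, the snippet below builds a request for a user's histories using only the Python standard library. It assumes the `/api/histories` endpoint and the `key` query parameter of Galaxy's REST API; the server URL and the API key are placeholders, and a wrapper library such as BioBlend would normally hide these details.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen


def histories_url(base_url: str, api_key: str) -> str:
    """Build the URL that lists the histories visible to this API key."""
    endpoint = urljoin(base_url.rstrip("/") + "/", "api/histories")
    return endpoint + "?" + urlencode({"key": api_key})


def list_history_names(base_url: str, api_key: str) -> list:
    """Fetch the histories and return their names (needs a reachable server)."""
    with urlopen(histories_url(base_url, api_key)) as response:
        return [history["name"] for history in json.load(response)]


# Placeholder server and key: nothing is contacted here, only the URL is built.
url = histories_url("https://galaxy.example.org", "YOUR_API_KEY")
print(url)
```

Calling `list_history_names` with a real server URL and key would perform the request; it is left uncalled here so the sketch runs without network access.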
  87. Docker enables interactive environments

    Framework allows spinning up secure*, isolated environments that can interact with the Galaxy history through Galaxy’s API. Initial implementation supporting IPython Notebook
  88. Example from John Chilton

  92. The “Core” Galaxy Team

    Team members (grouped on the slide under Engineering, Support and outreach, and Leadership): Dan Blankenberg, Nate Coraor, Dannon Baker, Jeremy Goecks, Anton Nekrutenko, James Taylor, Dave Clements, Jennifer Jackson, Carl Eberhard, Dave Bouvier, John Chilton, Sam Guerler, Martin Čech, Enis Afgan, Nitesh Turaga. Supported by the NHGRI (HG005542, HG004909, HG005133, HG006620), NSF (DBI-0850103), Penn State University, Emory University, and the Pennsylvania Department of Public Health
  93. Extended team and other contributors…

    Björn Grüning (Uni Freiburg), Peter Cock (TJHI), Kyle Ellrott (UCSC), Eric Rasche (CPT), Nicola Soranzo (TGAC), Brad Chapman (HSPH), Nuwan Goonasekera (VeRSI), Yousef Kowsar (VLSCI). And many others who have contributed to the main Galaxy code, tools to the ToolShed, participated in discussions, attended the Galaxy conferences, …
  94. Galaxy is a community!

    • Join us on IRC, mailing lists, Galaxy Biostar • Contribute code on Bitbucket, GitHub, or the ToolShed • Join us for a Hackathon or our annual conference • Fifth annual Galaxy Community Conference: Hackathon, training day, and two days of talks