GIGA2: Galaxy – a collaborative platform for accessible, transparent, and reproducible Genomics

Presented at the second workshop of the Global Invertebrate Genomics Alliance.

A remix of my early 2015 talks with a focus on how Galaxy supports collaboration (within and between Galaxy instances), and how it could be used to support a community of researchers collaborating on a set of genome projects – while sharing tools and workflows reproducibly.

James Taylor

March 23, 2015

Transcript

  1. Galaxy – a collaborative platform for accessible, transparent, and reproducible

    Genomics @jxtx / #usegalaxy https://speakerdeck.com/jxtx
  2. A bit of what our lab is interested in...

    Genomics and Gene Regulation:
    ▪ How is control of gene expression encoded in the genome?
    ▪ How can we detect the elements involved?
    ▪ How do they act in a coordinated way in the cell?
    ▪ How do they evolve?
    Data intensive science:
    ▪ How can we support increasingly data intensive and quantitatively complex science?
    ▪ How can we improve the efficiency of scientific discovery?
    ▪ How can we improve the quality of the resulting science?
  3. What is reproducibility? (for computational analyses)

    Reproducibility is not provenance, reusability/generalizability, or correctness. Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced. Yet most published analyses are not reproducible (see e.g. Ioannidis et al. 2009: 6/18 microarray experiments reproducible; Nekrutenko and Taylor 2012: 7/50 resequencing experiments reproducible). Missing software, versions, parameters, data…
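To make "sufficient detail" concrete, here is a minimal Python sketch of a record that pins the four things the slide lists as commonly missing: software, version, parameters, and data. The class, the field names, the tool name `example_mapper`, and the version string are all hypothetical illustrations, not part of Galaxy.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AnalysisStep:
    """Hypothetical record of what is needed to re-run one analysis step."""
    tool: str           # which software was run
    version: str        # the exact version of that software
    parameters: dict    # every parameter value, including defaults
    input_sha256: str   # checksum pinning the exact input data


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def missing_details(step: AnalysisStep) -> list:
    """Return the names of fields left empty, i.e. what a reader could not reproduce."""
    return [name for name, value in asdict(step).items() if not value]


reads = b"ACGTACGT\n"  # stand-in for a real input dataset
complete = AnalysisStep("example_mapper", "0.5.9", {"n": 3, "q": 15}, sha256_of(reads))
incomplete = AnalysisStep("example_mapper", "", {}, "")

print(json.dumps(asdict(complete), indent=2, sort_keys=True))
print(missing_details(incomplete))
```

A methods section that names only the tool corresponds to `incomplete` here: the software is known, but the version, parameters, and data are not.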
  4. Vasilevsky, Nicole; Kavanagh, David J; Deusen, Amy Van; Haendel, Melissa; Iorns, Elizabeth (2014): Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130

    32/127 tools, 6/41 papers
  5. #METHODSMATTER

    [Figure 1: Frequency fluctuation for site 8992. x-axis: software versions 5.2 through 6.1; y-axis: frequency, 0.480 to 0.510; series: Default and -n 3 -q 15] (Nekrutenko and Taylor, Nature Reviews Genetics, 2012)
  6. Galaxy’s motivating questions

    How best can data intensive methods be accessible to scientists? How best to facilitate transparent communication of computational analyses? How best to ensure that analyses are reproducible?
  7. A free (for everyone) web service integrating a wealth of

    tools, compute resources, terabytes of reference data and permanent storage. Open source software that makes integrating your own tools and data and customizing for your own site simple. An open extensible platform for sharing tools, datatypes, workflows, ...
  8. Integrating existing tools into a uniform framework

    • Defined in terms of an abstract interface (inputs and outputs)
    • In practice, mostly command line tools, a declarative XML description of the interface, how to generate a command line
    • Designed to be as easy as possible for tool authors, while still allowing rigorous reasoning
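The idea of a declarative interface plus a command-line template can be sketched in a few lines of Python. This is a toy model under stated assumptions: Galaxy itself uses an XML tool description, and the dictionary layout, the `example_sort` tool, and the `build_command` helper below are hypothetical.

```python
import shlex
from string import Template

# Hypothetical declarative tool description. Galaxy expresses this in XML;
# a plain dict is used here only to keep the sketch self-contained.
TOOL = {
    "id": "example_sort",
    "inputs": {"input": "path", "column": "int"},
    "outputs": {"output": "path"},
    "command": "sort -k $column $input -o $output",
}


def build_command(tool: dict, values: dict) -> str:
    """Fill the tool's command template from user-supplied values.

    Because inputs and outputs are declared up front, missing or undeclared
    values can be rejected before anything is executed.
    """
    declared = set(tool["inputs"]) | set(tool["outputs"])
    missing = declared - set(values)
    if missing:
        raise ValueError(f"missing values for: {sorted(missing)}")
    unknown = set(values) - declared
    if unknown:
        raise ValueError(f"undeclared values: {sorted(unknown)}")
    # Quote every value so file names with spaces stay a single argument.
    quoted = {name: shlex.quote(str(value)) for name, value in values.items()}
    return Template(tool["command"]).substitute(quoted)


print(build_command(TOOL, {"input": "in.tsv", "column": 2, "output": "out.tsv"}))
```

The same declaration is enough to generate a form for the user, which is what makes a consistent, automatically generated interface possible.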
  9. Galaxy analysis interface

    • Consistent tool user interfaces automatically generated
    • History system facilitates and tracks multistep analyses
  10. Galaxy workflow system

    • Workflows can be constructed from scratch or extracted from existing analysis histories
    • Facilitate reuse, as well as providing precise reproducibility of a complex analysis
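Extracting a workflow from a history amounts to replacing concrete datasets with either a workflow input or a reference to the step that produced them. The sketch below illustrates that idea only; the record layout and the `extract_workflow` function are assumptions made for this example, not Galaxy's data model.

```python
def extract_workflow(history):
    """Turn a recorded history into a reusable workflow description.

    Each history record names a tool, its parameters, the dataset ids it read,
    and the dataset id it wrote. Dataset ids are replaced by ("step", i) when
    step i produced them, or ("input", j) when they came from outside.
    """
    produced_by = {}  # dataset id -> index of the step that created it
    inputs = []       # dataset ids not produced by any recorded step
    steps = []
    for index, record in enumerate(history):
        sources = []
        for dataset in record["inputs"]:
            if dataset in produced_by:
                sources.append(("step", produced_by[dataset]))
            else:
                if dataset not in inputs:
                    inputs.append(dataset)
                sources.append(("input", inputs.index(dataset)))
        steps.append({
            "tool": record["tool"],
            "params": dict(record["params"]),  # parameters are kept verbatim
            "sources": sources,
        })
        produced_by[record["output"]] = index
    return {"n_inputs": len(inputs), "steps": steps}


# Hypothetical two-step history: filter reads, then map them to a reference.
history = [
    {"tool": "filter", "params": {"min_quality": 20}, "inputs": ["reads"], "output": "filtered"},
    {"tool": "map", "params": {}, "inputs": ["filtered", "reference"], "output": "alignments"},
]
workflow = extract_workflow(history)
print(workflow)
```

Because tools and parameters are carried over unchanged, re-running the workflow on the original inputs repeats the analysis, while running it on new inputs reuses it.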
  11. More than 70 known public Galaxy servers

    15+ general servers. Domain specific servers including: Ballaxy for structure based computational biology, Cistrome for regulatory sequence analysis, Genomic Hyperbrowser for statistical integration of genomic data, GigaGalaxy for integrating workflows published in GigaScience, Pathogen Portal for comparative analysis of host response to pathogens, ... Dozens of large scale private Galaxy instances.
  12. Ways to use Galaxy

    The public web service at http://usegalaxy.org. Install locally with many compute environments. Deploy on a cloud using Cloudman Atmosphere
  13. Galaxy can scale: for example Galaxy main (Nate Coraor)

    Sites: TACC, Austin; PSC, Pittsburgh; SDSC, San Diego.
    • Galaxy Cluster: 256 cores, 2 TB memory
    • Rodeo: 128 cores, 1 TB memory
    • Corral/Stockyard: 20 PB disk
    • Stampede: 462,462 cores, 205 TB memory
    • Blacklight: 4,096 cores, 32 TB memory
    • Trestles: 10,368 cores, 20.7 TB memory
    Resource labels: dedicated resources; shared resources.
  14. Pulsar: Galaxy job runner that can run almost anywhere (John Chilton)

    No shared filesystem, stages all necessary Galaxy components. [Diagram labels: Blacklight (PSC), Stampede (TACC), Galaxy Server VMs (TACC), Galaxy server processes, messaging server, Pulsar, job control (AMQP), data transfer (HTTPS)]
  15. Bringing it all together: automate all the things!

    Unified Ansible playbook for Galaxy main, cloud, and local deployments
  16. Galaxy Tool Shed (Greg von Kuster, Dave Bouvier)

    [Diagram labels: Galaxy Tool Shed, private Tool Sheds, http://usegalaxy.org, http://usegalaxy.org/community, Galaxies on private clouds, Galaxies on public clouds, private Galaxy installations]
  17. Repositories are owned by the contributor, can contain tools, workflows,

    etc. Backed by version control, a complete version history is retained for everything that passes through the toolshed. Galaxy instance admins can install tools directly from the toolshed using only a web UI. Support for recipes for installing the underlying software that tools depend on (also versioned).
  18. Correctness of all configurations, dependencies automatically verified

    Contributed tools include functional tests which are run in a controlled environment. [Screenshot: "Tool functional tests that passed", an example of the information uploaded to a repository by the install_and_test_tool_shed_repositories test framework for tool functional tests that passed successfully.]
  19. Galaxy toolshed summary

    • Allow users to share tools, datatypes, workflows, sample data, and automated installation scripts for tool dependencies
    • Version controlled
    • Community annotation, rating, comments, review
    • Dependency resolution
    • Integration with Galaxy instances to automate tool installation and updates
  20. Why integrate visualization and analysis? (Jeremy Goecks)

    Individual researchers are now producing their own reference genomes and functional genomic datasets: they need to be able to create custom browsers easily. Working with these datasets involves complex, parameter dependent analyses; interactive visualizations can aid in the analysis process. Galaxy already provides a very sound model for abstracting interfaces to analysis tools. The existing tool framework can be leveraged for visual analytics.
  21. Trackster (Jeremy Goecks)

    Entirely web standards based to support sharing, communicating, and collaborating around visualizations. Dynamic and responsive. Open source and extremely extensible.
  22. Supporting tool developers

    Planemo: command line tools to support Galaxy tool development tasks. Support for GitHub centric workflows in the toolshed to support collaborative tool development. New approaches for dependency management and installation. Citation, credit, and incentives.
  23. Galaxy’s user interface is designed to be simple and intuitive

    for users without informatics expertise. Can we scale this user interface to the analysis of hundreds of samples while maintaining interface idioms and usability?
  24. Users typically use many histories when working with many samples;

    the new multiple history view makes working with 100s of histories easy. (Carl Eberhard)
  25. A not-so-new feature: mapping over multiple datasets

    However, this breaks down for complex combinations of datasets (e.g. many sets of paired end reads, in replicates)
  26. Operations over collections

    For “list” collections, existing tools can automatically be mapped across the entire collection. Existing tools that support multiple inputs and one output act as reducers. Many existing tools just work, but “structured” collections like “paired” need explicit support in tools.
  27. Map/reduce in workflows (John Chilton)

    More powerful workflows: arbitrary # of inputs (... paired). Run applications in parallel (one per input). Merged output for subsequent processing.
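The map and reduce roles described in the two slides above can be illustrated with plain Python lists. This is only a sketch of the pattern: `map_over`, `reduce_collection`, `count_reads`, and `merge_counts` are hypothetical names, and the FASTQ strings are made-up sample data.

```python
def map_over(tool, collection):
    """Apply a single-dataset tool to every element of a list collection."""
    return [tool(element) for element in collection]


def reduce_collection(tool, collection):
    """Hand the whole collection to a tool that takes many inputs and gives one output."""
    return tool(collection)


def count_reads(fastq_text):
    """Single-input tool: count FASTQ records, assuming four lines per record."""
    return len(fastq_text.splitlines()) // 4


def merge_counts(counts):
    """Multi-input tool: merge the per-sample counts into one total."""
    return sum(counts)


# A "list" collection of two samples (made-up data).
samples = [
    "@r1\nACGT\n+\nIIII\n",
    "@r1\nAC\n+\nII\n@r2\nGT\n+\nII\n",
]
per_sample = map_over(count_reads, samples)         # one job per input
total = reduce_collection(merge_counts, per_sample)  # merged output
print(per_sample, total)
```

Neither tool knows about collections: `count_reads` sees one dataset at a time and `merge_counts` sees an ordinary list, which is why many existing tools can be reused unchanged.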
  28. Dataset Collections

    Extremely flexible for grouping collections of complex datasets, can be nested to arbitrary depth, structure is preserved through mapping. More complex reductions, other collection operations in progress. Towards 10,000 samples: workflow scheduling improvements (backgrounding, decision points, streaming).
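What "structure is preserved through mapping" means for nested collections can be shown with a short recursive sketch. Lists stand in for "list" collections and dicts with `forward`/`reverse` keys stand in for "paired" collections; this representation and the `map_nested` helper are assumptions for illustration, not Galaxy's implementation.

```python
def map_nested(tool, collection):
    """Apply a tool to every leaf dataset, keeping the collection's shape."""
    if isinstance(collection, dict):
        return {key: map_nested(tool, value) for key, value in collection.items()}
    if isinstance(collection, list):
        return [map_nested(tool, value) for value in collection]
    return tool(collection)  # a leaf, i.e. a single dataset


# A list of paired collections (made-up sequences).
replicates = [
    {"forward": "ACGT", "reverse": "TTGA"},
    {"forward": "AC", "reverse": "GT"},
]
lengths = map_nested(len, replicates)
print(lengths)
```

The output has the same list-of-pairs shape as the input, so a later step can still tell which result belongs to which read of which replicate.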
  29. For researchers without informatics expertise, the web UI and existing

    tools are often sufficient. For informaticians, Galaxy provides an extensive API and wrappers (e.g. BioBlend). But many users can do some programming and would like the benefits of Galaxy with the flexibility to do some scripting.
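As a rough illustration of the scripting this slide has in mind, the sketch below lists history names over Galaxy's REST API using only the Python standard library. It assumes the server exposes `api/histories` and accepts the API key as a `key` query parameter; check both against the API documentation of the server in use. The helper names are hypothetical, and a wrapper such as BioBlend would normally hide these details.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen


def histories_url(base_url, api_key):
    """Build the URL for listing histories (endpoint and parameter name are assumptions)."""
    endpoint = urljoin(base_url.rstrip("/") + "/", "api/histories")
    return endpoint + "?" + urlencode({"key": api_key})


def list_history_names(base_url, api_key):
    """Fetch the histories visible to this API key and return their names.

    Makes a network request, so it is defined here but not called.
    """
    with urlopen(histories_url(base_url, api_key)) as response:
        return [history["name"] for history in json.load(response)]


# Placeholder key; a real one comes from the user's Galaxy account settings.
print(histories_url("http://usegalaxy.org", "YOUR_API_KEY"))
```

A script like this can sit alongside the web UI: datasets produced by a script appear in the same histories that the user sees in the browser.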
  30. Docker enables interactive environments

    Framework allows spinning up secure* isolated environments that can interact with the Galaxy history through Galaxy’s API. Initial implementation supporting IPython Notebook.
  31. The “Core” Galaxy Team

    Leadership, engineering, and support and outreach: Dan Blankenberg, Nate Coraor, Dannon Baker, Jeremy Goecks, Anton Nekrutenko, James Taylor, Dave Clements, Jennifer Jackson, Carl Eberhard, Dave Bouvier, John Chilton, Sam Guerler, Martin Čech, Enis Afgan, Nitesh Turaga. Supported by the NHGRI (HG005542, HG004909, HG005133, HG006620), NSF (DBI-0850103), Penn State University, Emory University, and the Pennsylvania Department of Public Health.
  32. Extended team and other contributors…

    Björn Grüning (Uni Freiburg), Peter Cock (TJHI), Kyle Ellrott (UCSC), Eric Rasche (CPT), Nicola Soranzo (TGAC), Brad Chapman (HSPH), Nuwan Goonasekera (VeRSI), Yousef Kowsar (VLSCI). And many others who have contributed to the main Galaxy code, tools to the ToolShed, participated in discussions, attended the Galaxy conferences, …
  33. Galaxy is a community!

    Join us on IRC, mailing lists, Galaxy Biostar. Contribute code on Bitbucket, GitHub, or the ToolShed. Join us for a Hackathon or our annual conference. Fifth annual Galaxy Community Conference: Hackathon, training day, and two days of talks.