Galaxy: Beyond the Genome 2011

James Taylor
September 01, 2011

Galaxy talk given at the second Beyond the Genome meeting, hosted by Genome Biology and Genome Medicine. Two big advances here. First, visual analytics, using Cufflinks as an example. Second, the ToolShed: an early working version is featured heavily in this talk. I gave an embarrassing number of variations on this slide deck through 2011 and 2012.

Transcript

  1. Accessible, transparent, reproducible analysis with Galaxy
    http://usegalaxy.org | http://getgalaxy.org
    Dan Blankenberg, Nate Coraor, Greg von Kuster, Enis Afgan, Dannon Baker, Jeremy Goecks, Anton Nekrutenko, James Taylor, Dave Clements, Jennifer Jackson
    @jxtx / #usegalaxy
    Supported by the NHGRI (HG005542, HG004909, HG005133), NSF (DBI-0850103), Penn State University, Emory University, and the Pennsylvania Department of Public Health
  2. Investigators across nearly all areas of biology can take advantage of high-throughput sequencing
    Data production moving from large community projects to individual labs
    The “democratization of sequencing” has not yet been matched by democratization of analysis infrastructure; the burden is largely on the investigator
    However, making sense of this data requires sophisticated methods
  3. How can these methods be made accessible to scientists?
    How do we facilitate transparent communication of analyses?
    How do we ensure that analyses are reproducible?
  4. Setting a standard for providing detailed methods... (But most papers citing the 1000 genomes don’t use their methods)
  5. What is Galaxy?
    • A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
    • Open source software that makes integrating your own tools and data and customizing for your own site simple
  6. Integrating existing tools into a uniform framework
    • Defined in terms of an abstract interface (inputs and outputs)
    • In practice, mostly command line tools, with a declarative XML description of the interface and of how to generate a command line
    • Designed to be as easy as possible for tool authors, while still allowing rigorous reasoning
  7. Galaxy workflow system
    • Workflows can be constructed from scratch or extracted from existing analysis histories
    • Facilitate reuse, as well as providing precise reproducibility of a complex analysis
  8. Sharing and publishing
    • All analysis components (datasets, histories, workflows) can be shared among Galaxy users and published
    • Pages and annotation allow analysis to be augmented with textual content and provided in the form of an integrated document
  9. The power of Galaxy publishing
    • Galaxy's publishing features facilitate access and reproducibility without any extra legwork
    • One click grants access to the actual analysis you performed to generate your original results
    • Not just data access: the full pipeline
    • Annotate each step
    • Anyone can import your work and immediately reproduce or build on it
  10. Deploying Galaxy: Galaxy main site (http://usegalaxy.org)
    • Public web site, anybody can use
    • ~500 new users per month, 100s of TB of user data, ~150,000 analysis jobs per month, every month is our busiest month ever...
    • Will continue to be maintained and enhanced, but with limits and quotas
    • Centralized solution cannot scale to meet data analysis demands
  11. Local Galaxy instances (http://getgalaxy.org)
    • Galaxy is designed for local installation and customization
    • Just download and run, completely self-contained
    • Easily integrate new tools
    • Easy to deploy and manage on nearly any (unix) system
    • Integrate with production database and web servers
    • Run jobs on existing clusters
  12. Galaxy Cloud (http://usegalaxy.org/cloud)
    • On-demand resource acquisition fits well with the irregular resource needs of many labs working with sequence data
    • Our goal is to approach the ease of use of a “software as a service” solution while maintaining the flexibility and control of an infrastructure based solution
  13. Can use like any other Galaxy instance, with additional compute nodes acquired and released (automatically) in response to usage
  14. Tool installation and configuration, image creation, etc., all completely automated and extensible
    Cloud instances include all tools available in main Galaxy and more
    Same automation approach can be used for configuring tool dependencies for a local Galaxy
    VM image with just tools available, currently at http://s3.amazonaws.com/usegalaxy/UseGalaxy.ova
  15. Many researchers are now producing their own reference genomes and functional genomic datasets: need to be easily able to create custom browsers
    Working with these datasets involves complex, parameter-dependent analyses; interactive visualizations can aid in the analysis process
  16. Visualization and analytics: Galaxy Track Browser
    Entirely web standards based to support sharing, communicating, and collaborating around visualizations
    Dynamic and responsive
    Open source and extremely extensible
  17. Track browser details
    • Client-side rendering using HTML5 standard components, supports responsiveness and dynamism
    • Datasets aggregated or indexed on the server transparently using Galaxy’s existing dataset conversion functionality
    • Multiple reduced representations supported for and datatype / display, detailed data acquired progressively
  18. Modular and extensible
    • Data exchange formats use simple JSON encodings, not tied to Galaxy datatypes
    • Client UI (browser) components not tied to Galaxy UI
    • Rendering components not even tied to the browser, CommonJS modules that can be reused in any JS environment
    • A browser is self contained, can be embedded in other web applications, pages
    • Displays, track types, element rendering, ... all extensible
  19. With increasingly complex tools, more experimentation with parameters is necessary, visual feedback aids exploration
    Galaxy already provides a very sound model for abstracting interfaces to analysis tools
    Existing tool framework can be leveraged for visual analytics
  20. Arbitrary visualization types supported (but not implemented)
    Access to tools and visual analytics specific features (e.g. local computation using global models) can be used by new visualization types
  21. Local and global analysis
    • Visual analytics supported for any existing tool that produces genomic data
    • Attempts to run the tool only on the visible subset of data, user can explicitly run genome-wide
    • For tools that require a global model, allows the global model to be reused
    • For example, here cufflinks has been modified to save global information and reuse it for the local computations
  22. Scaling Galaxy: two distinct problems
    • So much data, not enough infrastructure
    • Solution: encourage local Galaxy instances, cloud Galaxy, support increasingly decentralized model, improve access to existing resources
    • So many tools and workflows, not enough manpower
    • Focus on building infrastructure to allow community to integrate and share tools, workflows, and best practices
  23. Galaxy toolshed vision
    • Allow users to share “suites” containing tools, datatypes, workflows, sample data, and automated installation scripts for tool dependencies
    • Version controlled
    • Community annotation, rating, comments, review
    • Dependency resolution
    • Integration with Galaxy instances to automate tool installation and updates
  24. [Diagram labels: 1, 2, 3, ∞; http://usegalaxy.org; http://usegalaxy.org/community; Galaxies on private clouds; Galaxies on public clouds; private Galaxy installations; Private Tool Sheds; Galaxy Tool Shed]
  25. Some more future directions
    • Capturing and automatically deploying tool dependencies, automatic tool acquisition in Galaxy instances
    • Better interfaces for highly parallel analysis (e.g. running the same workflow across thousands of individuals)
  26. Some more future directions
    • Workflow engine improvements, partial data streaming, combined experimental/computational workflows, robustness
    • Supporting multiple parallelization models simultaneously in the same workflow (not everything can be expressed well in any one model)
    • Computing over distributed data
  27. Develop and deploy: http://getgalaxy.org
    Try it now: http://usegalaxy.org
    Join us, contact me at: [email protected]
    Opportunities for collaboration, positions for postdocs, researchers, software engineers
    http://galaxyproject.org
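
Slide 6 describes tools as command line programs wrapped in a declarative XML description of their inputs, outputs, and how to generate a command line. A minimal sketch of that idea follows. The XML layout, the `first_lines` tool, and the `build_command` helper are all hypothetical and simplified for illustration; they are not Galaxy's actual schema or implementation, and Python's `string.Template` stands in for whatever templating Galaxy really uses.

```python
import shlex
import xml.etree.ElementTree as ET
from string import Template

# Hypothetical, simplified tool description (NOT Galaxy's actual schema).
TOOL_XML = """
<tool id="first_lines" name="Select first lines">
  <command>head -n $count $input &gt; $output</command>
  <inputs>
    <param name="input" type="data"/>
    <param name="count" type="integer" value="10"/>
  </inputs>
  <outputs>
    <data name="output"/>
  </outputs>
</tool>
"""


def build_command(tool_xml, values):
    """Fill the tool's command template from its declared inputs and outputs."""
    tool = ET.fromstring(tool_xml.strip())
    params = list(tool.iter("param"))
    # Declared names form the abstract interface: inputs first, then outputs.
    declared = [p.get("name") for p in params]
    declared += [d.get("name") for d in tool.iter("data")]
    # Start from defaults given in the description, then apply user values.
    merged = {p.get("name"): p.get("value")
              for p in params if p.get("value") is not None}
    merged.update(values)
    missing = [name for name in declared if name not in merged]
    if missing:
        raise ValueError("missing values for: " + ", ".join(missing))
    # Quote every value so file names with spaces stay a single shell word.
    quoted = {name: shlex.quote(str(merged[name])) for name in declared}
    return Template(tool.findtext("command").strip()).substitute(quoted)


print(build_command(TOOL_XML, {"input": "reads.fastq", "output": "first.txt"}))
# prints: head -n 10 reads.fastq > first.txt
```

Because the interface is data rather than code, the same description could in principle drive a web form, a workflow editor, or validation of inputs before a job is run, which is one way to read the "rigorous reasoning" point on that slide.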