
Adventures in Scaling Galaxy at Biological Data Science 2014

James Taylor
November 07, 2014

Presented at the first Cold Spring Harbor Meeting on Biological Data Science: http://meetings.cshl.edu/meetings/2014/data14.shtml


We have developed and continue to support the Galaxy genomic analysis system. Galaxy integrates existing computational tools within a framework that makes it easy for non-experts to perform reproducible analyses on large datasets.

Our main public Galaxy analysis website currently supports more than 30,000 users performing hundreds of thousands of analysis jobs every month. Many academic and commercial institutions around the world operate private Galaxy instances. Our efforts so far have been focused on the development of software that enables any biological researcher to perform complex computational analyses by hiding technical complexities associated with management of underlying programs and high-performance compute infrastructure.

The success of Galaxy has led to a variety of challenges. Meeting the compute demands associated with the main instance of Galaxy has been a significant ongoing effort. Beyond just raw computational complexity, scaling the system to deal with increasingly complex analyses has also created many challenges. Here I will discuss our ongoing efforts to maximize the number and variety of compute platforms that Galaxy can integrate with, along with other new features to help Galaxy scale in various dimensions.

More information on the Galaxy project is available at http://galaxyproject.org, and you can start using Galaxy now at http://usegalaxy.org.


  1. @jxtx / #usegalaxy https://speakerdeck.com/jxtx

  2. …in which I will not talk about the elephant whale in the room…
  3. Galaxy’s motivating questions
    How best can data intensive methods be accessible to scientists?
    How best to facilitate transparent communication of computational analyses?
    How best to ensure that analyses are reproducible*?
    *The state of which is frighteningly bad, see doi:10.1038/nrg3305, doi:10.7717/peerj.148
  4. Galaxy: accessible analysis system

  5. A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
    Open source software that makes integrating your own tools and data and customizing for your own site simple
    An open extensible platform for sharing tools, datatypes, workflows, ...
  6. Describe analysis tool behavior abstractly
    Analysis environment automatically and transparently tracks details
    Workflow system for complex analysis, constructed explicitly or automatically
    Pervasive sharing, and publication of documents with integrated analysis
  7. Visualization and visual analytics

  8. Wait… what… a free web-service for high throughput sequence data

  9. Galaxy as a Service
    Public web site that anyone can use for free
    ~1,200 new users, ~20 TB of user data uploaded, and ~180,000 analysis jobs per month
    Since 2010, disk quotas (250 GB per user) and compute limits (4 concurrent analyses)
    http://usegalaxy.org
  10. Founded with a $1.5 million initial award, NCGAS defined and optimized genome analysis softw…
  11. usegalaxy.org data growth
    [Figure: new data per month (TB), 2008-04 to 2013-08. Annotations: +128 cores for NGS/multicore jobs; data quotas implemented...]
    Nate Coraor
  12. usegalaxy.org frustration growth
    [Figure: jobs deleted before run (% of submitted) and total jobs completed (count), 2008-04 to 2013-08]
    Nate Coraor
  13. Scaling plan one: Decentralize!

  14. Local Galaxy Deployment
    Galaxy is designed for local installation and customization... just download and run
    Pluggable interfaces to compute resources, easily connect to one or more existing clusters
    Ideally, allow users to take advantage of whatever computational resources they already have access to.
  15. More than 60 known public Galaxy servers
    Ballaxy for structure based computational biology, Cistrome for regulatory sequence analysis, Genomic Hyperbrowser: statistical integration of genomic data, GigaGalaxy: integrating workflows published in GigaScience, Pathogen Portal: comparative analysis of host response to pathogens, ...
    Dozens of large scale private Galaxy instances
  21. Welcome to Galaxy on the Cloud(s)

  22. CloudMan: a general purpose deployment manager for ANY cloud
    Enis Afgan, Dannon Baker
    Atmosphere
  23. Enis Afgan, Dannon Baker

  24. That was a great plan! …but users still want one easy-to-use gateway
  25. Scaling plan two: beg, borrow, steal!

  27. Best place to build this robust entry point is clearly a national supercomputing center
    The Texas Advanced Computing Center (TACC) has already built substantial infrastructure in the context of the iPlant project (including multi-petabyte online storage, cloud infrastructure, collocated with some of the world’s largest HPC machines)
    However, the iPlant and TACC cyber-infrastructure was underused; thus we established a collaboration
    Since October 2013 Galaxy Main has run from TACC
  28. Still not enough!

  29. Pulsar: Galaxy job runner that can run almost anywhere. No shared filesystem, stages all necessary Galaxy components
    [Diagram labels: Galaxy Server VMs (TACC); Galaxy Server Processes; Messaging Server; Pulsar; Blacklight (PSC); Stampede (TACC); Job control (AMQP); Data transfer (HTTPS)]
    John Chilton
  30. TACC, Austin; PSC, Pittsburgh; SDSC, San Diego
    • Galaxy Cluster: 256 cores, 2 TB memory
    • Rodeo: 128 cores, 1 TB memory
    • Corral/Stockyard: 20 PB disk
    • Stampede: 462,462 cores, 205 TB memory
    • Blacklight: 4,096 cores, 32 TB memory
    • Trestles: 10,368 cores, 20.7 TB memory
    Legend: dedicated resources, shared resources
    Nate Coraor
  31. Nate Coraor

  32. Result: No waiting for jobs to run on usegalaxy.org! (for

  33. Bringing it all together: automate all the things!
    Unified Ansible playbook for Galaxy main, cloud, and local deployments
  34. The Galaxy Team
    Dannon Baker, Dan Blankenberg, Dave Bouvier, Enis Afgan, Marten Čech, Nate Coraor, Carl Eberhard, Jeremy Goecks, Ross Lazarus, Anton Nekrutenko, James Taylor, Jen Jackson, Sam Guerler, Dave Clements, John Chilton
    http://wiki.galaxyproject.org/GalaxyTeam
  35. Computational Biology, Genomics, and Bioinformatics at Johns Hopkins University http://ccb.jhu.edu
    Biomedical Engineering: Joel Bader, Mike Beer, Rachel Karchin, Steven Salzberg
    Computer Science: Alexis Battle, Ben Langmead, Suchi Saria
    Applied Math: Don Geman
    Oncology: Elana Fertig, Luigi Marchionni, Robert Scharpf, Sarah Wheelan
    Medicine: Lilian Florea, Mihaela Pertea, Jiang Qian
    Biostatistics: Kasper Hansen, Hongkai Ji, Jeff Leek, Ingo Ruczinski, Cristian Tomasetti
    Biology: James Taylor
  36. Bloomberg Distinguished Professorship in Evolutionary Genomics.
    The Johns Hopkins University is searching for an outstanding senior scientist in the area of Evolutionary Genomics for an endowed chair as a Bloomberg Distinguished Professor. This position will be held jointly between the Department of Biology (Krieger School of Arts and Sciences) and the Institute for Genetic Medicine (JHU School of Medicine).
    Tenure-Track Faculty Position in Data Intensive Biology
    The Department of Biology seeks to hire a tenure-track Assistant Professor who applies data intensive approaches to investigate biological problems in creative and innovative ways… Candidates who apply computational, quantitative, or data intensive methods in any area of Biology will be considered…
    More Info: http://www.bio.jhu.edu/Events/Jobs/Default.aspx
    Or contact me: [email protected]
  37. John Chilton
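
Slide 29 describes Pulsar as a job runner that needs no shared filesystem: job control travels over a message queue (AMQP) and data moves separately (HTTPS). The general pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not Pulsar's actual API: an in-process `queue.Queue` stands in for the AMQP broker, `shutil.copy` stands in for HTTPS transfer, and the names `submit_job`, `run_next_job`, and `demo` are hypothetical.

```python
import queue
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def submit_job(job_queue, job_id, command, inputs, outputs):
    """Galaxy side: publish a job description (stand-in for an AMQP message)."""
    job_queue.put({
        "id": job_id,
        "command": command,
        "inputs": [str(p) for p in inputs],
        "outputs": list(outputs),
    })


def run_next_job(job_queue, remote_root, results_dir):
    """Remote side: stage inputs in, run the command, stage outputs back."""
    job = job_queue.get()
    workdir = Path(remote_root) / job["id"]
    workdir.mkdir(parents=True)

    # Stage in: copy each input into the job's private working directory
    # (stand-in for an HTTPS download; no shared filesystem is assumed).
    for src in job["inputs"]:
        shutil.copy(src, workdir / Path(src).name)

    completed = subprocess.run(job["command"], cwd=workdir)

    # Stage out: copy declared outputs back to the submitting side
    # (stand-in for an HTTPS upload).
    staged = []
    for name in job["outputs"]:
        dest = Path(results_dir) / name
        shutil.copy(workdir / name, dest)
        staged.append(dest)
    return completed.returncode, staged


def demo():
    """Run one toy job that counts the lines in a staged input file."""
    with tempfile.TemporaryDirectory() as galaxy_dir, \
            tempfile.TemporaryDirectory() as remote_dir:
        reads = Path(galaxy_dir) / "reads.txt"
        reads.write_text("ACGT\nTTGA\n")

        # Hypothetical "tool": a Python one-liner writing the line count.
        tool = (
            "from pathlib import Path; "
            "n = len(Path('reads.txt').read_text().splitlines()); "
            "Path('count.txt').write_text(str(n))"
        )
        jobs = queue.Queue()
        submit_job(jobs, "job-1", [sys.executable, "-c", tool],
                   [reads], ["count.txt"])
        code, staged = run_next_job(jobs, remote_dir, galaxy_dir)
        return code, staged[0].read_text()
```

Calling `demo()` runs the toy job end to end. The point of the design is that the remote side only ever touches its own working directory, so the same runner could sit on any machine that can reach the queue and the transfer endpoint.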