Slide 1

Slide 1 text

@jxtx / #usegalaxy https://speakerdeck.com/jxtx

Slide 2

Slide 2 text

…in which I will not talk about the elephant whale in the room…

Slide 3

Slide 3 text

Galaxy’s motivating questions
How best can data intensive methods be accessible to scientists?
How best to facilitate transparent communication of computational analyses?
How best to ensure that analyses are reproducible*?
*The state of which is frighteningly bad, see doi:10.1038/nrg3305, doi:10.7717/peerj.148

Slide 4

Slide 4 text

Galaxy: accessible analysis system

Slide 5

Slide 5 text

A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
Open source software that makes integrating your own tools and data and customizing for your own site simple
An open extensible platform for sharing tools, datatypes, workflows, ...

Slide 6

Slide 6 text

Describe analysis tool behavior abstractly
Analysis environment automatically and transparently tracks details
Workflow system for complex analysis, constructed explicitly or automatically
Pervasive sharing, and publication of documents with integrated analysis

Slide 7

Slide 7 text

Visualization and visual analytics

Slide 8

Slide 8 text

Wait… what… a free web-service for high throughput sequence data analysis?!

Slide 9

Slide 9 text

Galaxy as a Service
Public web site that anyone can use for free
~1,200 new users, ~20 TB of user data uploaded, and ~180,000 analysis jobs per month
Since 2010, disk quotas (250 GB per user) and compute limits (4 concurrent analyses)
http://usegalaxy.org
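The two limits on this slide can be read as a simple admission check. The sketch below is illustrative only: the names `UserUsage` and `can_start_job` are hypothetical, decimal gigabytes are assumed, and it is not how Galaxy itself enforces quotas.

```python
from dataclasses import dataclass

# Limits quoted on the slide; all names here are hypothetical.
DISK_QUOTA_BYTES = 250 * 10**9  # 250 GB per user (decimal GB assumed)
MAX_CONCURRENT_JOBS = 4         # 4 concurrent analyses per user


@dataclass
class UserUsage:
    disk_bytes: int    # bytes of user data currently stored
    running_jobs: int  # analyses currently running for this user


def can_start_job(usage: UserUsage) -> bool:
    """Return True only if the user is under both the disk and the job limit."""
    under_quota = usage.disk_bytes < DISK_QUOTA_BYTES
    under_job_limit = usage.running_jobs < MAX_CONCURRENT_JOBS
    return under_quota and under_job_limit
```

A user with 1 GB stored and no running jobs passes the check; the same user with four running jobs, or with 251 GB stored, does not.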

Slide 10

Slide 10 text

…ounded with a $1.5 million initial award, NCGAS defined and optimized genome analysis softw…

Slide 11

Slide 11 text

[Figure: usegalaxy.org data growth. Y-axis: new data per month (TB), 0 to 120; x-axis: month, 2008-04 to 2013-08. Annotations: "+128 cores for NGS/multicore jobs", "Data quotas implemented...". Credit: Nate Coraor]

Slide 12

Slide 12 text

[Figure: usegalaxy.org frustration growth. Y-axes: jobs deleted before run (% of submitted), 0% to 9%, and total jobs completed (count), 0 to 160,000; x-axis: month, 2008-04 to 2013-08. Credit: Nate Coraor]

Slide 13

Slide 13 text

Scaling plan one: Decentralize!

Slide 14

Slide 14 text

Local Galaxy Deployment
Galaxy is designed for local installation and customization... just download and run
Pluggable interfaces to compute resources, easily connect to one or more existing clusters
Ideally, allow users to take advantage of whatever computational resources they already have access to.
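One way to picture "pluggable interfaces to compute resources" is a small runner interface with interchangeable backends. This is a minimal sketch under stated assumptions: `JobRunner`, `LocalRunner`, `DryRunClusterRunner`, and `run_job` are hypothetical names for illustration, not Galaxy's actual job runner API, and the cluster backend only formats a message instead of contacting a scheduler.

```python
import subprocess
from abc import ABC, abstractmethod


class JobRunner(ABC):
    """Hypothetical plugin interface; not Galaxy's real runner API."""

    @abstractmethod
    def submit(self, command: list[str]) -> str:
        """Run or queue the command and return a result string."""


class LocalRunner(JobRunner):
    """Runs the command directly on the local machine."""

    def submit(self, command: list[str]) -> str:
        result = subprocess.run(command, capture_output=True, text=True, check=True)
        return result.stdout


class DryRunClusterRunner(JobRunner):
    """Stands in for a cluster backend: it only reports what it would submit."""

    def __init__(self, queue: str) -> None:
        self.queue = queue

    def submit(self, command: list[str]) -> str:
        return f"would submit to queue {self.queue!r}: {' '.join(command)}"


# Destinations map names to runner instances; adding a backend means adding an entry.
RUNNERS: dict[str, JobRunner] = {
    "local": LocalRunner(),
    "cluster": DryRunClusterRunner("ngs"),
}


def run_job(destination: str, command: list[str]) -> str:
    """Dispatch a job to whichever runner is registered for the destination."""
    return RUNNERS[destination].submit(command)
```

The calling code never needs to know which kind of resource sits behind a destination name, which is the property the slide describes.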

Slide 15

Slide 15 text

More than 60 known public Galaxy servers
Ballaxy for structure based computational biology, Cistrome for regulatory sequence analysis, Genomic Hyperbrowser: statistical integration of genomic data, GigaGalaxy: integrating workflows published in GigaScience, Pathogen Portal: comparative analysis of host response to pathogens, ...
Dozens of large scale private Galaxy instances

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

Welcome to Galaxy on the Cloud(s)

Slide 22

Slide 22 text

CloudMan: a general purpose deployment manager for ANY cloud Enis Afgan, Dannon Baker Atmosphere

Slide 23

Slide 23 text

Enis Afgan, Dannon Baker

Slide 24

Slide 24 text

That was a great plan! …but users still want one easy to use gateway

Slide 25

Slide 25 text

Scaling plan two: beg, borrow, steal!

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

Best place to build this robust entry point is clearly a national supercomputing center
The Texas Advanced Computing Center (TACC) has already built substantial infrastructure in the context of the iPlant project (including multi-petabyte online storage and cloud infrastructure, collocated with some of the world's largest HPC machines)
However, the iPlant and TACC cyber-infrastructure was underused; thus we established a collaboration
Since October 2013 Galaxy Main has run from TACC

Slide 28

Slide 28 text

Still not enough!

Slide 29

Slide 29 text

[Diagram labels: Galaxy Server Processes, Galaxy Server VMs (TACC), Messaging Server, Pulsar (twice), Stampede (TACC), Blacklight (PSC); links: job control (AMQP), data transfer (HTTPS)]
Pulsar: Galaxy job runner that can run almost anywhere. No shared filesystem, stages all necessary Galaxy components
John Chilton
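The "no shared filesystem" point implies a stage-in, run, stage-out cycle. The sketch below simulates that cycle locally, with two directories standing in for the Galaxy server and the remote machine and file copies standing in for the HTTPS transfers. The function names are hypothetical and this is not Pulsar's implementation; job control over AMQP is left out entirely.

```python
import shutil
import tempfile
from pathlib import Path


def run_staged(inputs, galaxy_dir: Path, remote_dir: Path, tool):
    """Copy inputs to remote_dir, run the tool there, and copy outputs back."""
    staged = []
    for src in inputs:
        dst = remote_dir / src.name
        shutil.copy(src, dst)  # stage in (a data transfer in the real system)
        staged.append(dst)

    outputs = tool(staged, remote_dir)  # the tool only sees remote-side paths

    collected = []
    for out in outputs:
        dst = galaxy_dir / out.name
        shutil.copy(out, dst)  # stage out
        collected.append(dst)
    return collected


def line_count_tool(paths, workdir: Path):
    """Toy tool: write one 'name<TAB>line count' row per input file."""
    out = workdir / "counts.txt"
    rows = [f"{p.name}\t{len(p.read_text().splitlines())}" for p in paths]
    out.write_text("\n".join(rows))
    return [out]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as g, tempfile.TemporaryDirectory() as r:
        galaxy, remote = Path(g), Path(r)
        reads = galaxy / "reads.txt"
        reads.write_text("a\nb\nc\n")
        results = run_staged([reads], galaxy, remote, line_count_tool)
        print(results[0].read_text())
```

Because the tool receives only paths inside the remote directory, nothing in it depends on the two sides sharing storage.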

Slide 30

Slide 30 text

Sites: TACC, Austin; PSC, Pittsburgh; SDSC, San Diego
Galaxy Cluster: 256 cores, 2 TB memory
Rodeo: 128 cores, 1 TB memory
Corral/Stockyard: 20 PB disk
Stampede: 462,462 cores, 205 TB memory
Blacklight: 4,096 cores, 32 TB memory
Trestles: 10,368 cores, 20.7 TB memory
Legend: dedicated resources, shared resources
Nate Coraor

Slide 31

Slide 31 text

Nate Coraor

Slide 32

Slide 32 text

Result: No waiting for jobs to run on usegalaxy.org! (for now…)

Slide 33

Slide 33 text

Bringing it all together: automate all the things!
Unified Ansible playbook for Galaxy Main, cloud, and local deployments

Slide 34

Slide 34 text

The Galaxy Team
Enis Afgan, Dannon Baker, Dan Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Carl Eberhard, Jeremy Goecks, Sam Guerler, Jen Jackson, Ross Lazarus, Anton Nekrutenko, James Taylor
http://wiki.galaxyproject.org/GalaxyTeam

Slide 35

Slide 35 text

Computational Biology, Genomics, and Bioinformatics at Johns Hopkins University
Biomedical Engineering: Joel Bader, Mike Beer, Rachel Karchin, Steven Salzberg
Computer Science: Alexis Battle, Ben Langmead, Suchi Saria
Applied Math: Don Geman
Oncology: Elana Fertig, Luigi Marchionni, Robert Scharpf, Sarah Wheelan
Medicine: Lilian Florea, Mihaela Pertea, Jiang Qian
Biostatistics: Kasper Hansen, Hongkai Ji, Jeff Leek, Ingo Ruczinski, Cristian Tomasetti
Biology: James Taylor
http://ccb.jhu.edu

Slide 36

Slide 36 text

Bloomberg Distinguished Professorship in Evolutionary Genomics
The Johns Hopkins University is searching for an outstanding senior scientist in the area of Evolutionary Genomics for an endowed chair as a Bloomberg Distinguished Professor. This position will be held jointly between the Department of Biology (Krieger School of Arts and Sciences) and the Institute for Genetic Medicine (JHU School of Medicine).
Tenure-Track Faculty Position in Data Intensive Biology
The Department of Biology seeks to hire a tenure-track Assistant Professor who applies data intensive approaches to investigate biological problems in creative and innovative ways… Candidates who apply computational, quantitative, or data intensive methods in any area of Biology will be considered…
More Info: http://www.bio.jhu.edu/Events/Jobs/Default.aspx
Or contact me: [email protected]

Slide 37

Slide 37 text

John Chilton