Talk for XSEDE day at Johns Hopkins

Slide 1

Slide 1 text

Galaxy: Data intensive biology for everyone. www.galaxyproject.org @jxtx / #usegalaxy

Slide 2

Slide 2 text

I ❤ High-Throughput SEQUENCING!

Slide 3

Slide 3 text

High-throughput sequencing is transformative

Slide 4

Slide 4 text

• Resequencing
• De novo genome sequencing
• Direct RNA sequencing
• Open chromatin assays (DNase, FAIRE)
• Transcription factors (ChIP-seq)
• Histone variants (ChIP-seq, MNase-seq)
• Long-range interactions (5C, Hi-C, ChIA-PET)
• Methylation (Bisulfite-seq)

Slide 5

Slide 5 text

High-throughput sequencing is democratizing

Slide 6

Slide 6 text

(http://omicsmaps.com/) It is widely available...

Slide 7

Slide 7 text

...and practically free! (NHGRI / Nature 497:546–547)

Slide 8

Slide 8 text

Making sense of this data requires sophisticated methods
How can we ensure that these methods are accessible to researchers?
...while also ensuring that scientific results remain reproducible?

Slide 9

Slide 9 text

Galaxy: accessible analysis system

Slide 10

Slide 10 text

• A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
• Open source software that makes integrating your own tools and data and customizing for your own site simple
• An open extensible platform for sharing tools, datatypes, workflows, ...

Slide 11

Slide 11 text

• Describe analysis tool behavior abstractly
• Analysis environment automatically and transparently tracks details
• Workflow system for complex analysis, constructed explicitly or automatically
• Pervasive sharing, and publication of documents with integrated analysis

Slide 12

Slide 12 text

Visualization and visual analytics

Slide 13

Slide 13 text

The free service is still the easiest way for users with no informatics infrastructure to analyze their data
How can we possibly sustain this?

Slide 14

Slide 14 text

…ounded with a $1.5 million initial award, NCGAS … dened and optimized genome analysis softw…

Slide 15

Slide 15 text

usegalaxy.org data growth
[Chart: New Data per Month (TB), 0 to 120, from 2008-04 to 2013-08. Annotations: "+128 cores for NGS/multicore jobs", "Data quotas implemented..."]
Nate Coraor

Slide 16

Slide 16 text

usegalaxy.org frustration growth
[Chart: Jobs Deleted Before Run (% of submitted), 0% to 9%, and Total Jobs Completed (count), 0 to 160,000, from 2008-04 to 2013-08]
Nate Coraor

Slide 17

Slide 17 text

How can this possibly scale?
1. Leverage existing public cyber-infrastructure
2. Decentralize, provide many deployment models (cloud and local; not talking about this today)

Slide 18

Slide 18 text

The best place to build this robust entry point is clearly a national supercomputing center
The Texas Advanced Computing Center (TACC) has already built substantial infrastructure in the context of the iPlant project
(Including multi-petabyte online storage and cloud infrastructure, collocated with some of the world's largest HPC machines)
However, the iPlant and TACC cyber-infrastructure was underused; thus we established a collaboration
Since October 2013 Galaxy Main has run from TACC

Slide 19

Slide 19 text

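Transparent Migrations using Galaxy’s Hierarchical Object Store
[Diagram: Galaxy Server Processes read and write data through the object store, which spans Corral, Corral Staging, and Penn State. Read Data: In Corral? In Staging? In PSU? A "Yes" at any step finds the data; a "No" moves on to the next location; a "No" at all three ends in "Object Not Found". Write Data is shown as a separate path.]
Nate Coraor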
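The read path on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Galaxy's actual object store code: the `Backend` and `HierarchicalStore` classes are hypothetical names, each backend is just an in-memory dictionary, and the choice to send writes to the first backend is an assumption, since the slide does not say where new data lands.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


class ObjectNotFound(Exception):
    """Raised when no backend in the hierarchy holds the requested object."""


@dataclass
class Backend:
    # Hypothetical stand-in for one storage location (Corral, staging, PSU).
    name: str
    objects: Dict[str, bytes] = field(default_factory=dict)

    def exists(self, key: str) -> bool:
        return key in self.objects


@dataclass
class HierarchicalStore:
    # Backends are checked in list order, mirroring the slide's decision chain.
    backends: List[Backend]

    def read(self, key: str) -> Tuple[str, bytes]:
        """Return (backend name, data) from the first backend holding the key."""
        for backend in self.backends:
            if backend.exists(key):
                return backend.name, backend.objects[key]
        raise ObjectNotFound(key)

    def write(self, key: str, data: bytes) -> None:
        # Assumption: new data always goes to the first backend.
        self.backends[0].objects[key] = data


corral = Backend("Corral")
staging = Backend("Corral Staging")
psu = Backend("Penn State", {"dataset_1": b"not yet migrated"})
store = HierarchicalStore([corral, staging, psu])

print(store.read("dataset_1")[0])  # served from the last backend in the chain
store.write("dataset_2", b"new data")
print(store.read("dataset_2")[0])  # served from the first backend
```

Because callers only ever see `read` and `write`, data can be copied from one backend to another in the background without any change on the caller's side, which is the sense in which a migration of this kind can be transparent.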

Slide 20

Slide 20 text

Expanding to more XSEDE resources

Slide 21

Slide 21 text

Galaxy can already run jobs on almost any batch system, but most XSEDE resources do not provide direct access for job submission…

Slide 22

Slide 22 text

Pulsar
Galaxy job runner that can run almost anywhere
No shared filesystem, stages all necessary Galaxy components
John Chilton

Slide 23

Slide 23 text

[Diagram: Galaxy Server Processes on the Galaxy Server VMs (TACC) talk to Pulsar on Stampede (TACC) and Pulsar on Blacklight (PSC). Job control goes through a Messaging Server (AMQP); data transfer goes over HTTPS.]
Nate Coraor
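The split between job control and data transfer in this architecture can be illustrated with a small, self-contained simulation. It is only a sketch under stated assumptions and does not use Pulsar's API: in-memory queues stand in for the AMQP messaging server, two dictionaries stand in for the separate filesystems at each site, the `transfer` function stands in for an HTTPS transfer, and all function names are hypothetical. Having the remote side pull its inputs is also an assumption made for the example.

```python
import json
import queue

# Stand-ins for the message broker: small control messages only.
job_queue = queue.Queue()     # Galaxy -> remote site: job requests
status_queue = queue.Queue()  # remote site -> Galaxy: status updates

# Stand-ins for two filesystems that are NOT shared between sites.
galaxy_files = {"input.txt": "acgt\nttga\n"}
remote_files = {}


def transfer(src: dict, dst: dict, name: str) -> None:
    """Copy one file between sites (stand-in for an HTTPS transfer)."""
    dst[name] = src[name]


def submit(job_id: int, input_name: str, output_name: str) -> None:
    """Galaxy side: send a control message describing the job, not the data."""
    message = {"job_id": job_id, "input": input_name, "output": output_name}
    job_queue.put(json.dumps(message))


def remote_worker_step() -> None:
    """Remote side: take one job, stage inputs in, run, stage outputs back."""
    message = json.loads(job_queue.get())
    transfer(galaxy_files, remote_files, message["input"])    # stage in
    # Toy "tool": upper-case the input file.
    remote_files[message["output"]] = remote_files[message["input"]].upper()
    transfer(remote_files, galaxy_files, message["output"])   # stage out
    status_queue.put(json.dumps({"job_id": message["job_id"], "state": "ok"}))


submit(1, "input.txt", "output.txt")
remote_worker_step()
status = json.loads(status_queue.get())
print(status, sorted(galaxy_files))
```

The point of the design is that the control messages stay tiny and can pass through a broker, while bulk data moves over a separate channel, so the remote resource never needs to mount Galaxy's filesystem.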

Slide 24

Slide 24 text

Moving long-running jobs out to XSEDE
• Problem:
  • Jobs wait in the queue for a long time
  • Jobs may fail immediately upon run due to bad parameters
  • Most jobs run quickly! Can we relocate the long ones?
• Goals:
  • Shorten wait from submission to start
  • Allow testing params without waiting
• Solutions:
  • Set a short walltime, resubmit jobs to bigger resources (new code)
  • User selection of resources (Stampede - longer wait to start, but more concurrent jobs allowed)
  • Create “development” queues w/ short walltime
Nate Coraor and John Chilton
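The first solution on this slide (set a short walltime, then resubmit to a bigger resource) can be sketched as a simple escalation loop. This is a toy model under stated assumptions, not the new Galaxy code the slide refers to: the `Destination` class, the destination names, and the walltime limits are all made up for illustration, and the simulation uses the job's true runtime to decide whether the limit is hit, which a real scheduler only learns by running the job.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Destination:
    # Hypothetical job destination with a walltime limit in hours.
    name: str
    walltime_limit: float


def run_with_resubmit(
    job_runtime: float, destinations: List[Destination]
) -> Tuple[Optional[str], List[str]]:
    """Try destinations in order, escalating when the walltime is exceeded.

    Returns the name of the destination where the job finished (or None if it
    exceeded every limit) and the list of destinations that were attempted.
    """
    attempts: List[str] = []
    for destination in destinations:
        attempts.append(destination.name)
        if job_runtime <= destination.walltime_limit:
            return destination.name, attempts
        # Walltime hit: the job is killed here and resubmitted to the next one.
    return None, attempts


# Illustrative values only: a quick local queue first, then a larger resource.
destinations = [
    Destination("short-local", walltime_limit=0.25),
    Destination("stampede", walltime_limit=48.0),
]

print(run_with_resubmit(0.1, destinations))  # quick job finishes on first try
print(run_with_resubmit(5.0, destinations))  # long job escalates once
```

Quick jobs, and jobs that fail right away on bad parameters, get their answer from the short queue; only jobs that genuinely need the time pay the longer wait on the bigger resource.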

Slide 25

Slide 25 text

State of Affairs
• Today
  • Galaxy Test jobs to Stampede and Blacklight
  • Galaxy Main jobs to Stampede
• Up next
  • Galaxy Main jobs to Blacklight
  • Optimize Trinity tools for Blacklight
  • Linking XSEDE allocations to Galaxy accounts

Slide 26

Slide 26 text

Credits
• Texas Advanced Computing Center: Dan Stanzione, Matt Vaughn, Chris Jordan, Mike Packard, Nathaniel Mendoza
• iPlant Collaborative: Stephen Goff
• Pittsburgh Supercomputing Center: Philip Blood, Kathy Benninger, Robert Budden, Jared Yanovich, Josephine Palencia, J. Ray Scott, Joe Lappa
... and the Galaxy Team and community
Galaxy is supported in part by NSF, NHGRI, Pennsylvania Department of Public Health, The Huck Institutes of the Life Sciences, The Institute for CyberScience at Penn State, and Johns Hopkins University

Slide 27

Slide 27 text

Galaxy Team (Engineering, Support and outreach, Leadership): Dan Blankenberg, Nate Coraor, Dannon Baker, Jeremy Goecks, Anton Nekrutenko, James Taylor, Dave Clements, Jennifer Jackson, Carl Eberhard, Dave Bouvier, John Chilton, Sam Guerler, Martin Čech, Enis Afgan, Nick Stoler
Supported by the NHGRI (HG005542, HG004909, HG005133, HG006620), NSF (DBI-0850103), Penn State University, Johns Hopkins University, and the Pennsylvania Department of Public Health