Data intensive biology in the cloud: instrumenting ALL the things by Titus Brown

PyCon 2014
April 12, 2014

Cloud computing offers some great opportunities for science, but most cloud computing platforms are I/O and memory limited, and hence are poor matches for data-intensive computing. After 4 years of research software development we are now instrumenting and benchmarking our analysis pipelines; numbers, lessons learned, and future plans will be discussed. Everything is open source.

Transcript

  1. Instrument ALL the things: Studying data-intensive workflows in the
     clowd. C. Titus Brown, Michigan State University. (See blog post.)
  2. A few upfront definitions.
     Big Data, n: whatever is still inconvenient to compute on.
     Data scientist, n: a statistician who lives in San Francisco.
     Professor, n: someone who writes grants to fund people who do the
     work (cf. Fernando Perez).
     I am a professor (not a data scientist) who writes grants so that
     others can do data-intensive biology.
  3. This talk dedicated to Terry Peppers. "Titus, I no longer understand
     what you actually do…" "Daddy, what do you do at work!?"
  4. I assemble puzzles for a living. Well, ok, I strategize about solving
     multi-dimensional puzzles with billions of pieces and no box.
  5. Three bioinformatic strategies in use:
     • Greedy: "if the piece sorta fits…"
     • N²: "Do these two pieces match? How about this next one?"
     • The Dutch approach.
  6. The Dutch Solution. Algorithmically:
     • Linear in time with the number of pieces :) (way better than N²!)
     • Linear in memory with the volume of data :( (this is due to errors
       in the digitization process).
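
     (The "Dutch approach" above is a nod to de Bruijn graph assembly,
     after the Dutch mathematician N. G. de Bruijn. A toy Python sketch of
     the idea follows; the reads and k are made up, and this is nothing
     like khmer's actual implementation.)

         # Toy de Bruijn graph: one pass over the reads, so time is linear
         # in the number of pieces. Memory is linear in the number of
         # *distinct* k-mers, and sequencing errors add spurious k-mers --
         # hence the sad face about memory.
         from collections import defaultdict

         K = 5  # illustrative; real assemblies use k around 21-31+

         def kmers(seq, k=K):
             for i in range(len(seq) - k + 1):
                 yield seq[i:i + k]

         def build_graph(reads):
             graph = defaultdict(set)  # (k-1)-mer prefix -> suffixes
             for read in reads:
                 for km in kmers(read):
                     graph[km[:-1]].add(km[1:])
             return graph

         graph = build_graph(["ATGGCGTGCA", "GGCGTGCAAT"])  # made-up reads
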
  7. Our research challenges:
     1. It costs only $10k & 1 week to generate enough sequence data that
        no commodity computer (and few supercomputers) can assemble it.
     2. Hundreds -> thousands of such data sets are being generated each
        year.
  8. Our research (i) - CS:
     • A streaming lossy compression approach that discards pieces we've
       seen before.
     • Low-memory probabilistic data structures.
     (…see the PyCon 2013 talk)
     => RAM now scales better: O(I) where I << N
     (I is sample-dependent, but typically I < N/20)
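
     (The streaming lossy compression here is digital normalization: a
     read is kept only if its k-mers haven't already been seen often
     enough, with counts held in a fixed-size probabilistic structure. A
     minimal sketch, assuming a count-min sketch and a median-count rule;
     the parameters are illustrative and this is not khmer's code.)

         # Digital-normalization sketch: constant-memory k-mer counting
         # (count-min sketch; counts can only overestimate), discarding
         # reads whose median k-mer count already exceeds the cutoff.
         import hashlib
         from statistics import median

         class CountMin:
             def __init__(self, ntables=4, width=2**16):
                 self.width = width
                 self.tables = [[0] * width for _ in range(ntables)]

             def _slots(self, item):
                 for i in range(len(self.tables)):
                     h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
                     yield i, int(h, 16) % self.width

             def add(self, item):
                 for i, j in self._slots(item):
                     self.tables[i][j] += 1

             def count(self, item):
                 return min(self.tables[i][j] for i, j in self._slots(item))

         def diginorm(reads, k=20, cutoff=20):
             sketch = CountMin()
             for read in reads:
                 kms = [read[i:i + k] for i in range(len(read) - k + 1)]
                 if kms and median(sketch.count(km) for km in kms) < cutoff:
                     for km in kms:
                         sketch.add(km)
                     yield read  # keep: read still adds new information
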
  9. Our research (ii) - approach:
     • Open source, open data, open science, and reproducible
       computational research:
       – GitHub
       – automated testing, CI, & literate reSTing
       – blogging, Twitter
       – IPython Notebook for data analysis and figures
     • Protocols for assembling in the cloud.
  10. Molgula oculata, Molgula occulta, Molgula oculata. Real solutions,
      tackling squishy biology! (Elijah Lowe & Billie Swalla)
  11. Doing things right => #awesomesauce:
      • Protocols in English for running analyses in the cloud
      • Literate reSTing => shell scripts (see the sketch below)
      • Tool competitions
      • Benchmarking
      • Education
      • Acceptance tests
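
      ("Literate reSTing" means the shell commands live in the reST
      tutorials themselves, and a script pulls them out so the same text
      serves as documentation, shell script, and acceptance test. The
      real tool is ged-lab/literate-resting; this bare-bones extraction
      is just illustrative.)

          # Bare-bones sketch: pull the indented literal blocks (the ones
          # introduced by a trailing '::') out of a reST tutorial and emit
          # them as a shell script. The real literate-resting does more.
          import sys

          def extract_commands(lines):
              in_block = False
              for line in lines:
                  if line.rstrip().endswith("::"):
                      in_block = True
                  elif in_block and line.startswith("   "):
                      yield line.strip()   # indented: inside the block
                  elif in_block and line.strip():
                      in_block = False     # dedented text ends the block

          if __name__ == "__main__":
              print("#!/bin/bash")
              with open(sys.argv[1]) as tutorial:
                  for cmd in extract_commands(tutorial):
                      print(cmd)
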
  12. Benchmarking strategy:
      • Rent a bunch of cloud VMs from Amazon and Rackspace.
      • Extract commands from tutorials using literate-resting.
      • Use 'sar' (from the sysstat package) to sample CPU, RAM, and disk
        I/O (see the sketch below).
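
      (In practice the sampling can be as simple as running sar in the
      background for the life of the pipeline. A sketch of such a
      harness, assuming a stock sysstat install; 'pipeline.sh' and the
      interval are placeholders.)

          # Harness sketch: sample CPU (-u), memory (-r), and disk I/O
          # (-b) every few seconds while the pipeline runs, keeping the
          # raw sar log for later analysis.
          import subprocess

          def benchmark(cmd, logfile="sar.log", interval=5):
              with open(logfile, "w") as log:
                  sar = subprocess.Popen(
                      ["sar", "-u", "-r", "-b", str(interval)], stdout=log)
                  try:
                      subprocess.check_call(cmd, shell=True)
                  finally:
                      sar.terminate()

          benchmark("bash pipeline.sh")  # hypothetical pipeline script
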
  13. Observation #1: Rackspace is faster.

      machine          data disk       working disk    hours   cost
      rackspace-15gb   200 GB          100 GB          34.9    $23.70
      m2.xlarge        EBS             ephemeral       44.7    $18.34
      m1.xlarge        EBS             ephemeral       45.5    $21.82
      m1.xlarge        EBS, max IOPS   ephemeral       49.1    $23.56
      m1.xlarge        EBS, max IOPS   EBS, max IOPS   52.5    $25.20
  14. Surprise #1: AWS ephemeral storage is FASTER. (Same table as above:
      on m1.xlarge with an EBS max-IOPS data disk, the ephemeral working
      disk finishes in 49.1 h / $23.56 vs 52.5 h / $25.20 with an EBS
      max-IOPS working disk.)
  15. Can't we just use a faster computer?
      • Demo data on m1.xlarge: 2789 s.
      • Demo data on m3.xlarge: 1970 s, ~30% faster!
      (Why? m3.xlarge has 2x40 GB SSD drives & 40% faster cores.)
      Great! Let's try it out!
  16. Observation #3: a multifaceted problem!
      • Full data on m1.xlarge: 45.5 h.
      • Full data on m3.xlarge: out of disk space.
      We need about 200 GB to run the full pipeline. You can have fast
      disk or lots of disk, but not both, for the moment.
  17. Future directions:
      1. Invest in cache-local data structures and algorithms.
      2. Invest in streaming/in-memory approaches.
      3. It's not clear (to me) that straight code optimization or
         infrastructure engineering is a worthwhile investment.
  18. Frequently Offered Solutions:
      1. "You should, like, totally multithread that."
         (See: McDonald & Brown, POSA.)
      2. "Hadoop will just crush that workload, dude."
         (Unlikely to be cost-effective.)
      3. "Have you tried <my proprietary Big Data technology stack>?"
         (Thatz Not Science.)
  19. Optimization vs scaling:
      • Linear time/memory improvements would not have addressed our core
        problem. (2 years, 20x improvement, 100x increase in data.)
      • The puzzle problem is a graph problem with big data, no locality,
        and little compute. Not friendly.
      • We need(ed) to scale our algorithms.
      • Can now run on a single chassis, in ~15 GB of RAM.
  20. Optimization vs scaling.
      [plot: compute resources (abstract) vs. size of problem]
  21. Scaling can be more important!
      [plot: compute resources (abstract) vs. size of problem]
  22. What are we losing by focusing our engineering on pleasantly
      parallel problems?
      • Hadoop is fundamentally not that interesting.
      • Research is about the 100x.
      • Scaling new problems, evaluating/creating new data structures and
        algorithms, etc.
  23. (From my PyCon 2011 talk.) Theme: Life's too short to tackle the
      easy problems – come to academia!
      [plot: parallelizability vs. resources ($$, etc.), contrasting
      "awesome stuff to research" with "easy stuff like Google Search"]
  24. Thanks!
      • Leigh Sheneman, for starting the benchmarking project.
      • Labbies: Michael R. Crusoe, Luiz Irber, Likit Preeyanon, Camille
        Scott, and Qingpeng Zhang.
  25. Thanks!
      • github.com/ged-lab/
        – khmer: core project
        – khmer-protocols: tutorials/acceptance tests
        – literate-resting: script to pull out code from reST tutorials
      • Blog post at: http://ivory.idyll.org/blog/2014-pycon.html
      • Michael R. Crusoe, Likit Preeyanon, Camille Scott, and Qingpeng
        Zhang are here at PyCon.
      …note, you can probably afford to buy them off me :)