Slide 1

Slide 1 text

Building petabyte-scale comparative genomics pipelines
Chris Cope and Jim DeLeo
SciPy 2014

Slide 2

Slide 2 text

Disclaimer: NOT A BIOLOGIST

Slide 3

Slide 3 text

Mission
Empower citizen scientists to analyze the many petabytes of public genomic data
● Cloud
● Internet2
● Hide the plumbing (mrjob) and Sequitur algorithm for genomics

Slide 4

Slide 4 text

Biology stuff
● Next-gen sequencing
● Alignment
● File formats
  ○ FASTA
  ○ BAM / SAM
● NIH/NLM/NCBI
More details on the internets
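For readers new to the FASTA format named above, here is a minimal sketch of a reader in plain Python. The function name `parse_fasta` and the two example records are illustrative assumptions and are not taken from the talk. A FASTA record is a header line starting with `>` followed by one or more lines of sequence.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record example (headers invented for illustration).
example = """>seq1 hypothetical record
CTGACTGTCGACCTCA
CGAAGTCCGCCGTAAGC
>seq2 hypothetical record
CTGACACCCGACCTCGCGAAGTCCGCCAAGCTC
"""

records = list(parse_fasta(example.splitlines()))
for header, seq in records:
    print(header, len(seq))
```

In practice a library reader such as the one in scikit-bio would likely be a better choice; the sketch only shows how little structure the format has.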

Slide 5

Slide 5 text

Genomics, why?

Slide 6

Slide 6 text

Sequence Alignment Map
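As a rough illustration of what a Sequence Alignment Map (SAM) record contains, the sketch below splits one alignment line into the 11 mandatory tab-separated fields defined by the SAM specification. The names `SamRecord` and `parse_sam_line` are assumptions made for this example, and the sample read is illustrative rather than data from the talk.

```python
from collections import namedtuple

# The 11 mandatory SAM columns, in order.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)


def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("SAM alignment lines need at least 11 fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)


def is_unmapped(rec):
    return bool(rec.flag & 0x4)   # FLAG bit 0x4: segment unmapped


def is_reverse_strand(rec):
    return bool(rec.flag & 0x10)  # FLAG bit 0x10: read on the reverse strand


# Illustrative record; any optional fields after column 11 are ignored here.
line = "\t".join(["r001", "99", "ref", "7", "30", "8M2I4M1D3M",
                  "=", "37", "39", "TTAGATAAAGGATACTG", "*"])
rec = parse_sam_line(line)
print(rec.qname, rec.rname, rec.pos, rec.cigar, is_reverse_strand(rec))
```

BAM is the compressed binary form of the same records, so reading it needs a dedicated library rather than string splitting.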

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Sequitur algorithm
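To give a feel for grammar inference on a DNA string, here is a small sketch. It is not Sequitur itself: Sequitur builds its grammar online, one symbol at a time, while enforcing digram uniqueness and rule utility. The sketch instead uses a simpler offline pair-replacement scheme (in the style of Re-Pair) that repeatedly replaces the most frequent adjacent pair with a new rule. The function names and rule labels (`R1`, `R2`, ...) are assumptions for illustration.

```python
from collections import Counter


def infer_grammar(seq, min_count=2):
    """Return (start_symbols, rules) using greedy pair replacement."""
    symbols = list(seq)
    rules = {}
    next_id = 1
    while True:
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < min_count:
            break  # no adjacent pair repeats any more
        name = "R%d" % next_id
        next_id += 1
        rules[name] = pair
        # Replace non-overlapping occurrences of the pair, left to right.
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols, rules


def expand(symbols, rules):
    """Expand a symbol list back into the original string."""
    out = []
    for s in symbols:
        if s in rules:
            out.append(expand(rules[s], rules))
        else:
            out.append(s)
    return "".join(out)


seq = "CTGACTGTCGACCTCACGAAGTCCGCCGTAAGC"
start, rules = infer_grammar(seq)
print(" ".join(start))
for name, body in rules.items():
    print(name, "->", " ".join(body))
```

The resulting rules form a context-free grammar that generates exactly the input string, which is the property a Sequitur grammar also has.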

Slide 10

Slide 10 text

resilientscience.github.io/wot WOT WOT?

Slide 11

Slide 11 text

Comparative genomics
Lexical analysis
● alignment
● “grep”
Syntactic analysis
● grammar
● CFG comparison
CTGACTGTCGACCTCACGAAGTCCGCCGTAAGC
CTGACACCCGACCTCGCGAAGTCCGCCAAGCTC
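The lexical side of the comparison can be sketched with the Python standard library alone. Assuming the two sequences on this slide are already aligned position by position (they have the same length), the example lists the mismatching positions and the substrings the two share. The helper names `mismatches` and `shared_blocks` are hypothetical, and `difflib` stands in here for a real alignment tool.

```python
from difflib import SequenceMatcher

A = "CTGACTGTCGACCTCACGAAGTCCGCCGTAAGC"
B = "CTGACACCCGACCTCGCGAAGTCCGCCAAGCTC"


def mismatches(a, b):
    """Positions where two equal-length sequences differ (Hamming-style)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def shared_blocks(a, b, min_len=4):
    """Substrings of at least min_len that difflib matches between a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [a[m.a:m.a + m.size]
            for m in matcher.get_matching_blocks()
            if m.size >= min_len]


diff_positions = mismatches(A, B)
blocks = shared_blocks(A, B)
print("mismatching positions:", diff_positions)
print("shared blocks:", blocks)
```

A syntactic comparison would instead infer a grammar for each sequence and compare the rules, which is the direction the talk points to.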

Slide 12

Slide 12 text

Computing stuff
● AWS (Cloud, MapReduce)
● National Library of Medicine
● Internet2
● Grammars

Slide 13

Slide 13 text

Mission
“Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.” - Andrew Tanenbaum

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

NIH - AWS 25 miles, 100 Gbps

Slide 16

Slide 16 text

Mission
ftp.ncbi.nlm.nih.gov

Slide 17

Slide 17 text

Mission
pip install mrjob
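Once mrjob is installed, a job is a small Python class with `mapper` and `reducer` methods. The sketch below counts k-mers in sequence lines; the class name `MRKmerCount`, the k-mer length `K`, and the choice of k-mer counting as the task are assumptions for illustration, not the pipeline from the talk. The `try`/`except` fallback is only there so the mapper and reducer logic can be exercised on a machine without mrjob.

```python
try:
    from mrjob.job import MRJob
except ImportError:
    # Assumption: fall back to a plain class so the logic below can still
    # be imported and tested where mrjob is not installed.
    MRJob = object

K = 4  # hypothetical k-mer length


class MRKmerCount(MRJob):
    """Count k-mers across lines of sequence text."""

    def mapper(self, _, line):
        seq = line.strip().upper()
        if not seq or seq.startswith(">"):
            return  # skip blank lines and FASTA header lines
        for i in range(len(seq) - K + 1):
            yield seq[i:i + K], 1

    def combiner(self, kmer, counts):
        # Pre-sum counts on each mapper node to reduce shuffle traffic.
        yield kmer, sum(counts)

    def reducer(self, kmer, counts):
        yield kmer, sum(counts)
```

Saved as a script with `MRKmerCount.run()` under an `if __name__ == '__main__':` guard, the job would run locally as `python kmer_count.py input.txt` (the file names are placeholders), and mrjob's `-r emr` option would send the same code to Elastic MapReduce. Because input is processed line by line, k-mers that span a line break are not counted in this sketch.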

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

Untuned end-to-end throughput = 3 GiB/s = 180 GiB/min
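To put the throughput figure in context, the short calculation below works out how long one pebibyte would take at 3 GiB/s and how that rate compares with the 100 Gbps link mentioned earlier. It assumes binary units for GiB and PiB and decimal gigabits for the link speed; both are assumptions about how the slide figures were measured.

```python
GIB = 2 ** 30  # bytes in one gibibyte
PIB = 2 ** 50  # bytes in one pebibyte

throughput = 3 * GIB  # bytes per second, from the slide

gib_per_min = throughput * 60 / GIB
seconds_per_pib = PIB / throughput
hours_per_pib = seconds_per_pib / 3600
days_per_pib = seconds_per_pib / 86400

# Assumption: "100 Gbps" means 100 * 10**9 bits per second.
link_bytes_per_s = 100e9 / 8
link_fraction = throughput / link_bytes_per_s

print("GiB per minute: %.0f" % gib_per_min)
print("hours per PiB: %.1f" % hours_per_pib)
print("days per PiB: %.2f" % days_per_pib)
print("share of a 100 Gbps link: %.0f%%" % (100 * link_fraction))
```

At this rate a pebibyte takes roughly four days, and the untuned pipeline uses about a quarter of the link's nominal capacity.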

Slide 21

Slide 21 text

What’s next?
● phenotype - genotype cookoff
  ○ alignment
  ○ HMM
  ○ Expert Systems (Jim)
● Sequitur for building new, and verifying old, Expert Systems
● EHR
● Imaging
● scikit-bio (yay SciPy2014!)

Slide 22

Slide 22 text

Acknowledgements
Jim DeLeo, NIH (CC)
Jonathan Simon, FDA
Ben Busby, NIH (NCBI)
Jon Riehl, Resilient
WOT talk tomorrow 3:30
resilientscience.github.io/wot
@chris_cope
ccope@resilientscience.com