Building petabyte-scale comparative genomics pipelines

Chris Cope and Jim DeLeo
SciPy 2014

Disclaimer: NOT A BIOLOGIST

Mission
Empower citizen scientists to analyze the many petabytes of public genomic data
• Cloud
• Internet2
• Hide the plumbing (mrjob) and Sequitur algorithm for genomics

Biology stuff
• Next-gen sequencing
• Alignment
• File formats
  ◦ FASTA
  ◦ BAM / SAM
• NIH/NLM/NCBI
More details on the internets

Genomics, why?

Sequence Alignment Map

Sequitur algorithm
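
To give a feel for what inferring a grammar from a sequence means, here is a minimal sketch. It is not Sequitur itself: Sequitur works online, one symbol at a time, and enforces digram uniqueness and rule utility. The sketch uses a simpler offline scheme (repeatedly replacing the most frequent adjacent pair with a new rule) that produces the same kind of output, a start string plus a set of rules. The names `infer_grammar` and `expand` and the `R1`, `R2`, ... rule labels are made up for illustration.

```python
from collections import Counter


def infer_grammar(sequence):
    """Greedy pair replacement: a simplified stand-in for Sequitur.

    Returns (start, rules), where start is a list of symbols and rules maps
    a rule name to the pair of symbols it stands for.
    """
    symbols = list(sequence)
    rules = {}
    next_id = 1
    while True:
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats, nothing left to factor out
        name = f"R{next_id}"
        next_id += 1
        rules[name] = pair
        # Rewrite the sequence left to right, replacing non-overlapping matches.
        rewritten = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                rewritten.append(name)
                i += 2
            else:
                rewritten.append(symbols[i])
                i += 1
        symbols = rewritten
    return symbols, rules


def expand(symbols, rules):
    """Expand rule names back into the original string."""
    return "".join(
        expand(rules[s], rules) if s in rules else s for s in symbols
    )


start, rules = infer_grammar("CTGACTGACTGA")
print(start)
print(rules)
```

Expanding the start string with the rules reproduces the input exactly, so the grammar is a lossless, structured rewrite of the sequence.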

resilientscience.github.io/wot WOT WOT?

Comparative genomics
Lexical analysis
• alignment
• “grep”
Syntactic analysis
• grammar
• CFG comparison
CTGACTGTCGACCTCACGAAGTCCGCCGTAAGC
CTGACACCCGACCTCGCGAAGTCCGCCAAGCTC
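
As a rough illustration of the lexical end of that spectrum, the two example sequences can be compared position by position. This is a sketch, not a real aligner: it assumes the two sequences are already aligned with no insertions or deletions, and the helper name `mismatches` is made up for illustration.

```python
def mismatches(a, b):
    """Return the 0-based positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


SEQ_A = "CTGACTGTCGACCTCACGAAGTCCGCCGTAAGC"
SEQ_B = "CTGACACCCGACCTCGCGAAGTCCGCCAAGCTC"

diff = mismatches(SEQ_A, SEQ_B)

# Print both sequences with a caret under every differing position.
marker = "".join("^" if i in diff else " " for i in range(len(SEQ_A)))
print(SEQ_A)
print(SEQ_B)
print(marker)
print(f"{len(diff)} of {len(SEQ_A)} positions differ")
```

A syntactic comparison would instead compare the grammars inferred from each sequence, which this sketch does not attempt.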

Computing stuff
• AWS (Cloud, MapReduce)
• National Library of Medicine
• Internet2
• Grammars

Mission
“Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.” - Andrew Tanenbaum

NIH - AWS 25 miles, 100 Gbps

Mission ftp.ncbi.nlm.nih.gov

Mission pip install mrjob

Untuned end-to-end throughput = 3 GiB/s = 180 GiB/min
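
The arithmetic behind that figure, and what it implies at petabyte scale, can be checked in a few lines. This is a back-of-the-envelope sketch with two assumptions not stated on the slides: that "100 Gbps" means decimal gigabits per second, and that the 3 GiB/s was measured over that same NIH - AWS link.

```python
GIB = 2**30  # bytes in one gibibyte
PIB = 2**50  # bytes in one pebibyte

throughput_bytes_per_s = 3 * GIB  # untuned end-to-end figure from the slide
link_bits_per_s = 100e9           # assumption: 100 Gbps, decimal gigabits

# 3 GiB/s expressed per minute
gib_per_min = throughput_bytes_per_s * 60 / GIB

# The same throughput in Gbps, and as a share of the link rate
throughput_gbps = throughput_bytes_per_s * 8 / 1e9
link_fraction = throughput_bytes_per_s * 8 / link_bits_per_s

# Time to move one pebibyte at the untuned rate
hours_per_pib = PIB / throughput_bytes_per_s / 3600

print(f"{gib_per_min:.0f} GiB/min")
print(f"{throughput_gbps:.1f} Gbps, {link_fraction:.0%} of the link")
print(f"{hours_per_pib:.1f} hours per PiB")
```

Under those assumptions the untuned pipeline uses roughly a quarter of the link, and one pebibyte takes about four days to move.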

What’s next?
• phenotype - genotype cookoff
  ◦ alignment
  ◦ HMM
  ◦ Expert Systems (Jim)
• Sequitur for building new, and verifying old, Expert Systems
• EHR
• Imaging
• scikit-bio (yay SciPy2014!)

Acknowledgements
Jim DeLeo NIH (CC)
Jonathan Simon FDA
Ben Busby NIH (NCBI)
Jon Riehl Resilient
WOT talk tomorrow 3:30
resilientscience.github.io/wot
@chris_cope [email protected]
