Chris Cope and Jim DeLeo
SciPy 2014
Building petabyte-scale
comparative genomics
pipelines
Slide 2
Disclaimer:
NOT A BIOLOGIST
Slide 3
Mission
Empower citizen scientists to analyze the
many petabytes of public genomic data
● Cloud
● Internet2
● Hide the plumbing (mrjob)
● Sequitur algorithm for genomics
Slide 4
Biology stuff
● Next-gen sequencing
● Alignment
● File formats
○ FASTA
○ BAM / SAM
● NIH/NLM/NCBI
More details on the internets
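As a rough illustration of the FASTA format listed above: each record is a header line starting with ">" followed by one or more lines of sequence. The sketch below is a minimal reader, not a replacement for a real library; the function name parse_fasta and the sample records are made up for this example.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first token after '>' as the record ID.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: a record may wrap across several lines.
            records[current_id] += line
    return records


# Hypothetical sample data, for illustration only.
sample = """>seq1 made-up record
ACGTAC
GTTA
>seq2
GGCC
""".splitlines()

for record_id, sequence in parse_fasta(sample).items():
    print(record_id, len(sequence), sequence)
```

This keeps every sequence in memory, which is fine for a demo but not for petabyte-scale inputs; a streaming generator would be the obvious next step.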
Computing stuff
● AWS (Cloud, MapReduce)
● National Library of Medicine
● Internet2
● Grammars
Slide 13
Mission
“Never underestimate the bandwidth of a station wagon full of tapes
hurtling down the highway.”
- Andrew Tanenbaum
Slide 15
NIH - AWS
25 miles, 100 Gbps
Slide 16
Mission
ftp.ncbi.nlm.nih.gov
Slide 17
Mission
pip install mrjob
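A sketch of the kind of job mrjob is meant to hide the plumbing for, assuming a k-mer count as the example task (the slides do not say which jobs were run). mrjob jobs are written as mapper and reducer generator methods on an MRJob subclass; the plain functions below mirror that shape so the logic runs locally without mrjob or a cluster. K, run_local, and the input reads are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

K = 3  # k-mer length; an arbitrary choice for this illustration


def mapper(_, line: str) -> Iterator[Tuple[str, int]]:
    """Emit (k-mer, 1) for every k-mer in one line of sequence."""
    seq = line.strip().upper()
    for i in range(len(seq) - K + 1):
        yield seq[i:i + K], 1


def reducer(kmer: str, counts: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Sum the counts collected for one k-mer."""
    yield kmer, sum(counts)


def run_local(lines: Iterable[str]) -> Dict[str, int]:
    """Stand-in for the shuffle step a MapReduce framework would perform."""
    shuffled = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            shuffled[key].append(value)
    results: Dict[str, int] = {}
    for key in sorted(shuffled):
        for kmer, total in reducer(key, shuffled[key]):
            results[kmer] = total
    return results


if __name__ == "__main__":
    # Made-up reads, for illustration only.
    for kmer, total in run_local(["ACGTAC", "CGTA"]).items():
        print(kmer, total)
```

To run under mrjob itself, the same two bodies would become methods (with a leading self argument) on a class deriving from mrjob.job.MRJob.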
Slide 20
Untuned end-to-end throughput = 3 GiB/s = 180 GiB/min
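To put the throughput figure in context, the back-of-the-envelope sketch below works out how long one pebibyte would take at 3 GiB/s, and what share of the 100 Gbps NIH - AWS link that rate uses. It assumes binary units for GiB and PiB and decimal units for Gbps, and it ignores protocol overhead; the variable and function names are illustrative.

```python
GIB = 2**30  # bytes in one gibibyte
PIB = 2**50  # bytes in one pebibyte


def transfer_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time to move size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s


rate = 3 * GIB                        # measured throughput, bytes per second
gib_per_minute = rate * 60 / GIB      # matches the 180 GiB/min on the slide

days_per_pib = transfer_seconds(PIB, rate) / 86_400

link_bytes_per_s = 100e9 / 8          # 100 Gbps link, decimal gigabits
link_fraction = rate / link_bytes_per_s

print(f"{gib_per_minute:.0f} GiB/min")
print(f"{days_per_pib:.2f} days per PiB")
print(f"{link_fraction:.1%} of the 100 Gbps link")
```

At this rate a pebibyte takes roughly four days, using about a quarter of the link's nominal capacity.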
Slide 21
What’s next?
● phenotype - genotype cookoff
○ alignment
○ HMM
○ Expert Systems (Jim)
● Sequitur for building new Expert Systems and verifying old ones
● EHR
● Imaging
● scikit-bio (yay SciPy2014!)
Slide 22
Acknowledgements
Jim DeLeo
NIH (CC)
Jonathan Simon
FDA
Ben Busby
NIH (NCBI)
Jon Riehl
Resilient
WOT talk
tomorrow 3:30
resilientscience.github.io/wot
@chris_cope
ccope@resilientscience.com