Building petabyte-scale comparative genomics pipelines

This talk covers Python tools and best practices for building reproducible petabyte-scale pipelines, using a new grammar-based approach to comparative genomics as the running example. The genome grammars are produced from public National Institutes of Health data, streamed over a high-throughput Internet2 connection to Amazon Web Services.
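To make the idea of a "genome grammar" concrete, here is a minimal sketch of grammar inference on a DNA string. It is not the Sequitur algorithm named in the slides and not the speaker's pipeline: it swaps in a simpler, offline pair-replacement scheme (in the spirit of Re-Pair) that repeatedly turns the most frequent adjacent pair of symbols into a new rule. The function names `build_grammar` and `expand`, the rule names `R1`, `R2`, ..., and the sample sequence are all made up for illustration.

```python
from collections import Counter


def build_grammar(sequence, min_count=2):
    """Infer a small grammar by replacing the most frequent adjacent pair.

    Returns (top_level_symbols, rules), where rules maps a rule name
    such as "R1" to the pair of symbols it stands for.
    """
    symbols = list(sequence)
    rules = {}
    next_id = 1
    while True:
        # Count every adjacent pair in the current symbol list.
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < min_count:
            break
        name = "R%d" % next_id
        next_id += 1
        rules[name] = pair
        # Rewrite left to right, replacing non-overlapping occurrences.
        rewritten = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                rewritten.append(name)
                i += 2
            else:
                rewritten.append(symbols[i])
                i += 1
        symbols = rewritten
    return symbols, rules


def expand(symbols, rules):
    """Expand grammar symbols back into the original string."""
    parts = []
    for symbol in symbols:
        if symbol in rules:
            parts.append(expand(rules[symbol], rules))
        else:
            parts.append(symbol)
    return "".join(parts)


if __name__ == "__main__":
    # Hypothetical toy sequence; real input would come from FASTA records.
    seq = "ACGTACGTACGTTTGACGT"
    top, rules = build_grammar(seq)
    print("top level:", top)
    for name, body in rules.items():
        print(name, "->", body)
    # The grammar is lossless: expanding it recovers the input.
    print("round trip ok:", expand(top, rules) == seq)
```

Repeated subsequences end up as shared rules, so two genomes can in principle be compared by the rules they have in common rather than base by base. Sequitur builds a similar hierarchy incrementally in a single pass, which matters when the input is streamed.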

Chris Cope

July 09, 2014


  1. Mission Empower citizen scientists to analyze the many petabytes of public genomic data • Cloud • Internet2 • Hide the plumbing (mrjob) and Sequitur algorithm for genomics
  2. Biology stuff • Next-gen sequencing • Alignment • File formats ◦ FASTA ◦ BAM / SAM • NIH/NLM/NCBI More details on the internets
  3. Comparative genomics Lexical analysis • alignment • “grep” Syntactic analysis

  4. Mission “Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.” - Andrew Tanenbaum
  5. What’s next? • phenotype - genotype cookoff ◦ alignment ◦ HMM ◦ Expert Systems (Jim) • Sequitur for building new, and verifying old Expert Systems • EHR • Imaging • scikit-bio (yay SciPy2014!)
  6. Acknowledgements Jim DeLeo NIH (CC) Jonathan Simon FDA Ben Busby NIH (NCBI) Jon Riehl Resilient WOT talk tomorrow 3:30 resilientscience.github.io/wot @chris_cope ccope@resilientscience.com