Building petabyte-scale comparative genomics pipelines

This talk covers Python tools and best practices for building reproducible petabyte-scale pipelines, using a new grammar-based approach to comparative genomics as the running example. The genome grammars are produced from public data from the National Institutes of Health, streamed over a high-throughput Internet2 connection to Amazon Web Services.

Chris Cope

July 09, 2014

Transcript

  1. Mission
    Empower citizen scientists to analyze the many petabytes of public genomic data
    • Cloud
    • Internet2
    • Hide the plumbing (mrjob) and Sequitur algorithm for genomics
  2. Biology stuff
    • Next-gen sequencing
    • Alignment
    • File formats
      ◦ FASTA
      ◦ BAM / SAM
    • NIH/NLM/NCBI
    More details on the internets
  3. Comparative genomics
    Lexical analysis
    • alignment
    • “grep”
    Syntactic analysis
    • grammar
    • CFG comparison
    CTGACTGTCGACCTCACGAAGTCCGCCGTAAGC
    CTGACACCCGACCTCGCGAAGTCCGCCAAGCTC
  4. Mission
    “Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.” - Andrew Tanenbaum
  5. What’s next?
    • phenotype - genotype cookoff
      ◦ alignment
      ◦ HMM
      ◦ Expert Systems (Jim)
    • Sequitur for building new, and verifying old Expert Systems
    • EHR
    • Imaging
    • scikit-bio (yay SciPy2014!)
  6. Acknowledgements
    Jim DeLeo NIH (CC)
    Jonathan Simon FDA
    Ben Busby NIH (NCBI)
    Jon Riehl Resilient
    WOT talk tomorrow 3:30
    resilientscience.github.io/wot
    @chris_cope
    [email protected]
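
Slide 3 contrasts lexical comparison (alignment, “grep”) with syntactic comparison (inferring a grammar per sequence, then comparing the grammars). The sketch below illustrates the syntactic idea on the two sequences from that slide. It is not the talk's implementation and it is not Sequitur: it uses a simpler offline scheme (greedy replacement of the most frequent adjacent pair, in the style of Re-Pair) as a stand-in, and the function names `infer_grammar`, `expand`, and `phrases` are hypothetical. "Comparison" here just means intersecting the sets of phrases that each grammar's rules expand to, which is one possible choice among many.

```python
from collections import Counter


def infer_grammar(sequence, min_count=2):
    """Greedy digram replacement (Re-Pair style), a simplified stand-in for Sequitur.

    Returns (start, rules): `start` is the compressed symbol list and `rules`
    maps each nonterminal name to the pair of symbols it replaces.
    """
    symbols = list(sequence)
    rules = {}
    while True:
        # Counts overlapping pairs too (e.g. "AAA"), which is fine for a sketch.
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < min_count:
            break
        # Multi-character names cannot collide with single-letter bases.
        name = "R%d" % (len(rules) + 1)
        rules[name] = pair
        # Rewrite left to right, replacing non-overlapping occurrences.
        rewritten, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                rewritten.append(name)
                i += 2
            else:
                rewritten.append(symbols[i])
                i += 1
        symbols = rewritten
    return symbols, rules


def expand(symbol, rules):
    """Expand a symbol back into the string of bases it stands for."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def phrases(rules):
    """The set of base strings generated by the rules of one grammar."""
    return {expand(name, rules) for name in rules}


# The two sequences shown on slide 3.
seq_a = "CTGACTGTCGACCTCACGAAGTCCGCCGTAAGC"
seq_b = "CTGACACCCGACCTCGCGAAGTCCGCCAAGCTC"

start_a, rules_a = infer_grammar(seq_a)
start_b, rules_b = infer_grammar(seq_b)

# Phrases that both grammars derived independently, longest first.
shared = phrases(rules_a) & phrases(rules_b)
print(sorted(shared, key=len, reverse=True))
```

Each grammar is lossless: expanding every symbol of `start` with `expand` reproduces the input sequence. At the scale the talk describes, a per-sequence step like this would presumably run inside a map task, with the comparison done in a later step.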