Strata + Hadoop World (San Jose) 2016

Timothy Danford
March 31, 2016

Presented in the "Spark and Beyond" track.


Transcript

  1. Cancer genomics analysis in the cloud with Spark and ADAM

    Timothy Danford (Tamr Inc., UC Berkeley AMPLab)
  2. everything I’m about to say is a lie

    (But maybe we can still learn something about software engineering in bioinformatics, in the next 40 minutes.)
  3. A Disease of the Genome

    Cancer is an important application area for genomics
    • “the most common genetic disease”
    • “a disease of the genome”
    But cancer poses special challenges to “regular” genomics tools
    • Analysis is differential
    • Contamination is hard to manage
    • Tumors evolve!
    Ding et al. Nature (2012); Fan et al. Oncology Letters (2012)
  4. biology may be impossible, but bioinformatics is just software

    (Even though bioinformaticians often treat their software like a black box.)
  5. My Goals Today

    1. Genomics is important
    2. Bioinformatics shouldn’t be a black box
    3. Just the right amount of wheel re-invention
    4. You can help!
  6. Under the Hood of a Somatic Variant Caller

    ▪ Mutect is a cutting-edge somatic variant caller for single nucleotide variation (SNV)
    ▪ Its workflow is a likelihood-odds calculation
    ▪ … wrapped in pre- and post-filters
    Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)
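
To make the "likelihood-odds calculation" on this slide concrete, here is a minimal sketch of a log-odds (LOD) score for one candidate site. It is a simplified illustration under stated assumptions, not Mutect's actual code: the function names, the pileup representation (a list of base and error-probability pairs), and the choice of the observed alternate-allele fraction as the variant model are all assumptions made for this example, and the pre- and post-filters are left out entirely.

```python
import math

def base_likelihood(base, error, ref, alt, f):
    """P(observed base) given ref/alt alleles and alt allele fraction f.

    `error` is the per-base sequencing error probability, assumed to be
    strictly between 0 and 1 and spread evenly over the 3 other bases.
    """
    if base == ref:
        return f * error / 3 + (1 - f) * (1 - error)
    if base == alt:
        return f * (1 - error) + (1 - f) * error / 3
    return error / 3

def log10_likelihood(pileup, ref, alt, f):
    """Sum of per-read log10 likelihoods, treating reads as independent."""
    return sum(math.log10(base_likelihood(b, e, ref, alt, f)) for b, e in pileup)

def lod_score(pileup, ref, alt):
    """log10 odds of 'variant at the observed allele fraction' versus
    'no variant, every alt base is a sequencing error' (f = 0)."""
    if not pileup:
        raise ValueError("pileup must contain at least one read")
    n_alt = sum(1 for b, _ in pileup if b == alt)
    f_hat = n_alt / len(pileup)
    return (log10_likelihood(pileup, ref, alt, f_hat)
            - log10_likelihood(pileup, ref, alt, 0.0))

# Hypothetical pileups: 30 reads at error rate 0.001.
with_variant = [("A", 0.001)] * 20 + [("T", 0.001)] * 10
reference_only = [("A", 0.001)] * 30
print(lod_score(with_variant, "A", "T"))
print(lod_score(reference_only, "A", "T"))
```

A caller built this way would compare the score against a threshold and then apply its filters; the point here is only that the core is ordinary likelihood arithmetic.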
  7. Second Verse (Almost the) Same as the First

    ▪ Strelka is another variant caller
    ▪ Both SNVs and Indels (Insertions-Deletions)
    ▪ Maximum-likelihood model from reads
    ▪ Numeric integration over allele frequencies
    ▪ Filters based on running the tool at High and Low Confidence levels
    Saunders et al. Bioinformatics 28, 1811-7 (2012)
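
"Numeric integration over allele frequencies" can be illustrated with a toy model. The sketch below assumes a plain binomial likelihood for seeing k alternate reads out of n and a flat prior on the allele frequency f, then integrates f out with the trapezoid rule. Both the model and the function names are assumptions chosen for illustration; Strelka's real model is considerably more elaborate.

```python
from math import comb

def binomial_likelihood(k, n, f):
    """P(k alt reads out of n | alt allele frequency f)."""
    return comb(n, k) * f**k * (1 - f) ** (n - k)

def marginal_likelihood(k, n, steps=1000):
    """Integrate the likelihood over f in [0, 1] under a flat prior,
    using the composite trapezoid rule with `steps` intervals."""
    h = 1.0 / steps
    total = 0.5 * (binomial_likelihood(k, n, 0.0) + binomial_likelihood(k, n, 1.0))
    for i in range(1, steps):
        total += binomial_likelihood(k, n, i * h)
    return total * h

# For this toy model the exact answer is 1 / (n + 1), which gives a
# convenient check on the numerical result.
print(marginal_likelihood(3, 30), 1 / 31)
```

The closed form exists only because the toy model is so simple; once error rates and tumor/normal mixtures enter the likelihood, numerical integration is the practical route.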
  8. What do a Variant Caller & a Spam Filter Have in Common?

    The building blocks of these systems might be familiar
    • Bayesian stuff
    • Likelihood functions
    • Numerical integration
    • Sampling
  9. Let’s Build the Bioinformatics of the Future

    Bioinformaticians are actively inventing
    ▪ Languages
    ▪ Tools
    ▪ User Interfaces
    for capturing and authoring these pipelines and workflows.
    http://www.commonwl.org/
  10. You Keep Saying “Workflow”

    • Toil is a workflow execution framework from UCSC
    • Apache 2 licensed
    • Python-based framework
    • Adapters for CWL, WDL (Broad Institute)
    https://github.com/BD2KGenomics/toil
  11. Toil and Trouble

    • Python modules to wrap standalone tools
    • Integrations with Docker
    • Toil manages progress, dependencies, restarts
    • Does this look like anything you’ve seen before?
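
What "manages progress, dependencies, restarts" means can be sketched in a few lines of plain Python. This is not Toil's API; `run_workflow`, the job names, and the in-memory `done` set are hypothetical stand-ins for what a real engine keeps in a persistent job store.

```python
def run_workflow(jobs, deps, done=None):
    """Run each job after its dependencies, skipping finished ones.

    jobs: dict mapping job name -> zero-argument callable
    deps: dict mapping job name -> list of job names that must run first
    done: set of job names completed in an earlier attempt (for restarts);
          it is updated in place, so progress survives a failing job
    Returns the list of job names executed in this call, in order.
    """
    done = set() if done is None else done
    executed = []

    def visit(name, stack=()):
        if name in done:
            return
        if name in stack:
            raise ValueError(f"dependency cycle at {name}")
        for parent in deps.get(name, []):
            visit(parent, stack + (name,))
        jobs[name]()  # may raise; everything already in `done` is kept
        done.add(name)
        executed.append(name)

    for name in jobs:
        visit(name)
    return executed

# Hypothetical three-stage pipeline wrapping standalone tools.
log = []
jobs = {
    "call_variants": lambda: log.append("call_variants"),
    "align": lambda: log.append("align"),
    "sort": lambda: log.append("sort"),
}
deps = {"sort": ["align"], "call_variants": ["sort"]}

print(run_workflow(jobs, deps))
# Restart after a failure in the last stage: only that stage runs again.
print(run_workflow(jobs, deps, done={"align", "sort"}))
```

A graph of stages, executed in dependency order, with completed work reused on retry: readers who know how Spark schedules a job may find the shape familiar.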
  12. Spark + Genomics = ADAM

    • Hosted at Berkeley and the AMPLab
    • Apache 2 License
    • Contributors from both research and commercial organizations
    • Core spatial primitives, variant calling
    • Avro and Parquet for data models and file formats
    http://bdgenomics.org/
  13. ADAM has Data Models, Common Operations

    Common Ops (e.g. Interval Join)
    Data Models (File Formats via Parquet)
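
An interval join pairs up records whose genomic regions overlap, for example reads against target regions. The sketch below is a naive single-machine version meant only to show the semantics; the `Region` type, the half-open coordinate convention, and the function names are assumptions for this example and are not ADAM's API, which runs the equivalent operation distributed on Spark.

```python
from collections import namedtuple

# Half-open interval [start, end) on a named contig (an assumed convention).
Region = namedtuple("Region", ["contig", "start", "end"])

def overlaps(a, b):
    """True when both regions are on the same contig and share a position."""
    return a.contig == b.contig and a.start < b.end and b.start < a.end

def interval_join(left, right):
    """Nested-loop join: every (l, r) pair whose regions overlap.

    Quadratic in input size; fine for illustration, not for genome-scale data.
    """
    return [(l, r) for l in left for r in right if overlaps(l, r)]

# Hypothetical reads and one target region.
reads = [
    Region("chr1", 100, 200),
    Region("chr1", 500, 600),
    Region("chr2", 100, 200),
]
targets = [Region("chr1", 150, 550)]

for read, target in interval_join(reads, targets):
    print(read, "overlaps", target)
```

The two chr1 reads pair with the target and the chr2 read does not. Scaling this up is mostly a question of how to partition both sides so overlapping regions land on the same worker.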
  14. ADAM and Toil Support the NCI Cloud Pilot

    Initial results on a hybrid GATK + ADAM variant calling pipeline (vs. GATK-only), coordinated using Toil
    • Hybrid ADAM pipeline was 2x to 3.5x faster than GATK-only on comparable hardware
    • GATK-only pipeline was 1.8x more expensive
    • The majority of these gains came from parallelizing filtering / pre-processing
    • These ADAM stages were 35x faster than their GATK counterparts, at 10% of the cost
  15. Bioinformatics Needs Your Help!

    Bioinformaticians Need Your Skills
    • Spark / Hadoop / etc.
    • Distributed Systems
    • “Big Data”
    • Advanced analytics
    You Can Help
    - Treat a patient
    - Discover a Drug
    - Cure a Disease
    Join Us!
  16. Thanks to...

    Matt Massie, Frank Nothaft, Uri Laserson, Carl Yeksigian, Michael Heuer, Justin Paschall, Jeff Hammerbacher, Anthony Philippakis, Beau Norgeot, Hannes Schmidt, Benedict Paten, Andy Palmer, Nidhi Agarwal, Ryan Williams, Janice Brown, David Bernick
    And thank you! Questions?
  17. End