[Spark for Big Data Analytics Symposium] "Next-Generation Genomics Using Spark & ADAM"

Timothy Danford
December 02, 2015

Presentation given via WebEx on Dec. 2nd, 2015, at the "Spark for Big Data Analytics" Symposium.

Transcript

  1. “Myths of Bioinformatics Software”

     1. Somebody will build on your code.
     2. You should have assembled a team to build your software.
     3. If you choose the right license, more people will use and build on your software.
     4. Making software free for commercial use shows you are not against companies.
     5. You should maintain your software indefinitely.
     6. Your “stable URL” can exist forever.
     7. You should make your software “idiot proof.”
     8. You used the right programming language for the task.

     Lior Pachter, https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/
  2. Formats are Incompletely Specified, Reinvented Databases

     •  Compressed text files & custom index formats
     •  User-defined attributes
     •  Multi-record structure
  3. State-of-the-Art: shared filesystems, ad hoc parallelism

     •  Hand-written task creation
     •  File formats instead of APIs or data models
        –  formats are poorly defined
        –  contain optional or redundant fields
        –  semantics are unclear
     •  Workflow engines can’t take advantage of common parallelism between stages
  4. Spark takes advantage of shared parallelism throughout a pipeline

     •  Many genomics analyses are naturally parallelizable
     •  Pipelines can often share parallelism between stages
     •  No intermediate files
     •  Separate implementation concerns:
        –  parallelization and scaling in the platform
        –  let methods developers focus on methods
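The idea of sharing parallelism between stages without intermediate files can be sketched in plain Python. This is not Spark or ADAM code: the stage functions, record fields (`mapq`, `reverse`), and `run_pipeline` helper are all hypothetical stand-ins, and each list in `partitions` stands in for a chunk of data that a platform could process on a separate worker.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]
Stage = Callable[[Iterator[Record]], Iterator[Record]]


def drop_low_quality(records: Iterator[Record]) -> Iterator[Record]:
    """Hypothetical stage: keep records with mapping quality >= 30."""
    return (r for r in records if r["mapq"] >= 30)


def tag_strand(records: Iterator[Record]) -> Iterator[Record]:
    """Hypothetical stage: annotate each record with its strand."""
    return ({**r, "strand": "-" if r["reverse"] else "+"} for r in records)


def run_pipeline(partitions: Iterable[List[Record]],
                 stages: List[Stage]) -> List[List[Record]]:
    """Chain every stage lazily over each partition.

    Records flow from one stage to the next in memory; nothing is
    written to disk between stages. The loop over partitions is the
    part a platform would distribute across workers.
    """
    results = []
    for partition in partitions:
        stream: Iterator[Record] = iter(partition)
        for stage in stages:
            stream = stage(stream)
        results.append(list(stream))
    return results


partitions = [
    [{"name": "r1", "mapq": 60, "reverse": False},
     {"name": "r2", "mapq": 10, "reverse": True}],
    [{"name": "r3", "mapq": 42, "reverse": True}],
]
output = run_pipeline(partitions, [drop_low_quality, tag_strand])
```

The stage functions know nothing about how the data is partitioned, which is the separation of concerns the slide describes: methods in the stages, scaling in whatever runs `run_pipeline`.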
  5. Spark + Genomics = ADAM

     •  Hosted at Berkeley and the AMPLab
     •  Apache 2 License
     •  Contributors from both research and commercial organizations
     •  Core spatial primitives, variant calling
     •  Avro and Parquet for data models and file formats
  6. Why Are We Still Defining File Formats By Hand?

     •  Instead of defining custom file formats for each data type and access pattern…
     •  Parquet creates a compressed format for each Avro-defined data model.
     •  Improvement over existing formats [1]
        –  20-22% for BAM
        –  ~95% for VCF

     [1] compression % quoted from 1K Genomes samples
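To make "Avro-defined data model" concrete, here is a minimal sketch of a record schema written in Avro's JSON schema syntax, plus a tiny conformance check. The `ExampleRead` schema and its field names are hypothetical and are not ADAM's actual schemas, and `conforms` is a hand-rolled illustration that handles only the few types used here, not the Avro library.

```python
import json

# Hypothetical schema in Avro's JSON syntax (not taken from ADAM).
SCHEMA_JSON = """
{
  "type": "record",
  "name": "ExampleRead",
  "fields": [
    {"name": "contig", "type": "string"},
    {"name": "start", "type": "long"},
    {"name": "sequence", "type": "string"},
    {"name": "sample", "type": ["null", "string"], "default": null}
  ]
}
"""

# Only the primitive types used above are mapped.
PYTHON_TYPES = {"string": str, "long": int, "null": type(None)}


def conforms(record: dict, schema: dict) -> bool:
    """Return True if every schema field is present (or defaulted)
    with a value of an allowed type."""
    for field in schema["fields"]:
        declared = field["type"]
        allowed = declared if isinstance(declared, list) else [declared]
        value = record.get(field["name"], field.get("default"))
        if not any(isinstance(value, PYTHON_TYPES[t]) for t in allowed):
            return False
    return True


schema = json.loads(SCHEMA_JSON)
good = {"contig": "chr1", "start": 100, "sequence": "ACGT"}
bad = {"contig": "chr1", "start": "100", "sequence": "ACGT"}
```

Here `conforms(good, schema)` holds and `conforms(bad, schema)` does not, because `start` is a string rather than a long. The point is that the field list, types, and optionality live in one declared schema rather than in the conventions of a hand-built file format.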
  7. Existing Methods Often Written as a Sequence of Filters

     Cibulskis et al. “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples” (2013)
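A method written as a sequence of filters can be sketched as a list of predicates applied in order. The filter names, thresholds, and candidate fields below are hypothetical and are not the filters from the cited paper; the sketch only shows the shape of such a pipeline.

```python
from typing import Callable, Dict, Iterable, List

Candidate = Dict[str, float]
Filter = Callable[[Candidate], bool]


def min_depth(threshold: int) -> Filter:
    """Hypothetical filter: require at least `threshold` reads."""
    return lambda c: c["depth"] >= threshold


def min_alt_fraction(threshold: float) -> Filter:
    """Hypothetical filter: require a minimum alternate-allele fraction."""
    return lambda c: c["alt_fraction"] >= threshold


def apply_filters(candidates: Iterable[Candidate],
                  filters: List[Filter]) -> List[Candidate]:
    """Apply each filter in turn, keeping only the survivors."""
    survivors = list(candidates)
    for keep in filters:
        survivors = [c for c in survivors if keep(c)]
    return survivors


candidates = [
    {"pos": 100, "depth": 40, "alt_fraction": 0.25},
    {"pos": 200, "depth": 8, "alt_fraction": 0.50},
    {"pos": 300, "depth": 35, "alt_fraction": 0.01},
]
passed = apply_filters(candidates, [min_depth(14), min_alt_fraction(0.05)])
```

Because each filter sees one candidate at a time, a chain like this maps naturally onto a per-record parallel operation.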
  8. File Assumptions are Still Embedded in Our Methods

     •  A single piece of a single filtering stage for a somatic variant caller
     •  Can you spot the embedded file-format assumption?

     Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)
  9. File Assumptions are Still Embedded in Our Methods

     •  A single piece of a single filtering stage for a somatic variant caller
     •  “11-base-pair window centered on a candidate mutation” actually turns out to be optimized for a particular file format and sort order

     Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)
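For reference, an 11-base-pair window centered on a position spans five bases on either side. The sketch below computes that window and selects nearby events purely by coordinate, so the result does not depend on the order in which records arrive. It is an illustration under assumptions, not the cited caller's implementation: the `(contig, position)` event tuples and both function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Event = Tuple[str, int]  # hypothetical (contig, position) pair


def window_bounds(position: int, width: int = 11) -> Tuple[int, int]:
    """Inclusive bounds of an odd-width window centered on `position`."""
    if width % 2 == 0:
        raise ValueError("width must be odd to center on one position")
    half = width // 2
    return position - half, position + half


def events_near(events: Iterable[Event], contig: str,
                position: int, width: int = 11) -> List[Event]:
    """Select events inside the window by coordinate alone.

    The input may be in any order; the output is sorted so that the
    result is the same regardless of how the input was sorted.
    """
    lo, hi = window_bounds(position, width)
    return sorted(e for e in events if e[0] == contig and lo <= e[1] <= hi)


events = [("chr1", 106), ("chr1", 95), ("chr2", 100), ("chr1", 105)]
nearby = events_near(events, "chr1", 100)
```

With the default width, `window_bounds(100)` is `(95, 105)`, so the event at `chr1:106` falls just outside the window and the `chr2` event is excluded by contig.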
  10. Does Bioinformatics Need Another “Workflow Engine?”

      •  No: it has a few already, it will require rewriting all our software, we should focus on methods instead.
      •  Yes: we need to move to commodity computing, start planning for a day when sharing is not copying, write methods that scale with more resources

      Most importantly: separate “developing a method” from “building a platform,” and allow different developers to work separately on both
  11. “Myths of Bioinformatics Software”

      1. Somebody will build on your code.
      2. You should have assembled a team to build your software.
      3. If you choose the right license, more people will use and build on your software.
      4. Making software free for commercial use shows you are not against companies.
      5. You should maintain your software indefinitely.
      6. Your “stable URL” can exist forever.
      7. You should make your software “idiot proof.”
      8. You used the right programming language for the task.

      Lior Pachter, https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/
  12. “Myths of Bioinformatics Software”

      1. Somebody will build on your code.
      2. You should have assembled a team to build your software.
      3. If you choose the right license, more people will use and build on your software.
      4. Making software free for commercial use shows you are not against companies.
      5. You should maintain your software indefinitely.
      6. Your “stable URL” can exist forever.
      7. You should make your software “idiot proof.”
      8. You used the right programming language for the task.

      Lior Pachter, https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/
  13. Thanks to...

      Matt Massie, Frank Nothaft, Uri Laserson, Carl Yeksigian, Michael Heuer, Jeff Hammerbacher, Andy Palmer, Nidhi Agarwal, David Bernick, David An, Eric Golin, Anthony Philippakis

      And thank you!