[Strata NYC 2015] "Next-Generation Genomics Using ADAM and Spark"

Timothy Danford
September 30, 2015

Presentation given on 9/30/2015 at Strata NYC.

Transcript

  1. Lambert et al. “Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer's disease” (2013)
  2. A Tale of Three File Formats

    BAM Files: Do You Read Me?
    •  Compressed text files & custom index formats
    •  User-defined attributes
    •  Multi-record structure
  3. Why Are We Still Defining File Formats By Hand?

    •  Instead of defining custom file formats for each data type and access pattern…
    •  Parquet creates a compressed format for each Avro-defined data model.
    •  Improvement over existing formats¹
        •  20-22% for BAM
        •  ~95% for VCF
    ¹ Compression % quoted from 1K Genomes samples
  4. Spark + Genomics = ADAM

    •  Hosted at Berkeley and the AMPLab
    •  Apache 2 License
    •  Contributors from both research and commercial organizations
    •  Core spatial primitives, variant calling
    •  Avro and Parquet for data models and file formats
  5. The Terrible Trouble with Existing Pipelines

    Cibulskis et al. “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples” (2013)
  6. “I think you know what the problem is, just as well as I do.”

    A single piece of a filtering stage for a somatic variant caller, “11-base-pair window centered on a candidate mutation,” actually turns out to be optimized for a particular file format and sort order.
  7. “Myths of Bioinformatics Software”

    1.  Somebody will build on your code.
    2.  You should have assembled a team to build your software.
    3.  If you choose the right license, more people will use and build on your software.
    4.  Making software free for commercial use shows you are not against companies.
    5.  You should maintain your software indefinitely.
    6.  Your “stable URL” can exist forever.
    7.  You should make your software “idiot proof”.
    8.  You used the right programming language for the task.

    Lior Pachter, https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/
  8. Thanks to...

    Matt Massie, Frank Nothaft, Uri Laserson, Carl Yeksigian, Michael Heuer, Jeff Hammerbacher, Anthony Philippakis, Andy Palmer, Nidhi Agarwal, Janice Brown, David Bernick, David An, Eric Golin

    And thank you! Questions?
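
Slide 6 quotes a filter defined over an “11-base-pair window centered on a candidate mutation” and notes that the real implementation ended up tied to a particular file format and sort order. As a rough illustration only, the same idea can be written as a plain overlap test that makes no assumption about how the reads are stored or ordered. The `Read` type, the 0-based half-open coordinates, and both function names below are hypothetical; this is not the variant caller's code and not ADAM's API.

```python
from dataclasses import dataclass

WINDOW_SIZE = 11  # base pairs, the window size quoted on slide 6


@dataclass(frozen=True)
class Read:
    """Hypothetical minimal aligned read; coordinates are 0-based, half-open."""
    contig: str
    start: int
    end: int


def window_around(position, size=WINDOW_SIZE):
    """Return the half-open interval [lo, hi) of `size` bases centered on `position`."""
    half = size // 2
    # Clamp at 0 so a candidate near the start of a contig gets a truncated window.
    return max(0, position - half), position + half + 1


def reads_in_window(reads, contig, position, size=WINDOW_SIZE):
    """Select reads overlapping the window, scanning every read.

    The scan is linear and order-independent on purpose: it gives the same
    answer whether or not the input is sorted by coordinate.
    """
    lo, hi = window_around(position, size)
    return [r for r in reads
            if r.contig == contig and r.start < hi and r.end > lo]


# Deliberately unsorted input: the result does not depend on sort order.
reads = [
    Read("chr1", 106, 120),  # starts exactly where the window ends: excluded
    Read("chr2", 95, 106),   # right coordinates, wrong contig: excluded
    Read("chr1", 90, 96),    # overlaps the first base of the window: included
]
selected = reads_in_window(reads, "chr1", 100)
```

A linear scan like this trades speed for independence from layout. An indexed or sorted representation would make the lookup faster, which is presumably why the original filter leaned on sort order in the first place.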