[Spark for Big Data Analytics Symposium] "Next-Generation Genomics Using Spark & ADAM"

Next-Generation Genomics Using Spark & ADAM
Timothy Danford
UC Berkeley, AMPLab
Tamr Inc.

“Myths of Bioinformatics Software”
1. Somebody will build on your code.
2. You should have assembled a team to build your software.
3. If you choose the right license, more people will use and build on your software.
4. Making software free for commercial use shows you are not against companies.
5. You should maintain your software indefinitely.
6. Your “stable URL” can exist forever.
7. You should make your software “idiot proof.”
8. You used the right programming language for the task.
Lior Pachter, https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/

Existing Bioinformatics Software is a Hopeless Mix of Method, Code, and Platform

Formats are Incompletely Specified, Reinvented Databases
• Compressed text files & custom index formats
• User-defined attributes
• Multi-record structure

“Pipelines” are Stitched Together From Command-line Tools

State-of-the-Art: shared filesystems, ad hoc parallelism
• Hand-written task creation
• File formats instead of APIs or data models
  – formats are poorly defined
  – contain optional or redundant fields
  – semantics are unclear
• Workflow engines can’t take advantage of common parallelism between stages

Spark takes advantage of shared parallelism throughout a pipeline
• Many genomics analyses are naturally parallelizable
• Pipelines can often share parallelism between stages
• No intermediate files
• Separate implementation concerns:
  – parallelization and scaling in the platform
  – let methods developers focus on methods
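As a rough illustration of stages sharing one partitioning with no intermediate files, here is a toy sketch in plain Python. It is not Spark and not ADAM code: the `Read` tuple layout, the two stage functions, and the quality cutoff of 20 are all assumptions made up for the example, and a thread pool merely stands in for the platform's scheduler.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical toy read: (contig, start position, mapping quality).
Read = Tuple[str, int, int]
Stage = Callable[[List[Read]], List[Read]]


def partition_by_contig(reads: Iterable[Read]) -> Dict[str, List[Read]]:
    """Split the reads once, by contig; every later stage reuses this split."""
    parts: Dict[str, List[Read]] = defaultdict(list)
    for read in reads:
        parts[read[0]].append(read)
    return dict(parts)


def drop_low_quality(reads: List[Read]) -> List[Read]:
    """Method stage: keep reads with mapping quality >= 20 (assumed cutoff)."""
    return [r for r in reads if r[2] >= 20]


def sort_by_position(reads: List[Read]) -> List[Read]:
    """Method stage: order the reads in a partition by start position."""
    return sorted(reads, key=lambda r: r[1])


def run_pipeline(reads: Iterable[Read], stages: List[Stage]) -> Dict[str, List[Read]]:
    """Platform concern: partition once, then run all stages per partition."""
    parts = partition_by_contig(reads)

    def run_partition(part: List[Read]) -> List[Read]:
        # Data passes from stage to stage in memory; nothing is written to disk.
        for stage in stages:
            part = stage(part)
        return part

    with ThreadPoolExecutor() as pool:
        results = pool.map(run_partition, parts.values())
        return dict(zip(parts.keys(), results))


if __name__ == "__main__":
    reads = [("chr1", 300, 60), ("chr2", 50, 10), ("chr1", 100, 30), ("chr2", 10, 40)]
    print(run_pipeline(reads, [drop_low_quality, sort_by_position]))
```

The method stages know nothing about partitioning or threads, which is the separation of concerns the slide argues for: `run_pipeline` could be swapped for a cluster scheduler without touching either stage.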

Spark + Genomics = ADAM
• Hosted at Berkeley and the AMPLab
• Apache 2 License
• Contributors from both research and commercial organizations
• Core spatial primitives, variant calling
• Avro and Parquet for data models and file formats

Why Are We Still Defining File Formats By Hand?
• Instead of defining custom file formats for each data type and access pattern…
• Parquet creates a compressed format for each Avro-defined data model.
• Improvement over existing formats¹
  – 20-22% for BAM
  – ~95% for VCF
¹ Compression % quoted from 1K Genomes samples
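To make "Avro-defined data model" concrete, the sketch below declares a small record schema in Avro's JSON schema syntax and checks records against it. The `Variant` schema and its field names are hypothetical and are not ADAM's actual schema; `conforms` is a hand-written stand-in for a real Avro library, and it ignores Avro defaults and complex types.

```python
import json

# Hypothetical schema in Avro's JSON syntax; NOT the schema ADAM ships.
VARIANT_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Variant",
  "fields": [
    {"name": "contig", "type": "string"},
    {"name": "position", "type": "long"},
    {"name": "referenceAllele", "type": "string"},
    {"name": "alternateAllele", "type": ["null", "string"], "default": null}
  ]
}
""")

# Avro primitive type names mapped to the Python types used in this sketch.
_PRIMITIVES = {
    "null": type(None),
    "boolean": bool,
    "int": int,
    "long": int,
    "double": float,
    "string": str,
}


def _matches(value, avro_type) -> bool:
    """Check one value against a primitive type name or a union (a list)."""
    if isinstance(avro_type, list):
        return any(_matches(value, branch) for branch in avro_type)
    expected = _PRIMITIVES[avro_type]
    # bool is a subclass of int in Python, so reject it for numeric fields.
    if isinstance(value, bool) and expected is not bool:
        return False
    return isinstance(value, expected)


def conforms(record: dict, schema: dict) -> bool:
    """True if the record has exactly the schema's fields with matching types."""
    fields = {f["name"]: f["type"] for f in schema["fields"]}
    if set(record) != set(fields):
        return False
    return all(_matches(record[name], t) for name, t in fields.items())


if __name__ == "__main__":
    good = {"contig": "chr1", "position": 12345,
            "referenceAllele": "A", "alternateAllele": None}
    print(conforms(good, VARIANT_SCHEMA))
```

The point of the design is that the schema, not a parser for a hand-specified text layout, is the single definition of the data; a storage layer such as Parquet can then derive the on-disk format from it.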

ADAM Implements Core Genomics Primitives On Spark
(Frank will talk about this in more detail…)

Existing Methods Often Written as a Sequence of Filters
Cibulskis et al. “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples” (2013)

File Assumptions are Still Embedded in Our Methods
• A single piece of a single filtering stage for a somatic variant caller
• Can you spot the embedded file-format assumption?
Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)

File Assumptions are Still Embedded in Our Methods
• A single piece of a single filtering stage for a somatic variant caller
• “11-base-pair window centered on a candidate mutation” actually turns out to be optimized for a particular file format and sort order
Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)
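One way to state the same window without leaning on a file layout is as a predicate over a read data model. The sketch below is an illustration only and is not the implementation from Cibulskis et al.; the `Read` class, the function name, and the 0-based half-open coordinate convention are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Read:
    """Hypothetical aligned read; coordinates are 0-based, end is exclusive."""
    contig: str
    start: int
    end: int


def reads_overlapping_window(
    reads: Iterable[Read], contig: str, position: int, half_width: int = 5
) -> List[Read]:
    """Return reads overlapping the window centered on `position`.

    With half_width=5 the window covers 11 bases. The test is a plain
    interval overlap, so it does not depend on how the reads are sorted.
    """
    lo = position - half_width        # first base inside the window
    hi = position + half_width + 1    # one past the last base in the window
    return [
        r for r in reads
        if r.contig == contig and r.start < hi and r.end > lo
    ]


if __name__ == "__main__":
    reads = [
        Read("chr1", 90, 96),
        Read("chr1", 106, 120),
        Read("chr1", 100, 101),
        Read("chr2", 100, 101),
        Read("chr1", 80, 95),
    ]
    print(reads_overlapping_window(reads, "chr1", 100))
```

Because the predicate scans every read, it gives the same set of reads for any input order. A platform can then decide how to make it fast (sorting, indexing, partitioning by region) without the method's definition changing.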

“Yet Another Workflow Engine??”

Does Bioinformatics Need Another “Workflow Engine?”
• No: it has a few already, it will require rewriting all our software, we should focus on methods instead.
• Yes: we need to move to commodity computing, start planning for a day when sharing is not copying, write methods that scale with more resources
Most importantly: separate “developing a method” from “building a platform,” and allow different developers to work separately on both

“Myths of Bioinformatics Software”
1. Somebody will build on your code.
2. You should have assembled a team to build your software.
3. If you choose the right license, more people will use and build on your software.
4. Making software free for commercial use shows you are not against companies.
5. You should maintain your software indefinitely.
6. Your “stable URL” can exist forever.
7. You should make your software “idiot proof.”
8. You used the right programming language for the task.
Lior Pachter, https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/

Thanks to...
Matt Massie, Frank Nothaft, Uri Laserson, Carl Yeksigian, Michael Heuer, Jeff Hammerbacher, Andy Palmer, Nidhi Agarwal, David Bernick, David An, Eric Golin, Anthony Philippakis
And thank you!
