SMRT Informatics Developers Conference 2016 Keynote

Sergey Koren, @sergekoren
Genome Informatics Section, NHGRI
Single-molecule assembly for genomes large and small

Theorem: Long reads solve everything
Proof:
- Everything is a metagenome
- Assembly solves metagenomics
- Long reads solve assembly
- Long reads solve everything*
*This is hyperbole

Short read assembly is hard
- Want to find an optimal tour
  - Consistent with the original fragments
  - Fragment arrival rate is consistent with a Poisson process
  - Read-pair distance and orientation is preserved, etc.
- Many possible paths
  - Use a variety of heuristics to find good path
  - Output only the high-confidence paths through the graph (contigs)
- Hard to assemble, easier to validate
  - Run a bunch and pick the best!
Mark Chaisson
Automated ensemble assembly and validation of microbial genomes. Koren et al. (2014) BMC Bioinformatics

Why is assembly hard?
- Repeats cause tangles and cycles in the graph
  - A randomly generated sequence is trivial to assemble
  - Repeats shorter than the read length don’t matter
- “It”: >1,000 (SSR)
- “It was”: 320 (TE)
- “It was the best”: 2 (SegDup)
- “It was the best of times”: 1 (Unique)
- “With his hands in his pockets”: 3 (Meta)
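The point that repeats shorter than the read length do not matter can be illustrated with a toy k-mer count. This is a minimal sketch, not anything taken from the talk: the `genome` string, the 6 bp repeat, and the function name `repeated_kmer_fraction` are all made up for illustration, and k stands in for read length.

```python
from collections import Counter


def repeated_kmer_fraction(seq: str, k: int) -> float:
    """Return the fraction of distinct k-mers in seq that occur more than once."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    if not counts:
        return 0.0
    repeated = sum(1 for c in counts.values() if c > 1)
    return repeated / len(counts)


# Hypothetical toy genome: two copies of a 6 bp repeat separated by
# unique flanking sequence.
REPEAT = "GGATCC"
genome = "ATCGA" + REPEAT + "TTGAC" + REPEAT + "CAGTA"

if __name__ == "__main__":
    for k in (3, 6, 7):
        frac = repeated_kmer_fraction(genome, k)
        print(f"k={k}: {frac:.3f} of distinct k-mers are repeated")
```

Once k exceeds the repeat length, every k-mer that touches a repeat copy also includes unique flanking sequence, so no k-mer occurs twice and the toy genome has no ambiguity left.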

How long are microbial repeats?
Golden Threshold
Reducing assembly complexity of microbial genomes with single-molecule sequencing. Koren et al. (2013) Genome Biology

How long do reads need to be?
k = 50
k = 1,000
k = 7,000
Golden Threshold
P5C3 (84% > 7 Kbp)
One chromosome, one contig. Koren and Phillippy (2015) Curr Opin Microbiol

Long read assembly
Hybrid error correction and de novo assembly of single-molecule sequencing reads. Koren et al. (2012) Nature Biotechnology
Reducing assembly complexity of microbial genomes with single-molecule sequencing. Koren et al. (2013) Genome Biology
Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing. Berlin et al. (2015) Nature Biotechnology
With Canu: 25x of PacBio P6C4 achieves:
- >90% of bacteria assemble without gaps
- >QV40 (99.99%) consensus accuracy
- <15 minutes of compute
- <$1,000 total cost
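The pairing of QV40 with 99.99% follows from the standard Phred scale, where QV = -10 * log10(error rate). A minimal sketch of the conversion, with function names chosen here for illustration:

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to per-base accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert per-base accuracy (0 <= accuracy < 1) to a Phred-scaled quality value."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


if __name__ == "__main__":
    for qv in (20, 30, 40):
        print(f"QV{qv}: accuracy {qv_to_accuracy(qv):.4%}")
```

On this scale each additional 10 QV points means a tenfold drop in error rate, so QV40 corresponds to one expected error per 10,000 bases.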

Eukaryotic assembly
Error correction and assembly complexity of single molecule sequencing reads. Lee et al. (2014) bioRxiv

D. melanogaster Assembly
Improvements in:
- Y heterochromatin
- Telomeres, centromeres
- TE and repeat resolution
New biology discovered!
Complete assembly of Chromosome 3L

Drosophila melanogaster Y
Casey Bergman, Sue Celniker, Jason Chin, Jane Landolin
Tracey Chapman

Birth of a new gene on the Y chromosome of Drosophila melanogaster
Carvalho et al. (2015)

Finally, we emphasize the utility of PacBio technology in dealing with difficult genomic regions: as was the case with the Mst77Y region, [MHAP] produced a seemingly error-free assembly of the FDY region, something that has eluded us for years of hard work.

Here we describe flagrante delicto Y (FDY), a very young gene that shows how Y-linked genes were acquired. FDY originated 2 million years ago from a duplication of a contiguous autosomal segment of 11 kb containing five genes that inserted into the Y chromosome. Four of these autosome-to-Y gene copies became inactivated (“pseudogenes”), lost part of their sequences, and most likely will disappear in the next few million years. FDY, originally a female-biased gene, acquired testis expression and remained functional.

Anopheles gambiae Y
Nora Besansky, ND
Sam Cotton
Igor Sharakhov

Human genome assembly solved?
[Figure: CHM1 Canu; chromosomes 1-22 and X]

Building a better reference
[Figure: axes are "#NA12878 - #CHM13:125 hets (×10^6)" and "#CHM13:125 heterozygous SNPs (×10^3)"; points labeled CHM13-0983455, CHM13-0983465, CHM13-0983475, CHM13-1015355, CHM13-1015385, CHM1-0772585, CHM1-1.1, CHM1-1007805, hs37m, hs38, huref]
Heng Li, Personal Communication

Genetic variation and the de novo assembly of human genomes
Chaisson, Wilson, Eichler (2015) emphasize the importance of complete de novo assembly as opposed to read mapping as the primary means to understanding the full range of human genetic variation.

Structural Analysis in Cancers
Alejandro Gutierrez & Sarah Morton
Mike Schatz & Maria Nattestad

Contigs ≠ Genome ≠

Bridging the gap
Bionano Genomics
From http://www.bionanogenomics.com/technology/irys-technology/
HiC
Crosslink, Fragment, Proximity Ligation, Sequence Junctions
adapted from Lieberman-Aiden, et al. Science, 2009

Hi-C can be used for 3D modeling and scaffolding of genome assemblies
- 3D model of genome: Duan, et al. Nature, 2010
- Genome scaffolding: Burton, et al. Nature Biotech, 2013
- Haplotype phasing: Selvaraj, et al. Nature Biotech, 2013
[Figure: chromosome-by-chromosome plot, chromosomes 1-9 on both axes. From I. Liachko]
Crosslink, Fragment, Proximity Ligation, Sequence Junctions

Goat PacBio+BioNano vs RefV2

              RefV2        PacBio      PacBio + BioNano  PacBio + BioNano + HiC
Ctgs  # bp    2.72Gbp      2.63Gbp     2.62Gbp           2.63Gbp
      # ctg   173,141      3,096       2,349             1,522
      Max     679,126      35,623,478  57,511,119        66,489,255
      N50     73,533       4,473,169   12,102,227        23,340,314
Scfs  # bp    2.69Gbp      2.63Gbp     2.62Gbp           2.63Gbp
      # scf   30           3,096       2,084             525
      Max     161,917,960  35,623,478  66,727,870        157,517,791
      N50     103,731,018  4,473,169   14,265,070        91,787,174
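For readers unfamiliar with the N50 rows above: N50 is the length L such that contigs (or scaffolds) of length L or longer contain at least half of the assembled bases. A minimal sketch of the usual computation follows; the lengths in the usage example are invented and are not the goat data.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first length where the running sum reaches half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    toy_contigs = [80, 70, 50, 40, 30, 20]
    print("N50 =", n50(toy_contigs))
```

Because N50 is weighted by length, merging contigs into fewer, longer sequences raises it even when the total number of bases is unchanged, which is the pattern across the columns of the table.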

PacBio+BioNano+HiC to reference
[Figure panels: chr1, chr8, chrX, chr28]

Continuing speedups D. melanogaster H. sapiens ?

Canu
- Overlap: MHAP (MinHash alignment)
- Correct: PBCR (Corrected reads)
- Trim: OBT (Overlap based trimming)
- Assemble: BOGART (Best overlap graph)
- Consensus: UTGCNS (Refactored, >100X faster)
- Polish: Quiver / Nanopolish
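The MinHash idea behind the overlap step can be sketched in a few lines: each read is reduced to the minimum hash of its k-mers under several hash functions, and the fraction of matching minima estimates the Jaccard similarity of the two k-mer sets. This is an illustrative sketch only and not MHAP's implementation; the function names, the salted BLAKE2b hash, and the values of k and the sketch size are assumptions made for the example.

```python
import hashlib
from typing import List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer: str, seed: int) -> int:
    """Deterministic 64-bit hash of a k-mer; the seed selects the hash function."""
    digest = hashlib.blake2b(
        kmer.encode("ascii"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq: str, k: int, num_hashes: int) -> List[int]:
    """Keep the minimum hash value of the k-mer set under each hash function."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a: List[int], sketch_b: List[int]) -> float:
    """Estimate Jaccard similarity as the fraction of positions with equal minima."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


if __name__ == "__main__":
    # Hypothetical reads: read_b overlaps the second half of read_a.
    read_a = "ACGTTGCAGGATCCTTGACGGATACCAGTA"
    read_b = "TTGACGGATACCAGTACCGTAGGCTTAACG"
    a = minhash_sketch(read_a, k=5, num_hashes=128)
    b = minhash_sketch(read_b, k=5, num_hashes=128)
    print(f"estimated Jaccard similarity: {estimate_jaccard(a, b):.2f}")
```

The appeal of the approach is that sketches have a fixed size, so candidate overlaps can be screened by comparing short lists of integers before any base-level alignment is attempted.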

Conclusions & Future Work
- Centromeres/telomeres remain a challenge
  - HiC fails due to low interactions
  - Bionano fails due to missing restriction sites
- Improved algorithms to combine technologies
  - Assembler output
- Assembling populations
  - Graph based assembly and formats
  - Long read polishing and phasing

Acknowledgements
- Canu: Adam Phillippy, Brian Walenz
- MHAP: Konstantin Berlin
- Goat: Derek M. Bickhart, Adam M Phillippy, Timothy P.L. Smith, Shawn T. Sullivan, Ivan Liachko, Joshua N. Burton, Maitreya J. Dunham, Jay Shendure, Alex R. Hastie, Brian L. Sayre, Heather J Huson, George E. Liu, Benjamin D. Rosen, Steven G. Schroeder, Curtis P. VanTassell, Tad S. Sonstegard
- Dermatofibrosarcoma: Alejandro Gutierrez, Sarah Morton, Mike Schatz, Maria Nattestad, Fritz Sedlazeck
- A. gambiae: Andrew Hall, Philippos-Aris Papathanos, Atashi Sharma, Changde Cheng, Omar Akbari, Lauren Assour, Nicholas Bergman, Alessia Cagnetti, Andrea Crisanti, Tania Dottorini, Elisa Fiorentini, Roberto Galizi, Jonathan Hnath, Xiaofang Jiang, Tony Nolan, Diana Radune, Maria Sharakhova, Aaron Steele, Vladimir A. Timoshevskiy, Nikolai Windbichler, Simo Zhangl, Matthew W. Hahn, Scott J. Emrich, Igor V. Sharakhov, Zhijian Tu, Nora J. Besansky
- D. melanogaster: Casey Bergman, Sue Celniker, Jason Chin, Jane Landolin
- NHGRI
Postdocs wanted!
- Genome Informatics Section
- Assembly
- Structural variation
- Infectious disease
- Undiagnosed disease
- http://www.genome.gov/27563366
/MarBL

PUBLIC DOMAIN NOTICE
This presentation is "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties for the United States Government and thus cannot be copyrighted. This presentation is freely available to the public for use without a copyright notice. Restrictions cannot be placed on its present or future use. Although all reasonable efforts have been taken to ensure the accuracy and reliability of the presentation and associated data, the National Human Genome Research Institute (NHGRI), National Institutes of Health (NIH) and the U.S. Government do not and cannot warrant the performance or results that may be obtained based on this presentation or data. NHGRI, NIH and the U.S. Government disclaim all warranties as to performance, merchantability or fitness for any particular purpose. Please cite the authors in any work or product based on this material.
