
SMRT Informatics Developers Conference 2016 Keynote

Sergey Koren
January 13, 2016

Single molecule assembly for genomes large and small.


Transcript

  1. Theorem: Long reads solve everything
    Proof:
    - Everything is a metagenome
    - Assembly solves metagenomics
    - Long reads solve assembly
    - Long reads solve everything*
    *This is hyperbole
  2. Short read assembly is hard
    - Want to find an optimal tour
        - Consistent with the original fragments
        - Fragment arrival rate is consistent with a Poisson process
        - Read-pair distance and orientation are preserved, etc.
    - Many possible paths
        - Use a variety of heuristics to find a good path
        - Output only the high-confidence paths through the graph (contigs)
    - Hard to assemble, easier to validate
        - Run a bunch and pick the best!
    Figure credit: Mark Chaisson
    Automated ensemble assembly and validation of microbial genomes. Koren et al. (2014) BMC Bioinformatics
  3. Why is assembly hard?
    - Repeats cause tangles and cycles in the graph
        - A randomly generated sequence is trivial to assemble
        - Repeats shorter than the read length don’t matter
    Occurrences of each phrase, with the analogous repeat class:
        “It”: >1,000 (SSR)
        “It was”: 320 (TE)
        “It was the best”: 2 (SegDup)
        “It was the best of times”: 1 (Unique)
        “With his hands in his pockets”: 3 (Meta)
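
The claim that repeats shorter than the read length do not matter can be illustrated with a toy sketch. Everything here is an assumption made for illustration (the made-up sequence `genome`, the helper names `count_occurrences` and `ambiguity_by_length`); none of it comes from the slide. It counts how many places a read could have come from as the read grows past a short repeat.

```python
def count_occurrences(text: str, query: str) -> int:
    """Count (possibly overlapping) occurrences of query in text."""
    count = 0
    start = text.find(query)
    while start != -1:
        count += 1
        start = text.find(query, start + 1)
    return count


def ambiguity_by_length(genome: str, start: int, lengths: list[int]) -> dict[int, int]:
    """For a read beginning at `start`, map each read length to the number
    of positions in the genome where that read occurs."""
    return {n: count_occurrences(genome, genome[start:start + n]) for n in lengths}


# Hypothetical toy genome: the 4-base repeat "ACGT" occurs three times,
# each time followed by different flanking sequence.
genome = "ACGTTTGACCACGTGGATCAACGTCCATAG"

# A 4-base read is ambiguous; once the read spans the repeat and reaches
# into the flank, it has a single possible origin.
profile = ambiguity_by_length(genome, 0, [4, 5, 8])
print(profile)
```

The same effect is what the phrase counts on the slide show: the longer the phrase, the fewer places it can occur.
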
  4. How long are microbial repeats?
    Golden Threshold
    Reducing assembly complexity of microbial genomes with single-molecule sequencing. Koren et al. (2013) Genome Biology
  5. How long do reads need to be?
    - k = 50
    - k = 1,000
    - k = 7,000 (Golden Threshold)
    - P5C3 (84% > 7Kbp)
    One chromosome, one contig. Koren and Phillippy (2015) Curr Opin Microbiol
  6. Long read assembly
    - Hybrid error correction and de novo assembly of single-molecule sequencing reads. Koren et al. (2012) Nature Biotechnology
    - Reducing assembly complexity of microbial genomes with single-molecule sequencing. Koren et al. (2013) Genome Biology
    - Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing. Berlin et al. (2015) Nature Biotechnology
    With Canu, 25x of PacBio P6C4 achieves:
    - > 90% of bacteria assemble without gaps
    - > QV40 (99.99%) consensus accuracy
    - < 15 minutes of compute
    - < $1,000 total cost
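
The QV40 (99.99%) figure follows from the standard Phred scale, where QV = -10 * log10(error probability). A minimal sketch of that conversion is below; the function names are illustrative and not taken from Canu or any other tool.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to per-base accuracy."""
    return 1.0 - 10 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert per-base accuracy (strictly below 1.0) to a Phred-scaled quality value."""
    return -10.0 * math.log10(1.0 - accuracy)


def expected_errors(qv: float, genome_length: int) -> float:
    """Expected number of consensus errors in a genome of the given length."""
    return genome_length * 10 ** (-qv / 10.0)


print(f"QV40 accuracy: {qv_to_accuracy(40):.4%}")
print(f"Expected errors per Mbp at QV40: {expected_errors(40, 1_000_000):.0f}")
```

At QV40 the error probability is 1 in 10,000, so a 5 Mbp bacterial genome would be expected to carry on the order of 500 consensus errors.
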
  7. D. melanogaster Assembly
    - Improvements in:
        - Y heterochromatin
        - Telomeres, centromeres
        - TE and repeat resolution
    - New biology discovered!
    - Complete assembly of Chromosome 3L
  8. Birth of a new gene on the Y chromosome of Drosophila melanogaster. Carvalho et al. (2015)
    “Finally, we emphasize the utility of PacBio technology in dealing with difficult genomic regions: as was the case with the Mst77Y region, [MHAP] produced a seemingly error-free assembly of the FDY region, something that has eluded us for years of hard work.”
    “Here we describe flagrante delicto Y (FDY), a very young gene that shows how Y-linked genes were acquired. FDY originated 2 million years ago from a duplication of a contiguous autosomal segment of 11 kb containing five genes that inserted into the Y chromosome. Four of these autosome-to-Y gene copies became inactivated (“pseudogenes”), lost part of their sequences, and most likely will disappear in the next few million years. FDY, originally a female-biased gene, acquired testis expression and remained functional.”
  9. Human genome assembly solved?
    [Figure: CHM1 Canu assembly, chromosomes 1-22 and X]
  10. Building a better reference
    [Figure. Axes: #NA12878 - #CHM13:125 hets (×10^6); #CHM13:125 heterozygous SNPs (×10^3). Assemblies shown: CHM13-0983455, CHM13-0983465, CHM13-0983475, CHM13-1015355, CHM13-1015385, CHM1-0772585, CHM1-1.1, CHM1-1007805, hs37m, hs38, huref]
    Heng Li, personal communication
  11. Genetic variation and the de novo assembly of human genomes
    Chaisson, Wilson, Eichler (2015) emphasize the importance of complete de novo assembly, as opposed to read mapping, as the primary means to understanding the full range of human genetic variation.
  12. Bridging the gap
    - Bionano Genomics (from http://www.bionanogenomics.com/technology/irys-technology/)
    - HiC: Crosslink → Fragment → Proximity Ligation → Sequence Junctions (adapted from Lieberman-Aiden et al., Science, 2009)
  13. Hi-C can be used for 3D modeling and scaffolding of genome assemblies
    - Hi-C: Crosslink → Fragment → Proximity Ligation → Sequence Junctions
    - 3D model of genome: Duan et al., Nature, 2010
    - Genome scaffolding: Burton et al., Nature Biotech, 2013 [Figure: chromosome vs. chromosome, 1-9; from I. Liachko]
    - Haplotype phasing: Selvaraj et al., Nature Biotech, 2013
  14. Goat PacBio+BioNano vs RefV2

    Contigs (Ctgs)     RefV2         PacBio       PacBio + BioNano   PacBio + BioNano + HiC
    # bp               2.72Gbp       2.63Gbp      2.62Gbp            2.63Gbp
    # ctg              173,141       3,096        2,349              1,522
    Max                679,126       35,623,478   57,511,119         66,489,255
    N50                73,533        4,473,169    12,102,227         23,340,314

    Scaffolds (Scfs)   RefV2         PacBio       PacBio + BioNano   PacBio + BioNano + HiC
    # bp               2.69Gbp       2.63Gbp      2.62Gbp            2.63Gbp
    # scf              30            3,096        2,084              525
    Max                161,917,960   35,623,478   66,727,870         157,517,791
    N50                103,731,018   4,473,169    14,265,070         91,787,174
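
The N50 rows above use the usual definition: the length L such that contigs (or scaffolds) of length at least L contain at least half of the assembled bases. A minimal sketch of that calculation follows; the function name `n50` and the example lengths are illustrative and are not the goat data.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of sequence lengths.

    Sort lengths from longest to shortest and accumulate until the running
    total reaches at least half of the total assembly size; the length at
    which that happens is the N50.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 requires at least one positive length")


# Hypothetical contig lengths (not from the slide): total is 290, half is 145.
# The two longest contigs (80 + 70 = 150) already cover half the assembly.
example = [80, 70, 50, 40, 30, 20]
print(n50(example))
```

Because N50 depends only on the length distribution, it rises sharply when a few long scaffolds replace many short contigs, which is the pattern across the columns of the table.
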
  15. Canu
    - Overlap: MHAP (MinHash alignment)
    - Correct: PBCR (Corrected reads)
    - Trim: OBT (Overlap based trimming)
    - Assemble: BOGART (Best overlap graph)
    - Consensus: UTGCNS (Refactored, >100X faster)
    - Polish: Quiver / Nanopolish
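
The overlap stage relies on MinHash to find reads that probably share sequence without aligning every pair. The sketch below shows the generic MinHash idea only: it estimates the Jaccard similarity of two reads' k-mer sets from fixed-size sketches. It is not MHAP's actual implementation, and the function names, k-mer size, and hash choice are assumptions made for illustration.

```python
import hashlib


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer: str, seed: int) -> int:
    """Seeded 64-bit hash of a k-mer (BLAKE2b chosen for determinism)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq: str, k: int = 8, num_hashes: int = 128) -> list[int]:
    """For each of num_hashes seeded hash functions, keep the minimum hash
    value over all k-mers of seq."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a: list[int], sketch_b: list[int]) -> float:
    """The fraction of sketch positions that agree estimates the Jaccard
    similarity of the underlying k-mer sets."""
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


# Hypothetical reads: read_b overlaps the second half of read_a.
read_a = "ACGTTGCAAGCTTAGGCTAACCGTTAGCATCGGATCCA"
read_b = "GGCTAACCGTTAGCATCGGATCCATTGACGGTACCTAG"
similarity = estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b))
print(f"Estimated Jaccard similarity: {similarity:.2f}")
```

Comparing two fixed-size sketches costs the same regardless of read length, which is why this kind of filter is attractive for long, noisy reads; pairs that score above a threshold would then go on to a more careful overlap check.
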
  16. Conclusions & Future Work
    - Centromeres/telomeres remain a challenge
        - HiC fails due to low interactions
        - Bionano fails due to missing restriction sites
    - Improved algorithms to combine technologies
        - Assembler output
    - Assembling populations
        - Graph based assembly and formats
        - Long read polishing and phasing
  17. Acknowledgements
    - Canu: Adam Phillippy, Brian Walenz
    - MHAP: Konstantin Berlin
    - Goat: Derek M. Bickhart, Adam M. Phillippy, Timothy P.L. Smith, Shawn T. Sullivan, Ivan Liachko, Joshua N. Burton, Maitreya J. Dunham, Jay Shendure, Alex R. Hastie, Brian L. Sayre, Heather J. Huson, George E. Liu, Benjamin D. Rosen, Steven G. Schroeder, Curtis P. VanTassell, Tad S. Sonstegard
    - Dermatofibrosarcoma: Alejandro Gutierrez, Sarah Morton, Mike Schatz, Maria Nattestad, Fritz Sedlazeck
    - A. gambiae: Andrew Hall, Philippos-Aris Papathanos, Atashi Sharma, Changde Cheng, Omar Akbari, Lauren Assour, Nicholas Bergman, Alessia Cagnetti, Andrea Crisanti, Tania Dottorini, Elisa Fiorentini, Roberto Galizi, Jonathan Hnath, Xiaofang Jiang, Tony Nolan, Diana Radune, Maria Sharakhova, Aaron Steele, Vladimir A. Timoshevskiy, Nikolai Windbichler, Simo Zhang, Matthew W. Hahn, Scott J. Emrich, Igor V. Sharakhov, Zhijian Tu, Nora J. Besansky
    - D. melanogaster: Casey Bergman, Sue Celniker, Jason Chin, Jane Landolin
    - NHGRI
    - Postdocs wanted!
        - Genome Informatics Section
        - Assembly
        - Structural variation
        - Infectious disease
        - Undiagnosed disease
        - http://www.genome.gov/27563366
        - /MarBL
  18. PUBLIC DOMAIN NOTICE
    This presentation is "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties for the United States Government and thus cannot be copyrighted. This presentation is freely available to the public for use without a copyright notice. Restrictions cannot be placed on its present or future use.
    Although all reasonable efforts have been taken to ensure the accuracy and reliability of the presentation and associated data, the National Human Genome Research Institute (NHGRI), National Institutes of Health (NIH) and the U.S. Government do not and cannot warrant the performance or results that may be obtained based on this presentation or data. NHGRI, NIH and the U.S. Government disclaim all warranties as to performance, merchantability or fitness for any particular purpose.
    Please cite the authors in any work or product based on this material.