SMRT Informatics Developers Conference 2016 Keynote

Sergey Koren
January 13, 2016

Single molecule assembly for genomes large and small.

Transcript

  1. Sergey Koren, @sergekoren
    Genome Informatics Section, NHGRI
    Single-molecule assembly for genomes large and small

  2. Theorem: Long reads solve everything
    Proof
    Everything is a metagenome
    Assembly solves metagenomics
    Long reads solve assembly
    Long reads solve everything *
    *This is hyperbole

  3. Short read assembly is hard
    - Want to find an optimal tour
      - Consistent with the original fragments
      - Fragment arrival rate is consistent with a Poisson process
      - Read-pair distance and orientation are preserved, etc.
    - Many possible paths
      - Use a variety of heuristics to find a good path
      - Output only the high-confidence paths through the graph (contigs)
    - Hard to assemble, easier to validate
      - Run a bunch and pick the best!
    Mark Chaisson
    Automated ensemble assembly and validation of microbial genomes.
    Koren et al. (2014) BMC Bioinformatics
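
The Poisson constraint on this slide can be illustrated with a small calculation. This is a minimal sketch, assuming an idealized model in which the number of reads covering any position is Poisson-distributed with mean equal to the average coverage; the function names are hypothetical and are not taken from any assembler.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of observing exactly k events under a Poisson distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of positions covered by zero reads (idealized model)."""
    return poisson_pmf(0, coverage)


def expected_uncovered_bases(genome_size: int, coverage: float) -> float:
    """Expected number of positions with no read coverage at all."""
    return genome_size * fraction_uncovered(coverage)


if __name__ == "__main__":
    # Illustrative coverage values only; real data deviate from the ideal model.
    for c in (1, 5, 10, 25):
        print(f"{c:>2}x coverage: {fraction_uncovered(c):.3e} of positions uncovered")
```

An assembly whose read pile-ups depart strongly from this expectation (for example, a region with twice the expected depth) is a candidate collapsed repeat, which is one way the Poisson assumption helps with validation.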

  4. Why is assembly hard?
    - Repeats cause tangles and cycles in the graph
      - A randomly generated sequence is trivial to assemble
      - Repeats shorter than the read length don’t matter
    “It”                               >1,000   SSR
    “It was”                           320      TE
    “It was the best”                  2        SegDup
    “It was the best of times”         1        Unique
    “With his hands in his pockets”    3        Meta
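
The claim that repeats shorter than the read length do not matter can be checked on a toy text. This is a minimal sketch, assuming a short made-up sentence in place of a genome and a brute-force substring count in place of a real assembly graph; `repeated_kmers`, `longest_repeat`, and `TOY` are hypothetical names.

```python
from collections import Counter


def repeated_kmers(sequence: str, k: int) -> int:
    """Count distinct substrings of length k that occur more than once."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return sum(1 for n in counts.values() if n > 1)


def longest_repeat(sequence: str) -> int:
    """Length of the longest substring that occurs at least twice (brute force)."""
    for k in range(len(sequence) - 1, 0, -1):
        if repeated_kmers(sequence, k) > 0:
            return k
    return 0


# Toy stand-in for a genome; not real sequence data.
TOY = "it was the best of times, it was the worst of times"

if __name__ == "__main__":
    limit = longest_repeat(TOY)
    for k in (2, 5, limit, limit + 1):
        print(f"k = {k:>2}: {repeated_kmers(TOY, k)} repeated substrings")
```

Once `k` exceeds the longest repeat, every window of the text is unique, so a read of that length can only be placed in one location.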

  5. How long are microbial repeats?
    Golden Threshold
    Reducing assembly complexity of microbial genomes with single-molecule sequencing.
    Koren et al. (2013) Genome Biology

  6. How long do reads need to be?
    k = 50    k = 1,000    k = 7,000
    Golden Threshold
    P5C3 (84% > 7 kbp)
    One chromosome, one contig. Koren and Phillippy (2015) Curr Opin Microbiol

  7. Long read assembly
    Hybrid error correction and de novo assembly of single-molecule sequencing reads
    Koren et al. (2012) Nature Biotechnology
    Reducing assembly complexity of microbial genomes with single-molecule sequencing
    Koren et al. (2013) Genome Biology
    Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing
    Berlin et al. (2015) Nature Biotechnology
    With Canu:
    25x of PacBio P6C4 achieves:
    > 90% of bacteria assemble without gaps
    > QV40 (99.99%) consensus accuracy
    < 15 minutes of compute
    < $1,000 total cost
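
The QV40 (99.99%) pairing on this slide follows from the standard Phred scale, QV = -10 * log10(error rate). The sketch below shows the conversion in both directions; the function names and the 5 Mbp genome size in the usage example are illustrative assumptions, not values from the talk.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to per-base accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert per-base accuracy (0 <= accuracy < 1) to a Phred-scaled quality value."""
    error = 1.0 - accuracy
    if error <= 0.0:
        raise ValueError("accuracy must be below 1.0 for a finite QV")
    return -10.0 * math.log10(error)


def expected_errors(genome_size: int, qv: float) -> float:
    """Expected number of consensus errors in a genome at a given QV."""
    return genome_size * 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    print(f"QV40 accuracy: {qv_to_accuracy(40):.4%}")
    # Hypothetical 5 Mbp bacterial genome, chosen only for illustration.
    print(f"Expected errors at QV40: {expected_errors(5_000_000, 40):.0f}")
```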

  8. Eukaryotic assembly
    Error correction and assembly complexity of single molecule sequencing reads.
    Lee et al. (2014) bioRxiv

  9. D. melanogaster Assembly
    Improvements in:
    Y heterochromatin
    Telomeres, centromeres
    TE and repeat resolution
    New biology discovered!
    Complete assembly of Chromosome 3L

  10. Drosophila melanogaster Y
    Casey Bergman, Sue Celniker, Jason Chin, Jane Landolin Tracey Chapman

  11. Birth of a new gene on the Y chromosome of Drosophila melanogaster
    Carvalho et al. (2015)
    Finally, we emphasize the utility of PacBio technology in dealing with
    difficult genomic regions: as was the case with the Mst77Y region,
    [MHAP] produced a seemingly error-free assembly of the FDY region,
    something that has eluded us for years of hard work.
    Here we describe flagrante delicto Y (FDY), a very young gene that
    shows how Y-linked genes were acquired. FDY originated 2 million
    years ago from a duplication of a contiguous autosomal segment of
    11 kb containing five genes that inserted into the Y chromosome.
    Four of these autosome-to-Y gene copies became inactivated
    (“pseudogenes”), lost part of their sequences, and most likely will
    disappear in the next few million years. FDY, originally a female-biased
    gene, acquired testis expression and remained functional.

  12. Anopheles gambiae Y
    Nora Besansky, ND
    Sam Cotton
    Igor Sharakhov

  13. Human genome assembly solved?
    CHM1 Canu
    [Figure: chromosomes 1-22 and X]

  14. Building a better reference
    [Figure: plot comparing assemblies and references]
    Axes: #NA12878 - #CHM13:125 hets (×10^6); #CHM13:125 heterozygous SNPs (×10^3)
    Points: CHM13-0983455, CHM13-0983465, CHM13-0983475, CHM13-1015355, CHM13-1015385,
    CHM1-0772585, CHM1-1.1, CHM1-1007805, hs37m, hs38, huref
    Heng Li, Personal Communication

  15. Genetic variation and the de novo assembly of human genomes
    Chaisson, Wilson, Eichler (2015)
    emphasize the importance of complete de novo assembly as opposed to read
    mapping as the primary means to understanding the full range of human
    genetic variation.

  16. Structural Analysis in Cancers
    Alejandro Gutierrez & Sarah Morton
    Mike Schatz & Maria Nattestad

  17. Contigs ≠ Genome

  18. Bridging the gap
    Bionano Genomics
    From http://www.bionanogenomics.com/technology/irys-technology/
    HiC
    Crosslink, Fragment, Proximity Ligation, Sequence Junctions
    adapted from Lieberman-Aiden et al. Science, 2009

  19. Hi-C can be used for 3D modeling and scaffolding of genome assemblies
    Crosslink, Fragment, Proximity Ligation, Sequence Junctions
    3D model of genome
    Duan et al. Nature, 2010
    Genome scaffolding
    Burton et al. Nature Biotech, 2013
    [Figure: chromosome vs. chromosome plot, both axes labeled 1-9] From I. Liachko
    Haplotype phasing
    Selvaraj et al. Nature Biotech, 2013

  20. Goat PacBio+BioNano vs RefV2
                 RefV2         PacBio       PacBio + BioNano   PacBio + BioNano + HiC
    Ctgs
      # bp       2.72Gbp       2.63Gbp      2.62Gbp            2.63Gbp
      # ctg      173,141       3,096        2,349              1,522
      Max        679,126       35,623,478   57,511,119         66,489,255
      N50        73,533        4,473,169    12,102,227         23,340,314
    Scfs
      # bp       2.69Gbp       2.63Gbp      2.62Gbp            2.63Gbp
      # scf      30            3,096        2,084              525
      Max        161,917,960   35,623,478   66,727,870         157,517,791
      N50        103,731,018   4,473,169    14,265,070         91,787,174
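
For readers unfamiliar with the N50 rows above: N50 is the length L such that contigs (or scaffolds) of length L or longer contain at least half of the assembled bases. A minimal sketch of that calculation follows; the `n50` function name and the toy lengths are assumptions for illustration and are not the goat assembly data.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first length where the running sum reaches half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    toy_contigs = [2, 3, 4, 5, 6]  # toy lengths, not real assembly data
    print("N50:", n50(toy_contigs))
```

Because N50 depends only on the length distribution, it rewards joining pieces whether or not the joins are correct, which is why the table also reports counts and maximum lengths.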

  21. PacBio+BioNano+HiC to reference
    [Figure panels: chr1, chr8, chrX, chr28]

  22. Continuing speedups
    D. melanogaster
    H. sapiens
    ?

  23. Canu
    Overlap: MHAP (MinHash alignment)
    Correct: PBCR (corrected reads)
    Trim: OBT (overlap-based trimming)
    Assemble: BOGART (best overlap graph)
    Consensus: UTGCNS (refactored, >100X faster)
    Polish: Quiver / Nanopolish
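
The overlap stage relies on MinHash, which estimates how similar two k-mer sets are from a small fixed-size sketch instead of comparing the sets directly. The code below is a generic sketch of that idea only; it is not MHAP's implementation, and the function names, the SHA-1 based seeded hash, and the parameter values are assumptions made for illustration.

```python
import hashlib


def kmers(seq: str, k: int) -> set:
    """Return the set of all substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def seeded_hash(kmer: str, seed: int) -> int:
    """Deterministic 64-bit hash of a k-mer; each seed acts as a separate hash function."""
    digest = hashlib.sha1(f"{seed}:{kmer}".encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_sketch(seq: str, k: int = 4, num_hashes: int = 64) -> list:
    """Keep the minimum hash value per hash function over all k-mers."""
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    return [min(seeded_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a: list, sketch_b: list) -> float:
    """Fraction of matching sketch positions, an estimate of Jaccard similarity."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must use the same number of hash functions")
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


def exact_jaccard(seq_a: str, seq_b: str, k: int = 4) -> float:
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a | b:
        raise ValueError("both sequences are shorter than k")
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    read_a = "ACGTACGTTGCAACGGTCA"  # toy reads, not real data
    read_b = "CGTACGTTGCAACGGTCAT"
    estimate = estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b))
    print(f"estimated: {estimate:.2f}, exact: {exact_jaccard(read_a, read_b):.2f}")
```

The sketch size stays fixed regardless of read length, which is what makes all-versus-all comparison of long reads tractable; more hash functions give a tighter estimate at higher cost.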

  24. Conclusions & Future Work
    - Centromeres/telomeres remain a challenge
      - HiC fails due to low interactions
      - Bionano fails due to missing restriction sites
    - Improved algorithms to combine technologies
      - Assembler output
    - Assembling populations
      - Graph-based assembly and formats
      - Long read polishing and phasing

  25. Acknowledgements
    - Canu
      - Adam Phillippy
      - Brian Walenz
    - MHAP
      - Konstantin Berlin
    - Goat
      - Derek M. Bickhart
      - Adam M. Phillippy
      - Timothy P.L. Smith
      - Shawn T. Sullivan
      - Ivan Liachko
      - Joshua N. Burton
      - Maitreya J. Dunham
      - Jay Shendure
      - Alex R. Hastie
      - Brian L. Sayre
      - Heather J. Huson
      - George E. Liu
      - Benjamin D. Rosen
      - Steven G. Schroeder
      - Curtis P. Van Tassell
      - Tad S. Sonstegard
    - Dermatofibrosarcoma
      - Alejandro Gutierrez
      - Sarah Morton
      - Mike Schatz
      - Maria Nattestad
      - Fritz Sedlazeck
    - A. gambiae
      - Andrew Hall
      - Philippos-Aris Papathanos
      - Atashi Sharma
      - Changde Cheng
      - Omar Akbari
      - Lauren Assour
      - Nicholas Bergman
      - Alessia Cagnetti
      - Andrea Crisanti
      - Tania Dottorini
      - Elisa Fiorentini
      - Roberto Galizi
      - Jonathan Hnath
      - Xiaofang Jiang
      - Tony Nolan
      - Diana Radune
      - Maria Sharakhova
      - Aaron Steele
      - Vladimir A. Timoshevskiy
      - Nikolai Windbichler
      - Simo Zhang
      - Matthew W. Hahn
      - Scott J. Emrich
      - Igor V. Sharakhov
      - Zhijian Tu
      - Nora J. Besansky
    - D. melanogaster
      - Casey Bergman
      - Sue Celniker
      - Jason Chin
      - Jane Landolin
    - NHGRI
    - Postdocs wanted!
      - Genome Informatics Section
      - Assembly
      - Structural variation
      - Infectious disease
      - Undiagnosed disease
      - http://www.genome.gov/27563366
    /MarBL

  26. PUBLIC DOMAIN NOTICE
    This presentation is "United States Government Work" under the
    terms of the United States Copyright Act. It was written as part of
    the authors' official duties for the United States Government and
    thus cannot be copyrighted. This presentation is freely available to
    the public for use without a copyright notice. Restrictions cannot
    be placed on its present or future use.
    Although all reasonable efforts have been taken to ensure the
    accuracy and reliability of the presentation and associated data,
    the National Human Genome Research Institute (NHGRI),
    National Institutes of Health (NIH) and the U.S. Government do
    not and cannot warrant the performance or results that may be
    obtained based on this presentation or data. NHGRI, NIH and the
    U.S. Government disclaim all warranties as to performance,
    merchantability or fitness for any particular purpose. Please cite
    the authors in any work or product based on this material.
