PAG XXIV: From Sequencing to Chromosomes: New de novo Assembly and Scaffolding Methods Improve the Goat Reference Genome

Sergey Koren
January 11, 2016

Single-molecule sequencing is now routinely used to assemble complete, high-quality microbial genomes, but these assembly methods have not scaled well to large genomes. To address this problem, we previously introduced the MinHash Alignment Process (MHAP) for overlapping single-molecule reads using probabilistic, locality-sensitive hashing. Integrating MHAP with Celera Assembler (CA) has enabled reference-grade assemblies of model organisms, revealing novel heterochromatic sequences and filling low-complexity gap sequences in the GRCh38 human reference genome. We have applied our methods to assemble the San Clemente goat genome. Combining single-molecule sequencing from Pacific Biosciences with BioNano Genomics mapping generates an assembly that is over 150-fold more contiguous than the latest Capra hircus reference. In combination with Hi-C sequencing, the de novo assembly surpasses reference assemblies with minimal manual intervention. The autosomes are each assembled into a single scaffold. Our assembly provides a more complete gene reconstruction, better alignments with the Goat 52k chip, and improved allosome reconstruction. In addition to providing increased continuity of sequence, our assembly achieves a higher BUSCO completion score (84%) than the existing goat reference assembly, suggesting better-quality annotation of gene models. Our results demonstrate that single-molecule sequencing can produce near-complete eukaryotic genomes at modest cost and with minimal manual effort.
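
To make the MinHash idea behind MHAP concrete, here is a minimal Python sketch that estimates how similar the k-mer sets of two reads are. It is not MHAP's implementation: the function names, the k-mer size, the number of hash functions, and the use of salted BLAKE2b as the hash family are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Hash a k-mer to a 64-bit integer; the seed selects the hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=4, num_hashes=64):
    """Keep the minimum hash value per hash function (assumes len(seq) >= k)."""
    kmer_set = kmers(seq, k)
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of matching minima, an estimate of k-mer set Jaccard similarity."""
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)
```

Two reads that share most of their k-mers agree at most sketch positions, so candidate overlaps can be found by comparing short sketches rather than aligning every pair of reads.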

Transcript

  1. From sequencing to chromosomes: new de
    novo assembly and scaffolding methods
    improve the goat reference genome
    Sergey Koren, @sergekoren
    Genome Informatics Section, NHGRI

  2. Long read assembly
    Hybrid error correction and de novo assembly of single-molecule sequencing reads
      Koren et al. (2012) Nature Biotechnology
    Reducing assembly complexity of microbial genomes with single-molecule sequencing
      Koren et al. (2013) Genome Biology
    Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing
      Berlin et al. (2015) Nature Biotechnology
    With Canu, 25x of PacBio P6C4 achieves:
      > 90% of bacteria assemble without gaps
      > QV40 (99.99%) consensus accuracy
      < 15 minutes of compute
      < $1,000 total cost
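
The pairing of QV40 with 99.99% accuracy on this slide follows from the standard Phred scale, where QV = -10 * log10(error probability). A small Python sketch of the conversion is given below; the function names are hypothetical and only illustrate the arithmetic.

```python
import math


def qv_to_accuracy(qv):
    """Convert a Phred-scaled quality value to per-base accuracy."""
    return 1.0 - 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)
```

On this scale QV40 corresponds to one expected error per 10,000 bases, which is the 99.99% quoted on the slide.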

  3. Human genome assembly solved?
    CHM1 Canu
    [Figure: chromosomes 1-22 and X]

  4. Contigs ≠ Genome

  5. Goat PacBio vs RefV2
                      PacBio          RefV2
    Contigs
      # bp           2.63Gbp        2.72Gbp
      # ctg            3,096        173,141
      Max (bp)    35,623,478        679,126
      N50 (bp)     4,473,169         73,533
    Scaffolds
      # bp           2.63Gbp        2.69Gbp
      # scf            3,096             30
      Max (bp)    35,623,478    161,917,960
      N50 (bp)     4,473,169    103,731,018
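
The N50 rows in this comparison use the usual contiguity statistic: the length L such that sequences of length L or longer contain at least half of the assembled bases. A minimal Python sketch, assuming the input is simply a list of contig or scaffold lengths (the function name is illustrative):

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("n50 requires at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
```

For example, for lengths 2, 3, 4, 5 and 6 (20 bases in total), the two longest sequences already cover 11 bases, so the N50 is 5.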

  6. Annotation comparison
    • CHIR 1.0 annotation LiftOver split mappings
      • Example split gene
    • Comparative mapping with sheep

    O. aries exons    Exclusive PacBio   Exclusive BGI    Shared
    Total Mappings               9,534             564   225,365
    Unmapped                       564           9,534     9,140
    Split Exons                    592           1,930     1,528

    [Figure labels: CHIR 1.0, Sheep, Antelope, Cow]

  7. Irys® Workflow - Overview
    Irys Workflow for Genome Mapping
    Isolate High Molecular Weight DNA
    Label Specific Sequences Across the Entire Genome
    Transfer Labeled DNA into Cartridge for Scanning
    Load, Linearize & Image Labeled DNA in Repeated Cycling to Scan Whole Genome
    High Throughput, High Resolution Imaging Gives Contiguous Molecules up to Mb Length
    Algorithms Convert Images into Molecules
    Assembly Algorithms Align Molecules de novo for Constructing Consensus Genome Maps
    Cross-Mapping Across Multiple Samples or to a Reference
    •  Automated SV Detection
    •  Gap Sizing
    •  Genome Finishing
    [Figure labels: Insertion, Customer Sample]
    © 2015 BioNano Genomics

  8. Goat PacBio+BioNano vs RefV2
                      PacBio  PacBio + BioNano          RefV2
    Contigs
      # bp           2.63Gbp           2.62Gbp        2.72Gbp
      # ctg            3,096             2,349        173,141
      Max (bp)    35,623,478        57,511,119        679,126
      N50 (bp)     4,473,169        12,102,227         73,533
    Scaffolds
      # bp           2.63Gbp           2.62Gbp        2.69Gbp
      # scf            3,096             2,084             30
      Max (bp)    35,623,478        66,727,870    161,917,960
      N50 (bp)     4,473,169        14,265,070    103,731,018

  9. PacBio+BioNano to reference
    [Figure panels: chr1, chr8, chrX, chr28]

  10. Hi-C can be used for 3D modeling and scaffolding of genome assemblies
    Crosslink, Fragment, Proximity Ligation, Sequence Junctions
    •  3D model of genome: Duan, et al. Nature, 2010
    •  Genome scaffolding: Burton, et al. Nature Biotech, 2013
    •  Haplotype phasing: Selvaraj, et al. Nature Biotech, 2013
    [Figure: matrix with chromosomes 1-9 on both axes. From I. Liachko]

  11. Goat PacBio+BioNano vs RefV2
                      PacBio  PacBio + BioNano  PacBio + BioNano + HiC          RefV2
    Contigs
      # bp           2.63Gbp           2.62Gbp                 2.63Gbp        2.72Gbp
      # ctg            3,096             2,349                   1,522        173,141
      Max (bp)    35,623,478        57,511,119              66,489,255        679,126
      N50 (bp)     4,473,169        12,102,227              23,340,314         73,533
    Scaffolds
      # bp           2.63Gbp           2.62Gbp                 2.63Gbp        2.69Gbp
      # scf            3,096             2,084                     525             30
      Max (bp)    35,623,478        66,727,870             157,517,791    161,917,960
      N50 (bp)     4,473,169        14,265,070              91,787,174    103,731,018
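
The abstract's "over 150-fold more contiguous" claim can be checked against the contig N50 values in this table. The short Python sketch below divides each assembly's contig N50 by the RefV2 contig N50; the dictionary and function names are illustrative, and the numbers are copied from the slide.

```python
# Contig N50 values in bp, taken from the comparison table on this slide.
CONTIG_N50 = {
    "PacBio": 4_473_169,
    "PacBio + BioNano": 12_102_227,
    "PacBio + BioNano + HiC": 23_340_314,
    "RefV2": 73_533,
}


def fold_improvement(n50_by_assembly, reference="RefV2"):
    """Return each assembly's contig N50 as a multiple of the reference N50."""
    ref = n50_by_assembly[reference]
    return {
        name: value / ref
        for name, value in n50_by_assembly.items()
        if name != reference
    }


if __name__ == "__main__":
    for name, fold in fold_improvement(CONTIG_N50).items():
        print(f"{name}: {fold:.1f}x the RefV2 contig N50")
```

With these inputs the PacBio + BioNano assembly comes out at roughly 165 times the RefV2 contig N50, consistent with the "over 150-fold" statement.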

  12. PacBio+BioNano+HiC to reference
    [Figure panels: chr1, chr8, chrX, chr28]

  13. PacBio+BioNano+HiC+curation
    [Figure panels: chr1, chr8, chrX, chr28]

  14. Acknowledgements
    •  Canu
    •  Adam Phillippy
    •  Brian Walenz
    •  Goat Project
    •  Derek M. Bickhart
    •  Adam M Phillippy
    •  Timothy P.L. Smith
    •  Shawn T. Sullivan
    •  Ivan Liachko
    •  Joshua N. Burton
    •  Maitreya J. Dunham
    •  Jay Shendure
    •  Alex R. Hastie
    •  Brian L. Sayre
    •  Heather J Huson
    •  George E. Liu
    •  Benjamin D. Rosen
    •  Steven G. Schroeder
    •  Curtis P. VanTassell
    •  Tad S. Sonstegard
    •  NHGRI
    • Postdocs wanted!
    •  Genome Informatics Section
    •  Assembly
    •  Structural variation
    •  Infectious disease
    •  Undiagnosed disease
    •  http://www.genome.gov/27563366
    /MarBL

  15. PUBLIC DOMAIN NOTICE
    This presentation is "United States Government Work" under the
    terms of the United States Copyright Act. It was written as part of
    the authors' official duties for the United States Government and
    thus cannot be copyrighted. This presentation is freely available to
    the public for use without a copyright notice. Restrictions cannot
    be placed on its present or future use.
    Although all reasonable efforts have been taken to ensure the
    accuracy and reliability of the presentation and associated data,
    the National Human Genome Research Institute (NHGRI),
    National Institutes of Health (NIH) and the U.S. Government do
    not and cannot warrant the performance or results that may be
    obtained based on this presentation or data. NHGRI, NIH and the
    U.S. Government disclaim all warranties as to performance,
    merchantability or fitness for any particular purpose. Please cite
    the authors in any work or product based on this material.
