PAG XXIV: From Sequencing to Chromosomes: New de novo Assembly and Scaffolding Methods Improve the Goat Reference Genome

Sergey Koren
January 11, 2016

Single-molecule sequencing is now routinely used to assemble complete, high-quality microbial genomes, but these assembly methods have not scaled well to large genomes. To address this problem, we previously introduced the MinHash Alignment Process (MHAP) for overlapping single-molecule reads using probabilistic, locality-sensitive hashing. Integrating MHAP with Celera Assembler (CA) has enabled reference-grade assemblies of model organisms, revealing novel heterochromatic sequences and filling low-complexity gap sequences in the GRCh38 human reference genome. We have applied our methods to assemble the San Clemente goat genome. Combining single-molecule sequencing from Pacific Biosciences with genome maps from BioNano Genomics generates an assembly that is over 150-fold more contiguous than the latest Capra hircus reference. In combination with Hi-C sequencing, the de novo assembly surpasses reference assemblies with minimal manual intervention. The autosomes are each assembled into a single scaffold. Our assembly provides a more complete gene reconstruction, better alignments with the Goat 52k chip, and improved allosome reconstruction. In addition to providing increased continuity of sequence, our assembly achieves a higher BUSCO completion score (84%) than the existing goat reference assembly, suggesting better-quality annotation of gene models. Our results demonstrate that single-molecule sequencing can produce near-complete eukaryotic genomes at modest cost and with minimal manual effort.
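
The abstract describes MHAP as finding read overlaps with probabilistic, locality-sensitive hashing. As a rough illustration of the underlying MinHash idea only, the sketch below estimates the Jaccard similarity of two reads' k-mer sets from fixed-size sketches. It is not MHAP's implementation: the k-mer size, the sketch size, the use of BLAKE2b as the hash family, and all function names are assumptions chosen for this example.

```python
import hashlib
import random


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Hash a k-mer to a 64-bit integer; each seed acts as a different hash function."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=16, num_hashes=128):
    """Keep, for each hash function, the minimum hash value over the read's k-mers."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """The fraction of matching minima estimates the Jaccard similarity of the k-mer sets."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same size")
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


def random_sequence(length, rng):
    """Generate a random DNA string (toy data for the demonstration)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


if __name__ == "__main__":
    rng = random.Random(0)
    genome = random_sequence(900, rng)
    read_a = genome[0:600]      # two error-free reads overlapping by 300 bp
    read_b = genome[300:900]
    unrelated = random_sequence(600, rng)

    sk_a, sk_b, sk_u = (minhash_sketch(r) for r in (read_a, read_b, unrelated))
    print("overlapping reads:", estimate_jaccard(sk_a, sk_b))
    print("unrelated reads:  ", estimate_jaccard(sk_a, sk_u))
```

Because only the fixed-size sketches are compared, candidate overlaps can be screened without aligning every pair of reads. Real single-molecule reads carry sequencing errors, which lowers the number of shared k-mers; the toy reads here are error-free.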

Transcript

  1. From sequencing to chromosomes: new de novo assembly and scaffolding methods improve the goat reference genome

    Sergey Koren, @sergekoren
    Genome Informatics Section, NHGRI
  2. Long read assembly

    Hybrid error correction and de novo assembly of single-molecule sequencing reads. Koren et al. (2012) Nature Biotechnology
    Reducing assembly complexity of microbial genomes with single-molecule sequencing. Koren et al. (2013) Genome Biology
    Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing. Berlin et al. (2015) Nature Biotechnology

    With Canu, 25x of PacBio P6C4 achieves:
    •  > 90% of bacteria assemble without gaps
    •  > QV40 (99.99%) consensus accuracy
    •  < 15 minutes of compute
    •  < $1,000 total cost
  3. Human genome assembly solved?

    [Figure: CHM1 Canu assembly, human chromosomes 1-22 and X]
  4. Goat PacBio vs RefV2

                        PacBio        RefV2
    Contigs (Ctgs)
      # bp              2.63Gbp       2.72Gbp
      # ctg             3,096         173,141
      Max               35,623,478    679,126
      N50               4,473,169     73,533
    Scaffolds (Scfs)
      # bp              2.63Gbp       2.69Gbp
      # scf             3,096         30
      Max               35,623,478    161,917,960
      N50               4,473,169     103,731,018
  5. Annotation comparison

    •  CHIR 1.0 annotation LiftOver split mappings
    •  Example split gene
    •  Comparative mapping with sheep O. aries exons

                      Exclusive PacBio    Exclusive BGI    Shared
    Total Mappings    9,534               564              225,365
    Unmapped          564                 9,534            9,140
    Split Exons       592                 1,930            1,528

    [Figure labels: CHIR 1.0, Sheep, Antelope, Cow]
  6. Irys® Workflow - Overview

    Irys Workflow for Genome Mapping

    •  Isolate High Molecular Weight DNA
    •  Label Specific Sequences Across the Entire Genome
    •  Transfer Labeled DNA into Cartridge for Scanning
    •  Load, Linearize & Image Labeled DNA in Repeated Cycling to Scan Whole Genome
    •  High Throughput, High Resolution Imaging Gives Contiguous Molecules up to Mb Length
    •  Algorithms Convert Images into Molecules
    •  Assembly Algorithms Align Molecules de novo for Constructing Consensus Genome Maps
    •  Cross-Mapping Across Multiple Samples or to a Reference
    •  Automated SV Detection
    •  Gap Sizing
    •  Genome Finishing

    [Figure labels: Customer Sample, Insertion]
    © 2015 BioNano Genomics
  7. Goat PacBio+BioNano vs RefV2

                        PacBio        PacBio + BioNano    RefV2
    Contigs (Ctgs)
      # bp              2.63Gbp       2.62Gbp             2.72Gbp
      # ctg             3,096         2,349               173,141
      Max               35,623,478    57,511,119          679,126
      N50               4,473,169     12,102,227          73,533
    Scaffolds (Scfs)
      # bp              2.63Gbp       2.62Gbp             2.69Gbp
      # scf             3,096         2,084               30
      Max               35,623,478    66,727,870          161,917,960
      N50               4,473,169     14,265,070          103,731,018
  8. Hi-C can be used for 3D modeling and scaffolding of genome assemblies

    •  3D model of genome: Duan, et al. Nature, 2010
    •  Genome scaffolding: Burton, et al. Nature Biotech, 2013
    •  Haplotype phasing: Selvaraj, et al. Nature Biotech, 2013

    Protocol steps: Crosslink, Fragment, Proximity Ligation, Sequence Junctions

    [Figure: chromosome-by-chromosome interaction map, chromosomes 1-9 on both axes. From I. Liachko]
  9. Goat PacBio+BioNano vs RefV2

                        PacBio        PacBio + BioNano    PacBio + BioNano + HiC    RefV2
    Contigs (Ctgs)
      # bp              2.63Gbp       2.62Gbp             2.63Gbp                   2.72Gbp
      # ctg             3,096         2,349               1,522                     173,141
      Max               35,623,478    57,511,119          66,489,255                679,126
      N50               4,473,169     12,102,227          23,340,314                73,533
    Scaffolds (Scfs)
      # bp              2.63Gbp       2.62Gbp             2.63Gbp                   2.69Gbp
      # scf             3,096         2,084               525                       30
      Max               35,623,478    66,727,870          157,517,791               161,917,960
      N50               4,473,169     14,265,070          91,787,174                103,731,018
  10. Acknowledgements

    •  Canu: Adam Phillippy, Brian Walenz
    •  Goat Project: Derek M. Bickhart, Adam M Phillippy, Timothy P.L. Smith, Shawn T. Sullivan, Ivan Liachko, Joshua N. Burton, Maitreya J. Dunham, Jay Shendure, Alex R. Hastie, Brian L. Sayre, Heather J Huson, George E. Liu, Benjamin D. Rosen, Steven G. Schroeder, Curtis P. VanTassell, Tad S. Sonstegard
    •  NHGRI
    •  Postdocs wanted!
    •  Genome Informatics Section: Assembly, Structural variation, Infectious disease, Undiagnosed disease
    •  http://www.genome.gov/27563366
    /MarBL
  11. PUBLIC DOMAIN NOTICE

    This presentation is "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties for the United States Government and thus cannot be copyrighted. This presentation is freely available to the public for use without a copyright notice. Restrictions cannot be placed on its present or future use. Although all reasonable efforts have been taken to ensure the accuracy and reliability of the presentation and associated data, the National Human Genome Research Institute (NHGRI), National Institutes of Health (NIH) and the U.S. Government do not and cannot warrant the performance or results that may be obtained based on this presentation or data. NHGRI, NIH and the U.S. Government disclaim all warranties as to performance, merchantability or fitness for any particular purpose. Please cite the authors in any work or product based on this material.
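
The contiguity tables in slides 4, 7, and 9 report N50 values for each assembly. As a reading aid, the sketch below shows the standard N50 definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total length) and the ratio of each contig N50 from the slides to the RefV2 contig N50. The function names are hypothetical and the toy lengths are not from the talk; only the four N50 values are copied from the slide tables.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total length is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Contig N50 values (bp) as reported in the slide tables.
CONTIG_N50 = {
    "PacBio": 4_473_169,
    "PacBio + BioNano": 12_102_227,
    "PacBio + BioNano + HiC": 23_340_314,
    "RefV2": 73_533,
}


def fold_over_reference(values, reference="RefV2"):
    """Divide each assembly's value by the reference assembly's value."""
    ref = values[reference]
    return {name: v / ref for name, v in values.items() if name != reference}


if __name__ == "__main__":
    # Toy lengths, not from the talk: total 32, so N50 is reached at the second 8.
    print("toy N50:", n50([2, 2, 2, 3, 3, 4, 8, 8]))
    for name, fold in fold_over_reference(CONTIG_N50).items():
        print(f"{name}: {fold:.1f}x the RefV2 contig N50")
```

Contig N50 is only one way to express contiguity; scaffold N50 or sequence counts from the same tables could be compared in the same manner.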