Large, Complex Plant Genomes with HiFi Sequencing

Examples of highly accurate long reads (HiFi reads) resolving large and complex plant genomes.

Michelle Vierra

May 13, 2020

Transcript

  1. For Research Use Only. Not for use in diagnostic procedures.
    © Copyright 2020 by Pacific Biosciences of California, Inc. All rights reserved.
    Large and Complex Plant Genomes with HiFi Sequencing
    Michelle Vierra, Manager, Plant and Animal Sciences
    May 2020
  2. WHAT ARE HIFI READS?
    - They are long: tens of kilobases
    - They are accurate: long reads with ≥Q20 (99%) single-molecule accuracy
    - They have single-molecule resolution: sequence DNA or RNA
    - They are unbiased: no DNA amplification, least GC content and sequence complexity bias
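
The "≥Q20 (99%)" figure on this slide follows from the standard Phred scale, where accuracy = 1 - 10^(-Q/10). A minimal sketch of that conversion is below; the function names are illustrative and do not come from the slides or any PacBio tool.

```python
def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score Q to accuracy: 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def expected_errors(read_length: int, q: float) -> float:
    """Expected number of erroneous bases in a read of the given length at quality Q."""
    return read_length * (1.0 - phred_to_accuracy(q))


if __name__ == "__main__":
    # Q20 is the minimum quoted for HiFi reads; Q30 and Q40 are shown for comparison.
    for q in (20, 30, 40):
        print(f"Q{q}: accuracy {phred_to_accuracy(q):.4%}")
```

Under this definition, Q20 is the threshold at which a read averages one error per 100 bases.
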
  3. ADDRESSING QUESTIONS ABOUT HIFI READS
    Will HiFi work on complex plant genomes?
    The 3 C’s of Genome Quality
  4. THE REDWOOD GENOME IS LARGE AND COMPLEX
    - Human: 3 Gb, diploid. 6 Gb of DNA content!
    - Redwood: 9 Gb, hexaploid. 54 Gb of DNA content!
    - 9x the size of the human genome!
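
The numbers on this slide are consistent with simple ploidy arithmetic: total DNA content is the per-copy genome size multiplied by the number of copies. The sketch below reproduces them, assuming the 3 Gb and 6 Gb figures refer to the human genome and the 9 Gb and 54 Gb figures to redwood; the function name is illustrative.

```python
def total_dna_content_gb(genome_size_gb: float, ploidy: int) -> float:
    """Total DNA content = size of one genome copy x number of copies (ploidy)."""
    return genome_size_gb * ploidy


human = total_dna_content_gb(3, ploidy=2)    # diploid
redwood = total_dna_content_gb(9, ploidy=6)  # hexaploid

print(f"Human:   {human} Gb")
print(f"Redwood: {redwood} Gb")
print(f"Redwood / human: {redwood / human:.0f}x")
```
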
  5. THE PROJECT WORKFLOW
    Sample Prep (4 days), Sequencing (7 days), Assembly (6 days): 17 DAYS
    - Collected ~80 g of needles and froze
    - Extracted 56 µg of DNA with Circulomics kit
    - Prepped 2 HiFi libraries at ~25 kb
    - Ran 31 SMRT Cells 8M across 9 instruments in 7 days
    - Streamed data for immediate CCS analysis and conversion to HiFi reads
    - Used hifiasm for quick, haplotype-aware assembly
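
The slide names hifiasm but gives no command line or parameters. As a rough illustration only, the sketch below builds the basic hifiasm invocation (`-o` for the output prefix, `-t` for threads, followed by the read files) without running it. The file names, thread count, and helper name are hypothetical and are not the settings used for the redwood project.

```python
import shlex
from typing import List, Sequence


def build_hifiasm_command(reads: Sequence[str], out_prefix: str, threads: int = 32) -> List[str]:
    """Build (but do not run) a basic hifiasm command line.

    Only the output prefix (-o) and thread count (-t) are set here;
    any project-specific tuning is deliberately left out.
    """
    if not reads:
        raise ValueError("at least one HiFi read file is required")
    return ["hifiasm", "-o", out_prefix, "-t", str(threads), *reads]


# Hypothetical file names, for illustration only.
cmd = build_hifiasm_command(["redwood.hifi.fastq.gz"], "redwood.asm", threads=64)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where hifiasm is installed.
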
  6. CHECKING AGAINST A BASELINE
    HiFi exceeds results of ONT + short reads in all basic stats

    California Redwood Genome Assembly Results

    Methodology                                     PacBio HiFi    ONT + short reads [1]
    Genome Coverage                                 22-fold        23-fold + 122-fold
    Assembly Size (Gb)                              47.7           26.5
    Contig N50 (Mb)                                 1.92           0.11
    BUSCO Complete                                  59%            56%
    Mapped transcripts with frameshift errors [2]   0.12%          1.97%

    [1] Sequencing and assembling mega-genomes of mega-trees: the giant sequoia and coast redwood genomes
    [2] Transcript set of Abies alba from Neale, D. et al. Varying number of transcripts aligned to each genome (4,958 mapped to PacBio HiFi redwood, 4,760 mapped to ONT redwood)
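
Contig N50, one of the statistics in the table above, is conventionally defined as the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch of that calculation follows; the toy contig lengths are invented for illustration and are unrelated to the redwood assemblies.

```python
from typing import Iterable


def contig_n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable: running total always reaches the full size")


# Toy example: total = 30, so N50 is reached at the second-longest contig.
toy_contigs = [2, 4, 6, 8, 10]
print(contig_n50(toy_contigs))

# Fold difference between the two contig N50 values reported in the table (Mb).
n50_fold = 1.92 / 0.11
print(f"HiFi contig N50 is about {n50_fold:.1f}x the ONT + short reads value")
```
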
  7. THE THREE C’S OF GENOME QUALITY: REDWOOD RESULTS
    - 1.92 Mb contig N50
    - No gaps
    - >5X the haploid genome size (resolving most of the hexaploidy)
    - 59% of BUSCO genes complete*
    - Only 0.12% of mapped transcripts resulting in frameshift errors

    *Conifer genomes have very large introns that make BUSCO an inefficient measure of completeness, since it maxes out at ~70%
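
The ">5X the haploid genome size" claim can be checked against figures given elsewhere in the deck: a 47.7 Gb HiFi assembly, a 9 Gb redwood genome with 54 Gb of total hexaploid content, and the mapped-transcript counts in the baseline footnote. The sketch below does that arithmetic. The frameshift transcript counts it prints are back-calculated from the reported percentages and are approximations, not values stated on the slides.

```python
ASSEMBLY_GB = 47.7   # HiFi assembly size from the baseline table
HAPLOID_GB = 9.0     # per-copy redwood genome size from the redwood slide
HEXAPLOID_GB = 54.0  # total DNA content from the redwood slide

fold_of_haploid = ASSEMBLY_GB / HAPLOID_GB
fraction_of_hexaploid = ASSEMBLY_GB / HEXAPLOID_GB

print(f"Assembly is {fold_of_haploid:.1f}x the haploid genome size")
print(f"Assembly covers {fraction_of_hexaploid:.0%} of the hexaploid DNA content")

# Approximate transcript counts implied by the reported frameshift rates.
hifi_frameshift = round(0.0012 * 4958)  # 0.12% of 4,958 mapped transcripts
ont_frameshift = round(0.0197 * 4760)   # 1.97% of 4,760 mapped transcripts
print(f"Implied frameshifted transcripts: HiFi ~{hifi_frameshift}, ONT + short reads ~{ont_frameshift}")
```
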
  8. OTHER LARGE AND/OR COMPLEX PLANT HIFI ASSEMBLIES

                     Diploid plant 1   Diploid plant 2   Maize      Oat
    Genome size      3.2 Gb            3.2 Gb            2.5 Gb     11 Gb
    Library size     20 kb             20 kb             17 kb      17 kb
    Coverage         21-fold           16-fold           20-fold    22-fold
    Contig N50       12 Mb             7 Mb              14.7 Mb    20.3 Mb
    Assembly time    <1 day            <1 day            6 hours    12 hours

    We see consistently good results across a wide array of complex plant genomes with assemblies complete in less than a day!
  9. OTHER LARGE AND/OR COMPLEX PLANT HIFI ASSEMBLIES
    Assembling the tetraploid rose genome with HiFi
    Watch the full presentation: The impact of highly accurate PacBio sequence data on the assembly of a tetraploid rose
    “We managed to assemble a heterozygous, polyploid genome, without the need for ultra high molecular weight DNA, which is required for a lot of other long-read sequencing”
  10. OTHER LARGE AND/OR COMPLEX PLANT HIFI ASSEMBLIES
    Assembling the Cannabis genome with HiFi compared to Long Reads

    Metrics           Cannabis Long Reads            Cannabis HiFi
                      Primary     Alt. Haplotype     Primary     Alt. Haplotype
    Assembly size     999 Mb      184.7 Mb           991 Mb      290 Mb
    Contig N50        3.5 Mb      0.2 Mb             8.6 Mb      0.69 Mb
    BUSCO Complete    97.4%       24.4%              98.3%       40.1%
    CPU Hours         8,326       -                  248         -

    HiFi sequencing assembled more of the alternative haplotype, captured more complete genes, and took more than 33 times fewer CPU hours.
    All Cannabis genomes made public by Kevin McKernan at Medicinal Genomics: https://www.medicinalgenomics.com/jamaican-lion-data-release/
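
The comparative claims on the Cannabis slide follow directly from the table. The short sketch below recomputes the ratios from the reported values; the variable names are illustrative.

```python
# Values as reported in the Cannabis comparison table.
long_reads = {"alt_size_mb": 184.7, "alt_busco_pct": 24.4, "cpu_hours": 8326}
hifi = {"alt_size_mb": 290.0, "alt_busco_pct": 40.1, "cpu_hours": 248}

cpu_ratio = long_reads["cpu_hours"] / hifi["cpu_hours"]
alt_size_ratio = hifi["alt_size_mb"] / long_reads["alt_size_mb"]
alt_busco_gain = hifi["alt_busco_pct"] - long_reads["alt_busco_pct"]

print(f"CPU hours: long reads used {cpu_ratio:.1f}x as many as HiFi")
print(f"Alt. haplotype assembly size: HiFi is {alt_size_ratio:.2f}x the long-read value")
print(f"Alt. haplotype BUSCO completeness: +{alt_busco_gain:.1f} percentage points with HiFi")
```
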
  11. HIGH-QUALITY, COST-EFFECTIVE SEQUENCING FOR PLANTS
    pacb.com/agbio
    - Highly accurate long reads with minimum accuracy of Q20 (99%)
    - High contiguity and base quality genome assemblies
    - Small file sizes and fast analysis time
    - Assemble up to a 2.5 Gb genome in a single SMRT Cell 8M for ~$1,300
    - Run up to 200 samples (2.5 Gb) per year, per Sequel II System