Speeding up Genome Assembly with Sparse and Hierarchical Minimizer Indices

Jason Chin
January 14, 2020

I presented this slide deck at the Plant and Animal Genomics Meeting (#PAG2020 #PAGXXVIII) 2020 in San Diego. We use a novel data structure to speed up genome assembly, which makes it possible to do large genome assemblies on a home computer. In the cloud or on a server, we can assemble a complicated genome such as Cannabis in one to two hours.



  1. Speeding up Genome Assembly with Sparse and Hierarchical Minimizer Indices

    Jason Chin, Asif Khalak (Twitter handles: @infoecho, @AsifKhalak), Foundation of Biological Data Science (http://biologicaldatascience.org/). Plant and Animal Genomics, San Diego, Jan 12, 2020.
  2. Disclaimer

    I am currently an employee of DNAnexus. However, the information presented here is solely my own point of view and does not represent the viewpoint of DNAnexus or any other parties who may be mentioned in this presentation.
  3. The Dawn of Long Noisy Read Genome Assembly: Hierarchical Genome Assembly Process

    • Based on “inter-molecule” error correction/consensus
    • Two overlap steps:
        • For error correction
        • For assembly graph construction
    • The first human genome assembly done this way took 50+ CPU years.
  4. Making Faster Assemblers With “Inter-molecule Consensus”

    • More efficient overlappers for long noisy reads
        • MHAP (Berlin, et al.): locality-sensitive hashing
        • Daligner (Myers): cache coherence, high-performance distributed computing modules
    • Led to faster assemblers
        • 2015: FALCON (based on daligner), Canu (based on MHAP)
        • Post-2015: MECAT2, Miniasm2, wtdbg2/RedBean, Flye, Shasta
  5. “Intra-molecule” Consensus Takes a Long Time To Come to Fruition: Better Long and Accurate Reads

    • 10-year development
    • It is possible to generate > 15 kb, > 99% accuracy reads from single-molecule sequencing.
    • https://twitter.com/PacBio/status/1136314570514022401
  6. How Fast Can We Do Genome Assembly Now?

  7. Claim: Human De Novo Genome Assembly in 100 Minutes

    Current Human Genome Assemblies in GenBank

    | Genome | Approach | Total | N50 | Max | # seq |
    |---|---|---|---|---|---|
    | Theoretical ideal | n/a | 3.200 Gb | 249.0 Mb | 249.0 Mb | 23 |
    | HG002 (GCA_001542345.1) | Celera | 2.987 Gb | 4.5 Mb | 35.2 Mb | 12302 |
    | HG002 | current | 2.897-2.915 Gb | 26.5-35.3 Mb | 102.0-108 Mb | 2384-2571 |
    | chm13 (GCA_000983455.1) | daligner/FALCON | 2.851 Gb | 13.0 Mb | 63.14 Mb | 2873 |
    | chm13 | current | 2.839 Gb | 29.2-33.3 Mb | 95.4-102.0 Mb | 844-978 |
    | NA12878 (GCA_002077035.1) | FALCON | 2.858 Gb | 14.5 Mb | 52.4 Mb | 3635 |
    | NA12878 | current | 2.880 Gb | 25.4 Mb | 109.8 Mb | 3061 |

    • HG002 (GIAB, resampled 2018): male child of a parent/child trio, PacBio Sequel
    • CHM13 (Wash U, 2015): adult female, PacBio RS2
    • NA12878 (Wash U, 2015): adult female, PacBio RS2
  8. Claim: Human De Novo Genome Assembly in 100 Minutes

    Assembly Performance on PacBio 2019 CCS HG002 Dataset

    CPU hours for different genomes & parameter sets:

    | Dataset and parameter set | seqdb building | indexing | overlapping | assembly | consensus |
    |---|---|---|---|---|---|
    | hg002-16-80-6-2 | 0.073 | 0.493 | 6.687 | 0.716 | 8.290 |
    | hg002-16-80-4-2 | 0.080 | 0.511 | 11.880 | 0.800 | 7.325 |
    | hg002-16-80-18-1 | 0.081 | 0.491 | 7.767 | 0.743 | 6.736 |
    | hg002-16-80-36-1 | 0.075 | 0.494 | 3.646 | 0.641 | 7.023 |
    | hg002_sequel2-16-80-6-2 | 0.027 | 0.256 | 4.139 | 0.463 | 4.397 |
    | chm13-16-80-4-2 | 0.067 | 0.443 | 6.822 | 0.630 | 7.453 |
    | chm13-16-80-3-2 | 0.061 | 0.452 | 8.897 | 0.682 | 6.297 |
    | pgp1-16-80-4-2 | 0.045 | 0.405 | 8.151 | 0.323 | 6.610 |
    | pgp1-16-80-3-2 | 0.047 | 0.403 | 10.846 | 0.365 | 7.145 |

    Wall clock time is from 1 to 2 hours using 24 cores on m5d.metal or r5d.12xlarge on AWS.

    https://github.com/cschin/Peregrine
  9. Accuracy Assessment (CHM13 dataset, 25x Coverage)

    Error rate estimation using VMRC59 BAC library for the CHM13 assembly
  10. Contiguity Assessment (CHM13 dataset)

    Contiguity of Peregrine (current) assembly meets or exceeds traditional approach

    • Total size: 2,860,555,444
    • Max size: 95,429,376
    • N50 size: 33,305,254
    • N90 size: 4,575,882
    • Number of contigs: 1,800
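The N50 and N90 figures on slide 10 follow the usual definition: sort contigs from longest to shortest and report the length at which the running total first reaches 50% (or 90%) of the assembly size. A minimal sketch of that calculation, with a hypothetical helper name `nxx` and toy contig lengths rather than the real assembly:

```python
def nxx(lengths, fraction=0.5):
    """Return the contig length at which the cumulative size of contigs,
    taken longest first, reaches `fraction` of the total assembly size."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length


# Toy contig lengths for illustration only (not from the CHM13 assembly).
toy_contigs = [50, 40, 30, 20, 10]
toy_n50 = nxx(toy_contigs, 0.5)
toy_n90 = nxx(toy_contigs, 0.9)
print(toy_n50, toy_n90)
```

With real data, `lengths` would be the contig lengths read from the assembly FASTA.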
  11. How?

  12. First, why is de novo assembly typically slow?

    [Figure: matrix of read i against read j, N^2 comparisons]

    For the “Overlap-Layout-Consensus” paradigm, we compare all reads against all reads, ~N^2 comparisons. We can speed up by rejecting unmatched pairs faster, but we still have to check all pairs.
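To get a feel for the scale of the all-against-all comparison on slide 12, the number of reads can be estimated as N ≈ G·C/L. The sketch below uses the genome size and read length quoted on slide 15 (G ~ 10^9, L ~ 10^4); the coverage of 30x is an assumption chosen for illustration, and the function names are hypothetical.

```python
def read_count(genome_size, coverage, read_length):
    """Approximate number of reads N = G * C / L."""
    return genome_size * coverage // read_length


def all_pairs(n):
    """Number of unordered read pairs in an all-vs-all comparison."""
    return n * (n - 1) // 2


# G and L follow slide 15; coverage = 30 is an assumed example value.
n_reads = read_count(genome_size=10**9, coverage=30, read_length=10**4)
n_pairs = all_pairs(n_reads)
print(f"{n_reads:,} reads, {n_pairs:,} read pairs")
```

Even with a fast rejection test per pair, a count in the trillions is why avoiding the all-pairs loop matters more than speeding up each comparison.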
  13. Learn From Experts Who Solve Jigsaw Puzzles Efficiently https://www.youtube.com/watch?v=oRlCNXdcMc0

  14. Why is Peregrine fast? SHIMMER Indices Rapidly Group Reads Into Many Specific Categories

    Like the efficient way to solve jigsaw puzzles, we group reads with unique “Sparse HIerarchical MiniMizER” (SHIMMER) pairs. Matching SHIMMER indices that are 600bp apart dramatically reduces the overlap complexity compared with matching raw reads.
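The idea behind slide 14 can be sketched in two steps: pick window minimizers over k-mer hashes, then apply the same selection again over the chosen markers to get a sparser second level. This is only an illustration of the general technique. The function names, the BLAKE2 hash, and the values of `k`, `w`, and `r` are assumptions, not Peregrine's implementation, and reverse-complement handling is left out.

```python
import hashlib
import random


def kmer_hash(kmer):
    """Stable 64-bit hash of a k-mer (built-in hash() varies between runs)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq, k=16, w=80):
    """Level-1 markers: for every window of w consecutive k-mers, keep the
    (position, hash) of the k-mer with the smallest hash."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        picked.add((start + window.index(smallest), smallest))
    return sorted(picked)


def reduce_level(markers, r=4):
    """Next level: slide a window of r consecutive markers and keep the one
    with the smallest hash, which yields a sparser subset of the markers."""
    if not markers:
        return []
    picked = set()
    for start in range(max(1, len(markers) - r + 1)):
        window = markers[start:start + r]
        picked.add(min(window, key=lambda marker: marker[1]))
    return sorted(picked)


# Random toy read with a fixed seed, for illustration only.
rng = random.Random(0)
read = "".join(rng.choice("ACGT") for _ in range(2000))
level1 = minimizers(read)
level2 = reduce_level(level1)
print(len(level1), "level-1 markers ->", len(level2), "level-2 markers")
```

Because the hash is deterministic, two reads that share a stretch of sequence pick the same markers inside it, which is what makes the markers usable as grouping keys.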
  15. Grouping Reads By SHIMMER Pairs

    • Shared SHIMMER pair: a group of reads that are likely overlapped with each other
    • Confirm overlaps by base-to-base or minimizer-to-minimizer alignment
    • Overlaps to assembly string graph
    • Contigs
    • Consensus (included in Peregrine)

    Complexity ~ O(GC^2/d) instead of O(N^2) ~ O(G^2C^2/L^2)

    • N: number of reads
    • G: genome size ~ 10^9
    • d: distance between MP ~ 500
    • L: read length ~ 10^4
    • C: coverage

    O(GC^2/d) is ~5000x faster than O(N^2). Number of reads grouped ~ sequence coverage.
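The grouping step and the ~5000x estimate on slide 15 can be illustrated with a small sketch. Reads are represented here by made-up lists of marker hashes in read order; each adjacent marker pair acts as a key, and only reads sharing a key become candidate overlaps. The ratio of the two complexities simplifies to (G^2C^2/L^2) / (GC^2/d) = G·d/L^2. All names and toy values below are hypothetical and are not taken from Peregrine.

```python
from collections import defaultdict


def group_by_marker_pair(read_markers):
    """Map each adjacent (marker, next_marker) pair to the reads containing it."""
    groups = defaultdict(set)
    for read_id, markers in read_markers.items():
        for left, right in zip(markers, markers[1:]):
            groups[(left, right)].add(read_id)
    return groups


def candidate_overlaps(groups):
    """Only reads that share at least one marker pair are compared."""
    pairs = set()
    for reads in groups.values():
        ordered = sorted(reads)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


def speedup(genome_size, read_length, marker_distance):
    """(G^2 C^2 / L^2) / (G C^2 / d) = G * d / L^2; coverage cancels out."""
    return genome_size * marker_distance / read_length ** 2


# Toy marker lists (made-up hash values) for four reads.
toy_reads = {
    "r1": [11, 22, 33, 44],
    "r2": [22, 33, 44, 55],
    "r3": [44, 55, 66],
    "r4": [77, 88, 99],
}
groups = group_by_marker_pair(toy_reads)
candidates = candidate_overlaps(groups)
print(sorted(candidates))
print(speedup(genome_size=10**9, read_length=10**4, marker_distance=500))
```

In the toy data, r1 and r3 share a single marker but no adjacent pair, so they are never compared directly. Each candidate pair would still need to be confirmed by alignment, as the slide notes.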
  16. Check out the preprint and the GitHub repository for algorithm details

    • https://www.biorxiv.org/content/10.1101/705616v1
    • https://github.com/cschin/Peregrine/
  17. More Repetitive Genomes

    • Species shown: Chanos chanos (milkfish), Salmo trutta (brown trout), Jamaica Lion Cannabis strain
    • N50: 800 kb, Assembly Size: 2.3G, Assembly Time: 70 mins
    • N50: 2.4 Mb, Assembly Size: 705M, Assembly Time: 44 mins
  18. Genome Assembly @ Home: Mac vs. PC

    • Mac Pro6,1: 6-core Intel Xeon E5, 64G RAM, 1T flash drive
    • PC (homemade): 4-core Intel i5-7600K, 32G RAM, 512G NVMe SSD
  19. A Jamaica Lion Cannabis Strain Draft Assembly

    Assembly Results

    • 30.6 Gb input sequences (~30x), average length = 20009 bp
    • Assembly time:
        • PC, using 2 cores: 5.6 hours
        • Mac, with flash drive: < 7 hours; without flash drive: ~13 hours
    • Total: 976,491,714; N50: 4,034,814; N90: 412,051

    | BUSCO category | Assembly | With CP & MT |
    |---|---|---|
    | Complete BUSCOs (C) | 1998 (94.2%) | 2063 (97.3%) |
    | Complete and single-copy BUSCOs (S) | 1524 (71.9%) | 1576 (74.3%) |
    | Complete and duplicated BUSCOs (D) | 474 (22.23%) | 487 (23.0%) |
    | Fragmented BUSCOs (F) | 27 (1.3%) | 19 (0.9%) |
    | Missing BUSCOs (M) | 96 (4.5%) | 39 (1.8%) |
    | Total BUSCO groups searched | 2121 | 2121 |
  20. Genome Assembly In The Cloud

    • Assembly started with the 30Gb fasta file
    • Assembly finished in 50 minutes
    • 24 core peak usage
    • After minor optimization
    • Total Assembly Size: 950.3 Mb
    • Primary Contig N50: 8.1 Mb
    • Longest Contig: 40.4 Mb
  21. Genome Assembly On Premises

    Customized 1U Server Built with Integrated Peregrine Assembler Workshop by ExaAI (https://www.exaai.io/en/)

    • 1U, dual CPU (Xeon E5-2630 V4, total 20 cores)
    • 512G memory, 10T NVMe SSD
    • CentOS, GPU ready

    Finish the Jamaica Lion Cannabis assembly in ~70 minutes
  22. THCAS/CBDAS/CBDAS2 cluster identified in a 3Mb region of a contig

    [Figure panels: All reads; Reads that cover THCAS/CBDAS/CBDAS2]
  23. Repeat structures around the THCAS/CBDAS clusters revealed by the assembly

  24. Comparing Two Haplotypes of The THCAS/CBDAS Cluster

    [Figure: primary contig vs. alt. contig across the THCAS/CBDAS cluster]

    • 5 copies in the alt. contig
    • 4 copies in the primary contig
  25. Summary & Future Development

    • As we enter the pan-genomics era for larger genomes, relieving the computational burden of assembly is crucial for speeding up research.
    • The consensus module is capable of generating high-accuracy contigs (QV > 45) without any slow signal-level polishing step.
    • With current sequencing technologies, it is possible to assemble a 3 Gbp genome in 100 minutes, and it will become even faster in the future. We can focus more on solving biological problems than computational ones.
    • Future work:
        • Diploid resolution, integrating biallelic variant phasing
        • Support for super-long-read assembly
        • Pan-genomics analysis using the SHIMMER index for fast whole-genome alignments
  26. @BDSFoundation Jason Chin, Asif Khalak (Twitter Handle: @infoecho, @AsifKhalak) http://biologicaldatascience.org/

  27. Acknowledgements

    • Kevin McKernan (Medicinal Genomics), Jamaica Lion data set
    • Charlie Hou (ExaAI) for computing support
    • Mike Hunkapiller (PacBio), Paul Peluso (PacBio) for generating and providing free hg002 data
    • Glennis Logsdon, Mitchell Vollger, Eichler Lab for CHM13 data
    • Church Lab & Shilpa Garg for PGP-1 data
    • Dario Cantu
    • Kevin Fengler
    • Chris Dunn for PypeFlow update
    • Arkarachai Fungtammasan (DNAnexus) for assembly evaluation
    • Mike Schatz and Heng Li for discussion

    Thanks For Your Attention