Assembling Human Genome in 100 Minutes

Jason Chin
May 23, 2019

How can we do super-fast genome assembly with a "SHIMMER" indexing scheme?


De novo genome assembly is the most unbiased way to acquire comprehensive genomic information and to gain insight into new DNA sequences that may not exist in reference genomes. Many de novo human genomes have been published in the last couple of years, leveraging cheaper short-read and single-molecule long-read technologies. As the scale of sequencing work grows, the computational burden of generating assemblies persists. The most common long-read assembly framework, which uses the overlap-layout-consensus paradigm, requires all-to-all read comparisons. The computational complexity of this comparison step scales quadratically with the number of reads. Most methods still require hundreds to thousands of CPU hours, although various techniques have been developed to reject non-overlapping pairs quickly or to reduce the extra computation for repeats. The high computation requirement persists even for more accurate long reads (accuracy ~99% and length ~11 to 15k), which are achievable with current sequencing technologies.
We introduce the de novo assembler Peregrine, which uses a novel minimizer-based read index schema. This allows the removal of the all-to-all read comparisons. Instead, read pairs with a high probability of overlapping are gathered in one step and compared by utilizing the index. In our initial implementation, we can assemble 28x to 32x human PacBio CCS read datasets in less than 20 CPU hours and two wall-clock hours to high contiguity (N50 > 20 Mb). The continued advance of sequencing technologies in terms of read length and base accuracy, together with Peregrine, will enable routine generation of human de novo assemblies. This leads to a more comprehensive representation of genomic variation at population scale, beyond SNPs and small indels. We have also applied Peregrine successfully to non-mammalian genomes such as plants. Future implementations will enable the use of less accurate long reads, such as Oxford Nanopore reads, and longer PacBio reads.



  1. Assembling Human Genome in 100 Minutes

    Jason Chin, Asif Khalak (Twitter: @infoecho, @AsifKhalak), Foundation of Biological Data Science. Sequencing, Finishing and Analysis in the Future Meeting, May 23, 2019.
  2. The Dawn of Human Genome Assembly

    Supercomputer used for Celera Genomics’ first WGS human assembly in the early 2000s
  3. Fast Forward to 2014, The Dawn of Long Noisy Read Genome Assembly

    - Two overlap steps:
        - For error correction
        - For assembly graph construction
    - The first human genome assembly done this way took 50+ CPU years.
  4. It Is Getting Better

    - After HGAP for long-read assembly, the community started to build overlappers better suited to noisy long reads
        - Konstantin Berlin et al.: MHAP, locality-sensitive hashing -> Canu assembler
        - Gene Myers: daligner, cache coherence, high-performance distributed computing modules -> FALCON assembler
        - Myself: some earlier attempts that never saw the light of day
    - More to come after 2015
        - MECAT, Canu, miniasm2, wtdbg2, flye, shasta
  5. What Can We Do With a Bit More Accurate (>99%) and Longer (>10kb) Sequences for Assembly?

    - Current assemblers can assemble the better data faster (not surprising)
    - Most assemblers still need read-to-read comparison, which has a computing time complexity ~ O(n^2), n: number of reads
    - New approaches like wtdbg2 can do assembly significantly faster than others
    - N50 = 12 Mb to 29 Mb depending on data quality / parameters
    - CPU time ~ 200 to 3000 CPU hours
  6. What Can We Really Do? Get Rid of O(n^2) For OLC Assemblers!!

    - The genome is mostly “linear”
    - Genome assembly ~ sorting reads in a linear coordinate along the chromosomes
    - No one uses an O(n^2) algorithm for sorting
    - Radix sort has complexity linear in the number of items
    - How can we make genome assembly more like radix sort than bubble sort?
    - A very simple idea: an indexing structure to quickly group reads that are highly likely to overlap with each other!!
    - We need some efficient way to “bin” the reads
  7. Minimizer Is An Awesome Data Structure

    We still need a very efficient way to compute very sparse minimizers to reduce the number of bins.
  8. One day I had this private discussion with Heng Li.

  9. Sparse & HIerarchical MiniMizER (SHIMMER) Indexing

    Diagram labels: Sequence; K-mers over moving windows; Hash values; Level-0 minimizers; Level-1 minimizers; Level-2 minimizers; Digest for longer sequence; Smaller index size; Larger index size.
  10. Building SHIMMER Along a Read
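
The hierarchy on the previous slide (k-mers over moving windows, hash values, then level-0, level-1, and level-2 minimizers) can be sketched roughly as follows. This is a minimal illustration and not Peregrine's implementation: the hash function, the naive window scan, and the parameter names `k`, `w`, `r`, and `levels` with their default values are all assumptions made for the example.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def level0_minimizers(seq: str, k: int, w: int) -> list:
    """For every window of w consecutive k-mers keep the smallest hash.

    Returns a position-sorted list of (position, hash) tuples.
    A sequence shorter than k + w - 1 yields no minimizers.
    """
    hashes = [(kmer_hash(seq[i:i + k]), i) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(hashes) - w + 1):
        h, pos = min(hashes[start:start + w])  # ties go to the leftmost k-mer
        picked.add((pos, h))
    return sorted(picked)


def reduce_level(mmers: list, r: int) -> list:
    """Minimizers of minimizers: slide a window of r consecutive
    minimizers and keep the one with the smallest hash."""
    if len(mmers) < r:
        return list(mmers)
    picked = set()
    for start in range(len(mmers) - r + 1):
        picked.add(min(mmers[start:start + r], key=lambda m: (m[1], m[0])))
    return sorted(picked)


def shimmers(seq: str, k: int = 16, w: int = 80, r: int = 3, levels: int = 2) -> list:
    """Level-0 minimizers followed by `levels` rounds of sparsification."""
    mmers = level0_minimizers(seq, k, w)
    for _ in range(levels):
        mmers = reduce_level(mmers, r)
    return mmers
```

Each higher level is a subset of the level below it, so the index gets smaller while each surviving minimizer summarizes a longer stretch of the read. A production implementation would use a rolling hash and a linear-time sliding-window minimum instead of the quadratic scan used here for clarity.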

  11. Mapping Neighboring Minimizers to Reads

    - Two reads that overlap with each other are likely to share a co-linear set of SHIMMERs.
    - Build a hash map F: (minimizer pair) → [Read1, Read2, …] for all reads that have the same neighboring minimizer pairs.

    Diagram labels: Read 1; Read 2; Shared neighboring minimizer pair; Inconsistency caused by errors.
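
The hash map F from neighboring minimizer pairs to reads might look like the sketch below. It is a toy illustration under stated assumptions: minimizers are represented as plain integers in read order, and the function names `build_pair_index` and `candidate_overlaps` are hypothetical, not names from Peregrine.

```python
from collections import defaultdict


def build_pair_index(read_minimizers: dict) -> dict:
    """Map each neighboring minimizer pair to the reads that contain it.

    read_minimizers: read id -> list of minimizer values in read order.
    """
    index = defaultdict(list)
    for read_id, mmers in read_minimizers.items():
        for left, right in zip(mmers, mmers[1:]):
            index[(left, right)].append(read_id)
    return index


def candidate_overlaps(index: dict) -> dict:
    """Count shared neighboring pairs for every read pair in the same bin."""
    shared = defaultdict(int)
    for reads in index.values():
        unique = sorted(set(reads))
        for i in range(len(unique)):
            for j in range(i + 1, len(unique)):
                shared[(unique[i], unique[j])] += 1
    return dict(shared)


# Toy data: read1 and read2 share the pairs (42, 7) and (7, 93).
reads = {
    "read1": [11, 42, 7, 93],
    "read2": [42, 7, 93, 58],
    "read3": [5, 60, 61],
}
index = build_pair_index(reads)
print(candidate_overlaps(index))  # {('read1', 'read2'): 2}
```

Only reads that land in the same bin are ever compared, so `read3` is never paired with the other two. A sequencing error that destroys one minimizer breaks at most the two pairs that include it, and the remaining shared pairs still bring the reads together.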
  12. None
  13. Grouping Reads By Neighboring Minimizer Pairs

    - Reads that share a minimizer pair form a group of reads that are likely to overlap with each other
    - Number of reads grouped ~ sequence coverage
    - Confirm overlaps by base-to-base or minimizer-to-minimizer alignment
    - Overlaps to assembly string graph to contigs
    - Complexity ~ O(nc^2) instead of O(n^2), n: number of reads, c: coverage
    - We have c^2 ≪ n
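
To get a feel for how far apart n^2 and nc^2 are, the following back-of-the-envelope sketch plugs in round numbers. The 3 Gbp genome size and 30x coverage appear in the slides; the 12 kb read length is an assumption chosen from the ~11 to 15k range quoted in the abstract, and the counts ignore all constant factors.

```python
def comparison_counts(genome_size: int, read_length: int, coverage: int):
    """Return (n, all-to-all comparisons ~ n^2, binned comparisons ~ n * c^2)."""
    n = coverage * genome_size // read_length  # approximate number of reads
    return n, n * n, n * coverage ** 2


n, all_to_all, binned = comparison_counts(
    genome_size=3_000_000_000,  # 3 Gbp
    read_length=12_000,         # assumed average read length
    coverage=30,                # ~30x
)
print(n)                     # 7500000
print(all_to_all // binned)  # 8333
```

With these inputs c^2 = 900 while n = 7.5 million, so the condition c^2 ≪ n holds comfortably and the binned approach does several thousand times fewer pairwise checks in this idealized count.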
  14. ~ 30x human data

  15. Shimmer Index Used for Fast Reads to Draft Contig Alignment for Consensus

    - Re-use the Shimmer index for the reads
    - Build the Shimmer index for the draft contigs
    - Shared neighboring minimizer pairs
    - Super fast mapping for consensus
    - Full genome consensus polishing with the “falcon-sense” DAG consensus module
  16. CPU Usage For A Couple Human Genomes With Different Parameter Sets

    CPU hours for different genomes & parameter sets:

    | Dataset / parameter set | seqdb building | indexing | overlapping | assembly | consensus |
    |---|---|---|---|---|---|
    | hg002-16-80-6-2 | 0.073 | 0.493 | 6.687 | 0.716 | 8.290 |
    | hg002-16-80-4-2 | 0.080 | 0.511 | 11.880 | 0.800 | 7.325 |
    | hg002-16-80-18-1 | 0.081 | 0.491 | 7.767 | 0.743 | 6.736 |
    | hg002-16-80-36-1 | 0.075 | 0.494 | 3.646 | 0.641 | 7.023 |
    | hg002_sequel2-16-80-6-2 | 0.027 | 0.256 | 4.139 | 0.463 | 4.397 |
    | chm13-16-80-4-2 | 0.067 | 0.443 | 6.822 | 0.630 | 7.453 |
    | chm13-16-80-3-2 | 0.061 | 0.452 | 8.897 | 0.682 | 6.297 |
    | pgp1-16-80-4-2 | 0.045 | 0.405 | 8.151 | 0.323 | 6.610 |
    | pgp1-16-80-3-2 | 0.047 | 0.403 | 10.846 | 0.365 | 7.145 |

    Wall clock time is from 1 to 2 hours using 24 cores on m5d.metal or r5d.12xlarge on AWS.
  17. Contiguity

    | Dataset / parameter set | N50 (bp) |
    |---|---|
    | hg002-16-80-6-2 | 27,848,727 |
    | hg002-16-80-4-2 | 33,364,927 |
    | hg002-16-80-18-1 | 26,459,768 |
    | hg002-16-80-36-1 | 6,746,698 |
    | hg002_sequel2-16-80-6-2 | 21,936,975 |
    | chm13-16-80-4-2 | 29,260,433 |
    | chm13-16-80-3-2 | 33,307,555 |
    | pgp1-16-80-4-2 | 27,316,706 |
    | pgp1-16-80-3-2 | 28,222,972 |
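
For readers unfamiliar with the contiguity metric, N50 is the contig length at which contigs of that length or longer cover at least half of the total assembly. A minimal sketch of the standard definition follows; the function name and the toy lengths are illustrative only.

```python
def n50(contig_lengths: list) -> int:
    """Length L such that contigs of length >= L hold at least half the bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


print(n50([80, 70, 50, 40, 30, 20, 10]))  # 70
```

In the toy example the total is 300, and the two longest contigs (80 + 70) already reach 150, so the N50 is 70.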
  18. Accuracy Assessment

    Figure: Accuracy Evaluation for CHM13 Assembly. Concordance (Phred QV scale, axis 0 to 60) of draft contigs and polished contigs against Homo sapiens BAC clones (VMRC59): AC270115.1, AC270117.1, AC270118.1, AC270119.1, AC270120.1, AC270122.1, AC270131.1, AC270132.1, AC270133.1, AC270134.1, AC270135.1, AC270136.1, AC270137.1, AC270145.1, AC270146.1, AC270238.1, AC275285.1, AC275291.1, AC275297.1, AC275298.1, AC275300.1, AC275301.1, AC275304.1, AC275305.1, AC278482.1, AC278741.1, AC278929.1, AC279018.1, AC279070.1.
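
The Phred QV scale used for concordance is a logarithmic error rate: QV = -10 log10(error rate). The sketch below shows the conversion in both directions. Counting errors as mismatches plus indels over the aligned length is an assumption about how concordance would be tallied, not a description of the evaluation actually used for the slide.

```python
import math


def phred_qv(errors: int, aligned_bases: int) -> float:
    """Phred-scaled quality from an error count over an aligned length."""
    if errors == 0:
        return float("inf")  # no observed errors, QV is unbounded
    return -10 * math.log10(errors / aligned_bases)


def error_rate(qv: float) -> float:
    """Per-base error rate implied by a Phred QV."""
    return 10 ** (-qv / 10)


print(round(phred_qv(1, 10_000), 1))  # 40.0
print(error_rate(45))                 # about 3.2e-05
```

A QV of 45 therefore corresponds to roughly one error in about 32,000 bases.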
  19. More Repetitive Genomes

    - Species shown: Chanos chanos (milkfish); Salmo trutta (brown trout)
    - N50: 800 kb; assembly size: 2.3G; assembly time: 70 mins
    - N50: 2.4 Mb; assembly size: 705M; assembly time: 44 mins
  20. Executable with Docker

    $ find /wd/pgp-1-data/ -name "*.fasta" | sort > pgp-1-seqdata.lst
    $ docker run -it -v /wd:/wd cschin/peregrine: \
        asm /wd/pgp-1-seqdata.lst 24 24 24 24 24 24 24 24 24 \
        --with-consensus \
        --shimmer-r 3 --best_n_ovlp 8 \
        --output /wd/pgp-1-asm-r3-pg0.1.5.0

    Follow @infoecho, @BDSFoundation for updates
  21. New “Short” Reads are Helpful to Correct Super Long Reads

    - PacBio’s CCS is useful for correcting super long Oxford Nanopore reads
    - Example on the right shows an 800k corrected read aligned at 99.8% accuracy
    - The Shimmer index will help to assemble such super long, high-accuracy reads super fast
  22. Summary & Future Development

    - As we get into the pan-genomics era for human and other larger genomes, relieving the computation burden for assembly is crucial for speeding up research.
    - Peregrine’s overlap time complexity is O(nc^2) instead of O(n^2), 10 to 100 times faster than any existing end-to-end assembler on good-quality data
    - The consensus module is capable of generating high-accuracy contigs (QV > 45) without any slow signal-level polishing step
    - With current sequencing technologies, it is possible to assemble a 3 Gbp genome in 100 minutes, and it will become even faster in the future. We can focus more on solving biological problems than computational ones.
    - Future work:
        - Diploid resolution, perhaps port the FALCON-Unzip module to Peregrine
        - Support super long read assembly
        - Pan-genomics analysis using the SHIMMER index for fast whole-genome alignments
  23. Acknowledgement

    - Mike Hunkapiller (PacBio), Paul Peluso (PacBio) for generating and providing free hg002 data
    - Glennis Logsdon, Mitchell Vollger, Eichler Lab for CHM13 data
    - Church Lab & Shilpa Garg for PGP-1 data
    - Chris Dunn for PypeFlow update
    - Arkarachai Fungtammasan (DNAnexus) for assembly evaluation
    - Mike Schatz and Heng Li for discussion

    Thanks For Your Attention
  24. @BDSFoundation