Speeding up Genome Assembly with Sparse and Hierarchical Minimizer Indices

Slide 1

Slide 1 text

Speeding up Genome Assembly with Sparse and Hierarchical Minimizer Indices
Jason Chin, Asif Khalak (Twitter handles: @infoecho, @AsifKhalak)
Foundation of Biological Data Science (http://biologicaldatascience.org/)
Plant and Animal Genomics, San Diego, Jan 12, 2020

Slide 2

Slide 2 text

Disclaimer: I am currently an employee of DNAnexus. However, the information presented here is solely my own point of view and does not represent the viewpoint of DNAnexus or any other parties who may be mentioned in this presentation.

Slide 3

Slide 3 text

The Dawn of Long Noisy Read Genome Assembly: Hierarchical Genome Assembly Process
• Based on “inter-molecule” error correction/consensus
• Two overlap steps:
  • For error correction
  • For assembly graph construction
• The first human genome assembly done this way took 50+ CPU years.

Slide 4

Slide 4 text

Making Faster Assemblers With “Inter-molecule Consensus”
• More efficient overlappers for long noisy reads
  • MHAP (Berlin, et al.): locality-sensitive hashing
  • Daligner (Myers): cache coherence, high-performance distributed computing modules
• Led to faster assemblers
  • 2015: FALCON (based on daligner), Canu (based on MHAP)
  • Post-2015: MECAT2, Miniasm2, wtdbg2/RedBean, Flye, Shasta

Slide 5

Slide 5 text

“Intra-molecule” Consensus Takes a Long Time to Come to Fruition: Better Long and Accurate Reads
10-year development
It is possible to generate reads longer than 15 kb with greater than 99% accuracy from single-molecule sequencing.
https://twitter.com/PacBio/status/1136314570514022401

Slide 6

Slide 6 text

How Fast Can We Do Genome Assembly Now?

Slide 7

Slide 7 text

Claim: Human De Novo Genome Assembly in 100 Minutes

Genome | Approach | Total | N50 | Max | # seq
Theoretical ideal | n/a | 3.200 Gb | 249.0 Mb | 249.0 Mb | 23
HG002 (GCA_001542345.1) | Celera | 2.987 Gb | 4.5 Mb | 35.2 Mb | 12302
HG002 | current | 2.897-2.915 Gb | 26.5-35.3 Mb | 102.0-108 Mb | 2384-2571
chm13 (GCA_000983455.1) | daligner/FALCON | 2.851 Gb | 13.0 Mb | 63.14 Mb | 2873
chm13 | current | 2.839 Gb | 29.2-33.3 Mb | 95.4-102.0 Mb | 844-978
NA12878 (GCA_002077035.1) | FALCON | 2.858 Gb | 14.5 Mb | 52.4 Mb | 3635
NA12878 | current | 2.880 Gb | 25.4 Mb | 109.8 Mb | 3061

Rows with a GCA accession: Current Human Genome Assemblies in GenBank

Samples:
• HG002 (GIAB, resampled 2018): male child of a parent/child trio, PacBio Sequel
• CHM13 (Wash U, 2015): adult female, PacBio RS2
• NA12878 (Wash U, 2015): adult female, PacBio RS2

Slide 8

Slide 8 text

Claim: Human De Novo Genome Assembly in 100 Minutes
Assembly Performance on PacBio 2019 CCS HG002 Dataset
https://github.com/cschin/Peregrine

[Bar chart: CPU hours for different genomes and parameter sets]

Run | seqdb building | indexing | overlapping | assembly | consensus
hg002-16-80-6-2 | 0.073 | 0.493 | 6.687 | 0.716 | 8.290
hg002-16-80-4-2 | 0.080 | 0.511 | 11.880 | 0.800 | 7.325
hg002-16-80-18-1 | 0.081 | 0.491 | 7.767 | 0.743 | 6.736
hg002-16-80-36-1 | 0.075 | 0.494 | 3.646 | 0.641 | 7.023
hg002_sequel2-16-80-6-2 | 0.027 | 0.256 | 4.139 | 0.463 | 4.397
chm13-16-80-4-2 | 0.067 | 0.443 | 6.822 | 0.630 | 7.453
chm13-16-80-3-2 | 0.061 | 0.452 | 8.897 | 0.682 | 6.297
pgp1-16-80-4-2 | 0.045 | 0.405 | 8.151 | 0.323 | 6.610
pgp1-16-80-3-2 | 0.047 | 0.403 | 10.846 | 0.365 | 7.145

All values are CPU hours. Wall-clock time is 1 to 2 hours using 24 cores on m5d.metal or r5d.12xlarge on AWS.
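As a rough cross-check of the wall-clock figure, the per-stage CPU hours from the chart can be summed per run and divided by the 24 cores quoted on the slide. The sketch below does this for three of the runs. The helper names are hypothetical, and the perfect-scaling assumption is made here only for illustration, so the result is a lower bound rather than a predicted run time.

```python
# Per-stage CPU hours copied from the chart data for three of the runs.
# Stage order: seqdb building, indexing, overlapping, assembly, consensus.
cpu_hours = {
    "hg002-16-80-6-2": [0.073, 0.493, 6.687, 0.716, 8.290],
    "hg002_sequel2-16-80-6-2": [0.027, 0.256, 4.139, 0.463, 4.397],
    "chm13-16-80-4-2": [0.067, 0.443, 6.822, 0.630, 7.453],
}

CORES = 24  # core count quoted on the slide


def total_cpu_hours(stages):
    """Sum the CPU hours of all pipeline stages for one run."""
    return sum(stages)


def ideal_wall_clock_minutes(stages, cores=CORES):
    """Lower bound on wall-clock minutes, ASSUMING perfect scaling over all cores."""
    return total_cpu_hours(stages) / cores * 60


for name, stages in cpu_hours.items():
    total = total_cpu_hours(stages)
    bound = ideal_wall_clock_minutes(stages)
    print(f"{name}: {total:.3f} CPU h, at least {bound:.0f} min on {CORES} cores")
```

For hg002-16-80-6-2 this gives about 16.3 CPU hours, or roughly 41 minutes under perfect scaling. Real runs include serial phases and I/O, which is consistent with the reported 1 to 2 hours.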

Slide 9

Slide 9 text

Accuracy Assessment (CHM13 dataset, 25x Coverage)
Error rate estimation using the VMRC59 BAC library for the CHM13 assembly

Slide 10

Slide 10 text

Contiguity Assessment (CHM13 dataset)
Contiguity of the Peregrine (current) assembly meets or exceeds the traditional approach
• Total size: 2,860,555,444
• Max size: 95,429,376
• N50 size: 33,305,254
• N90 size: 4,575,882
• Number of contigs: 1,800
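For readers less familiar with the N50 and N90 statistics quoted here, the following is a minimal sketch of the standard definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches 50% (or 90%) of the assembly size. The function name `nx` and the toy contig lengths are illustrative assumptions, not data from the CHM13 assembly.

```python
def nx(lengths, percent):
    """Return the Nx statistic for a list of contig lengths.

    Contigs are visited longest first; the result is the length of the
    contig at which the running total first reaches `percent` percent
    of the total assembly size. Integer arithmetic avoids rounding issues.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * percent:
            return length
    return min(lengths)  # not reached for valid input


# Toy example (hypothetical contig lengths, total = 300)
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
n50 = nx(toy_contigs, 50)
n90 = nx(toy_contigs, 90)
print(n50, n90)
```

A higher N50 for the same total size means that more of the assembly sits in long contigs, which is why it is used as the headline contiguity number.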

Slide 11

Slide 11 text

How?

Slide 12

Slide 12 text

First, why is de novo assembly typically slow?
[Figure: N x N comparison matrix, read i vs. read j]
For the “Overlap-Layout-Consensus” paradigm, we compare all reads against all reads, which takes ~N^2 comparisons. We can speed this up by rejecting unmatched pairs faster, but we still have to check all pairs.

Slide 13

Slide 13 text

Learn From Experts Who Solve Jigsaw Puzzles Efficiently
https://www.youtube.com/watch?v=oRlCNXdcMc0

Slide 14

Slide 14 text

Why is Peregrine fast? SHIMMER Indices Rapidly Group Reads Into Many Specific Categories
Like the efficient way to solve jigsaw puzzles, we group reads with unique “Sparse HIerarchical MiniMizER” (SHIMMER) pairs.
Matching SHIMMER indices that are 600 bp apart dramatically reduces the overlap complexity compared with matching raw reads.
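The idea of grouping reads by pairs of sparse, hierarchically selected minimizers can be sketched in a few lines. This is an illustrative toy, not Peregrine's implementation: the parameters `k`, `w`, and `r`, the use of `zlib.crc32` as the hash, the random test genome, and all function names are assumptions chosen for the example.

```python
import random
import zlib
from collections import defaultdict


def kmer_hash(kmer):
    # Deterministic stand-in hash; any well-mixed hash function would do here.
    return zlib.crc32(kmer.encode("ascii"))


def window_minima(values, w):
    """Indices of the minimum of every window of w consecutive values
    (ties broken by leftmost index), de-duplicated and in order."""
    picked = []
    for start in range(len(values) - w + 1):
        best = min(range(start, start + w), key=lambda i: (values[i], i))
        if not picked or picked[-1] != best:
            picked.append(best)
    return picked


def shimmers(seq, k=8, w=8, r=4):
    """Two-level reduction: k-mer minimizers, then minimizers of minimizers."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    level1 = window_minima(hashes, w)            # positions in seq
    level1_hashes = [hashes[i] for i in level1]
    level2 = window_minima(level1_hashes, r)     # indices into level1
    return [(level1[j], level1_hashes[j]) for j in level2]  # (position, hash)


def shimmer_pairs(seq, **params):
    """Pairs of hashes of adjacent sparse minimizers along one read."""
    marks = shimmers(seq, **params)
    return [(a[1], b[1]) for a, b in zip(marks, marks[1:])]


def group_reads(reads, **params):
    """Map each minimizer pair to the set of reads that contain it."""
    groups = defaultdict(set)
    for name, seq in reads.items():
        for pair in shimmer_pairs(seq, **params):
            groups[pair].add(name)
    return groups


# Toy data: three error-free reads tiled over a random "genome".
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(3000))
reads = {
    "r1": genome[0:1200],
    "r2": genome[800:2000],    # overlaps r1 by 400 bp
    "r3": genome[1600:2800],   # overlaps r2 by 400 bp
}
groups = group_reads(reads)
candidates = {frozenset(names) for names in groups.values() if len(names) > 1}
print(candidates)
```

Only reads that land in the same group need to be compared in detail, so the expensive alignment step runs on a small set of candidate pairs instead of on every pair of reads.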

Slide 15

Slide 15 text

Grouping Reads By SHIMMER Pairs
[Figure: workflow]
• Shared SHIMMER pair: a group of reads that likely overlap with each other
• Confirm overlaps by base-to-base or minimizer-to-minimizer alignment
• Overlaps to assembly string graph
• Contigs
• Consensus (included in Peregrine)

Complexity ~ O(G C^2 / d) instead of O(N^2) ~ O(G^2 C^2 / L^2)
• N: number of reads
• G: genome size ~ 10^9
• d: distance between MP ~ 500
• L: read length ~ 10^4
• C: coverage
O(G C^2 / d) is ~5000x faster than O(N^2)
Number of reads grouped ~ sequence coverage
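The ~5000x figure follows directly from the two cost expressions: dividing G^2 C^2 / L^2 by G C^2 / d leaves G d / L^2, so the coverage cancels. The short sketch below plugs in the values from the slide. The coverage value of 30 and the function names are assumptions made only so the code runs, and constant factors hidden by the O-notation are ignored.

```python
G = 10**9   # genome size (bp), from the slide
d = 500     # distance between minimizer pairs (bp), from the slide
L = 10**4   # read length (bp), from the slide
C = 30      # coverage; ASSUMED for illustration, it cancels in the ratio


def all_vs_all_cost(G, C, L):
    """Cost of comparing every read with every read: N^2 with N = G*C/L."""
    n_reads = G * C / L
    return n_reads ** 2


def shimmer_cost(G, C, d):
    """Cost of grouping by minimizer pairs: about G/d groups of ~C reads each,
    with ~C^2 comparisons inside each group."""
    return G * C ** 2 / d


speedup = all_vs_all_cost(G, C, L) / shimmer_cost(G, C, d)  # equals G*d/L^2
print(f"estimated speedup: {speedup:.0f}x")
```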

Slide 16

Slide 16 text

Check out the preprint and the GitHub repository for algorithm details
https://www.biorxiv.org/content/10.1101/705616v1
https://github.com/cschin/Peregrine/

Slide 17

Slide 17 text

More Repetitive Genomes
• Salmo trutta (brown trout): N50 800 kb, assembly size 2.3 G, assembly time 70 mins
• Chanos chanos (milkfish): N50 2.4 Mb, assembly size 705 M, assembly time 44 mins
• Jamaica Lion Cannabis strain

Slide 18

Slide 18 text

Genome Assembly @ Home: Mac vs. PC
Mac Pro6,1
• 6-core Intel Xeon E5
• 64 GB RAM
• 1 TB flash drive
PC (homemade)
• 4-core Intel i5-7600K
• 32 GB RAM
• 512 GB NVMe SSD

Slide 19

Slide 19 text

A Jamaica Lion Cannabis Strain Draft Assembly
Assembly Results
• 30.6 Gb input sequences (~30x), average length = 20009 bp
• Assembly time:
  • PC, using 2 cores: 5.6 hours
  • Mac: < 7 hours with flash drive, ~13 hours without flash drive

Total: 976,491,714
N50: 4,034,814
N90: 412,051

With CP & MT
Complete BUSCOs (C) | 1998 (94.2%) | 2063 (97.3%)
Complete and single-copy BUSCOs (S) | 1524 (71.9%) | 1576 (74.3%)
Complete and duplicated BUSCOs (D) | 474 (22.3%) | 487 (23.0%)
Fragmented BUSCOs (F) | 27 (1.3%) | 19 (0.9%)
Missing BUSCOs (M) | 96 (4.5%) | 39 (1.8%)
Total BUSCO groups searched | 2121 | 2121

Slide 20

Slide 20 text

Genome Assembly In The Cloud
• Assembly started with the 30 Gb FASTA file
• Assembly finished in 50 minutes
• 24-core peak usage
After minor optimization:
• Total assembly size: 950.3 Mb
• Primary contig N50: 8.1 Mb
• Longest contig: 40.4 Mb

Slide 21

Slide 21 text

Genome Assembly On Premises
Customized 1U Server Built with Integrated Peregrine Assembler Workshop by ExaAI (https://www.exaai.io/en/)
• 1U, dual CPU (Xeon E5-2630 V4, 20 cores total)
• 512 GB memory, 10 TB NVMe SSD
• CentOS, GPU ready
Finish the Jamaica Lion Cannabis Assembly in ~70 Minutes

Slide 22

Slide 22 text

THCAS/CBDAS/CBDAS2 cluster identified in a 3 Mb region of a contig
[Figure panels: all reads; reads that cover THCAS/CBDAS/CBDAS2]

Slide 23

Slide 23 text

Repeat structures around the THCAS/CBDAS clusters revealed by the assembly

Slide 24

Slide 24 text

Comparing Two Haplotypes of the THCAS/CBDAS Cluster
[Figure: primary contig vs. alt. contig, THCAS/CBDAS cluster]
• 5 copies in the alt. contig
• 4 copies in the primary contig

Slide 25

Slide 25 text

Summary & Future Development
• As we enter the pan-genomics era for larger genomes, relieving the computational burden of assembly is crucial for speeding up research.
• The consensus module can generate high-accuracy contigs (QV > 45) without any slow signal-level polishing step.
• With current sequencing technologies, it is possible to assemble a 3 Gbp genome in 100 minutes, and it will become even faster in the future. We can focus more on solving biological problems than computational ones.
• Future work:
  • Diploid resolution, integrating biallelic variant phasing
  • Support for super-long-read assembly
  • Pan-genomics analysis using the SHIMMER index for fast whole-genome alignments

Slide 26

Slide 26 text

@BDSFoundation Jason Chin, Asif Khalak (Twitter Handle: @infoecho, @AsifKhalak) http://biologicaldatascience.org/

Slide 27

Slide 27 text

Acknowledgements
• Kevin McKernan (Medicinal Genomics), Jamaica Lion data set
• Charlie Hou (ExaAI) for computing support
• Mike Hunkapiller (PacBio), Paul Peluso (PacBio) for generating and providing free hg002 data
• Glennis Logsdon, Mitchell Vollger, Eichler Lab for CHM13 data
• Church Lab & Shilpa Garg for PGP-1 data
• Dario Cantu
• Kevin Fengler
• Chris Dunn for PypeFlow update
• Arkarachai Fungtammasan (DNAnexus) for assembly evaluation
• Mike Schatz and Heng Li for discussion
Thanks For Your Attention