Speeding up Genome Assembly with Sparse and Hierarchical Minimizer Indices

Jason Chin, Asif Khalak (Twitter handles: @infoecho, @AsifKhalak)
Foundation of Biological Data Science (http://biologicaldatascience.org/)
Plant and Animal Genomics, San Diego, Jan 12, 2020

Disclaimer
I am currently an employee of DNAnexus. However, the information presented here is solely my own point of view and does not represent the viewpoint of DNAnexus or any other parties who may be mentioned in this presentation.

The Dawn of Long Noisy Read Genome Assembly: Hierarchical Genome Assembly Process
• Based on “inter-molecule” error correction/consensus
• Two overlap steps:
  • For error correction
  • For assembly graph construction
• The first human genome assembly done this way took 50+ CPU years.

Making Faster Assemblers With “Inter-molecule Consensus”
• More efficient overlappers for long noisy reads
  • MHAP (Berlin, et al.): locality-sensitive hashing
  • Daligner (Myers): cache coherence, high-performance distributed computing modules
• Led to faster assemblers
  • 2015: FALCON (based on daligner), Canu (based on MHAP)
  • post-2015: MECAT2, Miniasm2, wtdbg2/RedBean, Flye, Shasta

“Intra-molecule” Consensus Takes a Long Time To Come To Fruition: Better Long and Accurate Reads
• 10-year development
• It is possible to generate > 15 kb reads with > 99% accuracy from single-molecule sequencing.
https://twitter.com/PacBio/status/1136314570514022401

How Fast Can We Do Genome Assembly Now?

Claim: Human De Novo Genome Assembly in 100 Minutes

Genome | Approach | Total | N50 | Max | # seq
Theoretical ideal | n/a | 3.200 Gb | 249.0 Mb | 249.0 Mb | 23
HG002 (GCA_001542345.1) | Celera | 2.987 Gb | 4.5 Mb | 35.2 Mb | 12302
HG002 | current | 2.897-2.915 Gb | 26.5-35.3 Mb | 102.0-108 Mb | 2384-2571
chm13 (GCA_000983455.1) | daligner/FALCON | 2.851 Gb | 13.0 Mb | 63.14 Mb | 2873
chm13 | current | 2.839 Gb | 29.2-33.3 Mb | 95.4-102.0 Mb | 844-978
NA12878 (GCA_002077035.1) | FALCON | 2.858 Gb | 14.5 Mb | 52.4 Mb | 3635
NA12878 | current | 2.880 Gb | 25.4 Mb | 109.8 Mb | 3061

GCA accessions refer to the current human genome assemblies in GenBank.

Samples:
• HG002 (GIAB, resampled 2018): male child of a parent/child trio, PacBio Sequel
• CHM13 (Wash U, 2015): adult female, PacBio RS2
• NA12878 (Wash U, 2015): adult female, PacBio RS2

Claim: Human De Novo Genome Assembly in 100 Minutes
Assembly Performance on PacBio 2019 CCS HG002 Dataset
https://github.com/cschin/Peregrine

CPU hours for different genomes & parameter sets:

Run | seqdb building | indexing | overlapping | assembly | consensus
hg002-16-80-6-2 | 0.073 | 0.493 | 6.687 | 0.716 | 8.290
hg002-16-80-4-2 | 0.080 | 0.511 | 11.880 | 0.800 | 7.325
hg002-16-80-18-1 | 0.081 | 0.491 | 7.767 | 0.743 | 6.736
hg002-16-80-36-1 | 0.075 | 0.494 | 3.646 | 0.641 | 7.023
hg002_sequel2-16-80-6-2 | 0.027 | 0.256 | 4.139 | 0.463 | 4.397
chm13-16-80-4-2 | 0.067 | 0.443 | 6.822 | 0.630 | 7.453
chm13-16-80-3-2 | 0.061 | 0.452 | 8.897 | 0.682 | 6.297
pgp1-16-80-4-2 | 0.045 | 0.405 | 8.151 | 0.323 | 6.610
pgp1-16-80-3-2 | 0.047 | 0.403 | 10.846 | 0.365 | 7.145

Wall clock time is from 1 to 2 hours using 24 cores on m5d.metal or r5d.12xlarge on AWS.
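The per-stage CPU hours reported on this slide can be summed per run and compared with the stated 1 to 2 hour wall clock time. The sketch below does that arithmetic in Python. The "ideal" wall clock figure assumes perfect scaling across 24 cores, which is an assumption made here for illustration and not a claim from the talk; the gap between that lower bound and the reported wall clock time would then reflect stages that do not parallelize perfectly.

```python
# Per-stage CPU hours copied from the slide; tuple order follows STAGES.
STAGES = ("seqdb building", "indexing", "overlapping", "assembly", "consensus")
CPU_HOURS = {
    "hg002-16-80-6-2": (0.073, 0.493, 6.687, 0.716, 8.290),
    "hg002-16-80-4-2": (0.080, 0.511, 11.880, 0.800, 7.325),
    "hg002-16-80-18-1": (0.081, 0.491, 7.767, 0.743, 6.736),
    "hg002-16-80-36-1": (0.075, 0.494, 3.646, 0.641, 7.023),
    "hg002_sequel2-16-80-6-2": (0.027, 0.256, 4.139, 0.463, 4.397),
    "chm13-16-80-4-2": (0.067, 0.443, 6.822, 0.630, 7.453),
    "chm13-16-80-3-2": (0.061, 0.452, 8.897, 0.682, 6.297),
    "pgp1-16-80-4-2": (0.045, 0.405, 8.151, 0.323, 6.610),
    "pgp1-16-80-3-2": (0.047, 0.403, 10.846, 0.365, 7.145),
}
CORES = 24  # core count stated on the slide


def total_cpu_hours(run: str) -> float:
    """Sum the CPU hours of all stages for one run."""
    return sum(CPU_HOURS[run])


def ideal_wall_clock_hours(run: str, cores: int = CORES) -> float:
    """Lower bound on wall clock time, ASSUMING perfect parallel scaling."""
    return total_cpu_hours(run) / cores


for run in CPU_HOURS:
    print(f"{run}: {total_cpu_hours(run):.3f} CPU h, "
          f">= {ideal_wall_clock_hours(run):.2f} h on {CORES} cores")
```

Every run totals under about 21 CPU hours, so even the most expensive parameter set has an ideal lower bound below one hour on 24 cores.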

Accuracy Assessment (CHM13 dataset, 25x Coverage)
Error rate estimation using the VMRC59 BAC library for the CHM13 assembly

Contiguity Assessment (CHM13 dataset)
Contiguity of the Peregrine (current) assembly meets or exceeds that of the traditional approach.
• Total size: 2,860,555,444
• Max size: 95,429,376
• N50 size: 33,305,254
• N90 size: 4,575,882
• Number of contigs: 1,800
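For readers unfamiliar with the N50 and N90 statistics quoted here, the following is a minimal sketch of the usual definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches 50% (or 90%) of the assembly size. The function name `nxx` and the toy contig lengths are illustrative only and are not taken from the talk or from Peregrine.

```python
def nxx(contig_lengths, percent):
    """Return the Nxx statistic (e.g. percent=50 for N50).

    Contigs are visited longest first; the answer is the length of the
    contig at which the cumulative length reaches `percent` of the total.
    Integer arithmetic avoids floating-point edge cases at the threshold.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 100 >= total * percent:
            return length
    raise ValueError("contig_lengths must be non-empty")


# Toy example (made-up lengths, total = 300).
toy = [80, 70, 50, 40, 30, 20, 10]
print("N50:", nxx(toy, 50))  # cumulative 80, 150 -> reaches 150 at the 70 contig
print("N90:", nxx(toy, 90))  # cumulative reaches 270 at the 30 contig
```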

First, why is de novo assembly typically slow?
For the “Overlap-Layout-Consensus” paradigm, we compare all reads against all reads (read i vs. read j), which takes ~N^2 comparisons. We can speed this up by rejecting unmatched pairs faster, but we still have to check all pairs.
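To give a sense of scale for the all-vs-all comparison, here is a small back-of-the-envelope calculation. The genome size (~10^9) and read length (~10^4) are the order-of-magnitude values used elsewhere in this talk; the 30x coverage is an assumption chosen for illustration.

```python
def all_vs_all_pairs(n_reads: int) -> int:
    """Number of unordered read pairs an all-vs-all overlapper must consider."""
    return n_reads * (n_reads - 1) // 2


genome_size = 10**9   # G, order of magnitude from the talk
read_length = 10**4   # L, order of magnitude from the talk
coverage = 30         # C, ASSUMED here for illustration

n_reads = genome_size * coverage // read_length
print(f"reads: {n_reads:,}")
print(f"read pairs to check: {all_vs_all_pairs(n_reads):,}")
```

With these numbers there are 3,000,000 reads and roughly 4.5 x 10^12 read pairs, which is why rejecting pairs faster is not enough on its own.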

Learn From Experts Who Solve Jigsaw Puzzles Efficiently
https://www.youtube.com/watch?v=oRlCNXdcMc0

Why is Peregrine fast? SHIMMER Indices Rapidly Group Reads Into Many Specific Categories
Much like the efficient way to solve jigsaw puzzles, we group reads with unique “Sparse HIerarchical MiniMizER” (SHIMMER) pairs. Matching SHIMMER indices that are 600 bp apart dramatically reduces the overlap complexity compared with matching raw reads.
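One way to read "sparse hierarchical minimizers" is sketched below: first pick ordinary window minimizers over the k-mers of a read, then repeatedly apply the same window-minimum selection to the list of minimizers itself, so that each round leaves a sparser set of markers. This is a simplified illustration, not Peregrine's implementation. The CRC32 hash, the function names, and the default values of `k`, `w`, `r`, and `rounds` are all assumptions made for the example, and reverse-complement handling is omitted.

```python
import random
import zlib


def kmer_hash(kmer: str) -> int:
    """Deterministic stand-in hash (CRC32); an assumption, not Peregrine's hash."""
    return zlib.crc32(kmer.encode("ascii"))


def window_minimizers(items, w):
    """For every window of w consecutive (position, hash) items, keep the item
    with the smallest hash (ties go to the leftmost position)."""
    picked = set()
    for start in range(len(items) - w + 1):
        picked.add(min(items[start:start + w], key=lambda it: (it[1], it[0])))
    return sorted(picked)


def shimmers(seq, k=16, w=80, r=6, rounds=2):
    """Level-0 minimizers over k-mers, then `rounds` of re-selection with
    window size r over the minimizer list (hypothetical parameter names)."""
    kmers = [(i, kmer_hash(seq[i:i + k])) for i in range(len(seq) - k + 1)]
    level = window_minimizers(kmers, w)
    for _ in range(rounds):
        level = window_minimizers(level, r)
    return level


# Usage on a random 5 kb sequence (seeded for repeatability).
rng = random.Random(0)
read = "".join(rng.choice("ACGT") for _ in range(5000))
base = shimmers(read, rounds=0)    # plain minimizers
sparse = shimmers(read, rounds=2)  # sparser, hierarchical subset
print(len(base), "minimizers ->", len(sparse), "after two reduction rounds")
```

Because every reduction round only selects from the previous level, the sparse markers are always a subset of the original minimizers, so two overlapping reads still tend to share the same markers in their common region.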

Grouping Reads By SHIMMER Pairs
• A shared SHIMMER pair defines a group of reads that likely overlap with each other
• Confirm overlaps by base-to-base or minimizer-to-minimizer alignment
• Overlaps to assembly string graph
• Contigs
• Consensus (included in Peregrine)

Complexity ~ O(GC^2/d) instead of O(N^2) ~ O(G^2C^2/L^2)
• N: number of reads
• G: genome size ~ 10^9
• d: distance between minimizer pairs (MP) ~ 500
• L: read length ~ 10^4
• C: coverage

O(GC^2/d) is ~5000x faster than O(N^2).
Number of reads grouped ~ sequence coverage.
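The grouping step and the complexity estimate can be sketched as follows. Reads are bucketed by each pair of adjacent SHIMMER markers they contain, and only reads that share a bucket become overlap candidates. The speedup follows from dividing the two complexity expressions: (G^2C^2/L^2) / (GC^2/d) = G*d/L^2, so coverage cancels. The function names and the toy marker values are hypothetical; strand handling and the overlap confirmation step are left out.

```python
from collections import defaultdict
from itertools import combinations


def group_reads_by_pairs(read_markers):
    """read_markers: dict of read id -> ordered list of marker hashes.
    Returns a dict mapping each adjacent marker pair to the reads containing it."""
    groups = defaultdict(set)
    for read_id, marks in read_markers.items():
        for a, b in zip(marks, marks[1:]):
            groups[(a, b)].add(read_id)
    return groups


def candidate_overlaps(groups):
    """Only reads sharing at least one marker pair are compared further."""
    pairs = set()
    for reads in groups.values():
        pairs.update(combinations(sorted(reads), 2))
    return pairs


def speedup(G, L, d):
    """O(G^2 C^2 / L^2) divided by O(G C^2 / d) = G * d / L^2."""
    return G * d / L**2


# Toy input: r1 and r2 share the marker pair (3, 4); r3 shares nothing.
toy_reads = {"r1": [1, 2, 3, 4], "r2": [3, 4, 5], "r3": [9, 10]}
print(candidate_overlaps(group_reads_by_pairs(toy_reads)))
print(speedup(G=10**9, L=10**4, d=500))
```

With the slide's values (G ~ 10^9, L ~ 10^4, d ~ 500) the ratio works out to 5000, matching the ~5000x figure.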

Check out the preprint and the GitHub repository for algorithm details
https://www.biorxiv.org/content/10.1101/705616v1
https://github.com/cschin/Peregrine/

More Repetitive Genomes
• Salmo trutta (brown trout): N50: 800 kb, Assembly Size: 2.3G, Assembly Time: 70 mins
• Chanos chanos (milkfish): N50: 2.4 Mb, Assembly Size: 705M, Assembly Time: 44 mins
• Jamaica Lion Cannabis Strain

Genome Assembly @ Home: Mac vs. PC
• Mac Pro6,1: 6-core Intel Xeon E5, 64G RAM, 1T flash drive
• PC (homemade): 4-core Intel i5-7600K, 32G RAM, 512G NVMe SSD

A Jamaica Lion Cannabis Strain Draft Assembly: Assembly Results
• 30.6 Gb input sequences (~30x), average length = 20009 bp
• Assembly time:
  • PC, using 2 cores: 5.6 hours
  • Mac: < 7 hours with flash drive, ~13 hours without flash drive
• Total: 976,491,714; N50: 4,034,814; N90: 412,051

BUSCO category | Draft assembly | With CP & MT
Complete BUSCOs (C) | 1998 (94.2%) | 2063 (97.3%)
Complete and single-copy BUSCOs (S) | 1524 (71.9%) | 1576 (74.3%)
Complete and duplicated BUSCOs (D) | 474 (22.23%) | 487 (23.0%)
Fragmented BUSCOs (F) | 27 (1.3%) | 19 (0.9%)
Missing BUSCOs (M) | 96 (4.5%) | 39 (1.8%)
Total BUSCO groups searched | 2121 | 2121

Genome Assembly In The Cloud
• Assembly started with the 30Gb fasta file
• Assembly finished in 50 minutes (24-core peak usage)
After minor optimization:
• Total assembly size: 950.3 Mb
• Primary contig N50: 8.1 Mb
• Longest contig: 40.4 Mb

Genome Assembly On Premises
Customized 1U Server Built with Integrated Peregrine Assembler Workshop by ExaAI (https://www.exaai.io/en/)
• 1U, dual CPU (Xeon E5-2630 V4, total 20 cores)
• 512G memory, 10T NVMe SSD
• CentOS, GPU ready
Finished the Jamaica Lion cannabis assembly in ~70 minutes

THCAS/CBDAS/CBDAS2 cluster identified in a 3Mb region of a contig
[Figure panels: all reads; reads that cover THCAS/CBDAS/CBDAS2]

Repeat structures around the THCAS/CBDAS clusters revealed by the assembly

Comparing Two Haplotypes of The THCAS/CBDAS Cluster
[Figure panels: primary contig; alt. contig]
THCAS/CBDAS cluster: 5 copies in the alt. contig, 4 copies in the primary contig

Summary & Future Development
• As we get into the pan-genomics era for larger genomes, relieving the computational burden of assembly is crucial for speeding up research.
• The consensus module is capable of generating high-accuracy contigs (QV > 45) without any slow signal-level polishing step.
• With current sequencing technologies, it is possible to assemble a 3 Gbp genome in 100 minutes, and it will become even faster in the future. We can focus more on solving biological problems than on computational ones.
• Future work:
  • Diploid resolution, integrating biallelic variant phasing
  • Support for super-long-read assembly
  • Pan-genomics analysis using the SHIMMER index for fast whole-genome alignments
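To put the QV > 45 figure in concrete terms, the sketch below converts between quality value and per-base error rate, assuming QV follows the standard Phred scale (QV = -10 * log10(error rate)). The function names are illustrative.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Phred-scaled quality value to expected per-base error rate."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Per-base error rate to Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


p = qv_to_error_rate(45)
print(f"QV 45 ~ {p:.2e} errors per base, i.e. about 1 error per {1 / p:,.0f} bases")
```

Under this convention, QV 45 corresponds to roughly one error in every 31,600 bases.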

@BDSFoundation Jason Chin, Asif Khalak (Twitter Handle: @infoecho, @AsifKhalak) http://biologicaldatascience.org/

Acknowledgements
• Kevin McKernan (Medicinal Genomics) for the Jamaica Lion data set
• Charlie Hou (ExaAI) for computing support
• Mike Hunkapiller (PacBio), Paul Peluso (PacBio) for generating and providing free hg002 data
• Glennis Logsdon, Mitchell Vollger, Eichler Lab for CHM13 data
• Church Lab & Shilpa Garg for PGP-1 data
• Dario Cantu
• Kevin Fengler
• Chris Dunn for PypeFlow update
• Arkarachai Fungtammasan (DNAnexus) for assembly evaluation
• Mike Schatz and Heng Li for discussion

Thanks For Your Attention
