Assembling Human Genome in 100 Minutes

Jason Chin, Asif Khalak (Twitter: @infoecho, @AsifKhalak)
Foundation of Biological Data Science
Sequencing, Finishing and Analysis in the Future Meeting, May 23, 2019

The Dawn of Human Genome Assembly
Supercomputer used for Celera Genomics’ first WGS human assembly in early 2000

Fast Forward to 2014, The Dawn of Long Noisy Read Genome Assembly
- Two overlap steps:
  - For error correction
  - For assembly graph construction
- The first human genome assembly done this way took 50+ CPU years.

It Is Getting Better
- After HGAP for long read assembly, the community started to build overlappers better suited to, and more efficient on, noisy long reads
  - Konstantin Berlin et al.: MHAP, Locality Sensitive Hashing -> Canu Assembler
  - Gene Myers: daligner, cache coherence, high performance distributed computing modules -> FALCON Assembler
  - Myself: some earlier attempts that never saw the light
- More to come after 2015
  - MECAT, Canu, miniasm2, wtdbg2, flye, shasta

What Can We Do With a Bit More Accurate (>99%) and Longer (>10kb) Sequences for Assembly?
- Current assemblers can assemble the better data faster (not surprising)
- Most assemblers still need read-to-read comparison, which has a computing time complexity ~ O(n^2), n: number of reads
- New approaches like WTDBG2 can do assembly significantly faster than others
N50 = 12Mb to 29Mb depending on data quality / parameters
CPU time ~ 200 to 3000 CPU hours

What Can We Really Do? Get Rid of O(n^2) For OLC Assemblers!!
- Genome is mostly “linear”
- Genome assembly ~ sorting reads in a linear coordinate along the chromosomes
- No one uses an O(n^2) algorithm for sorting
- Radix Sort has complexity linear in the number of items
- How can we make genome assembly more like Radix Sort than Bubble Sort?
- A very simple idea: an indexing structure to quickly group reads that are highly likely to overlap with each other!!
- We need some efficient way to “bin” the reads

Minimizer Is An Awesome Data Structure
We still need a very efficient way to compute very sparse minimizers to reduce the number of bins.

One day I had this private discussion with Heng Li.

Sparse & HIerarchical MiniMizER (SHIMMER) Indexing
[Figure: Sequence -> K-mers over moving windows -> Hash Values -> Level-0 Minimizers -> Level-1 Minimizers -> Level-2 Minimizers. Labels: Digest for longer sequence; Smaller Index Size; Larger Index Size.]
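The pipeline in the figure can be illustrated with a minimal Python sketch. It is not Peregrine's implementation: the hash function (CRC32), the function names, the parameters `k`, `w`, `r`, `levels`, and the rule used to go from one level to the next (take the minimum-hash entry in every window of `r` consecutive lower-level minimizers) are all assumptions chosen for illustration.

```python
import zlib

def kmer_hash(kmer):
    # Deterministic stand-in hash (assumption); any well-mixed hash would do.
    return zlib.crc32(kmer.encode("ascii"))

def window_minima(items, w):
    """Slide a window over `w` consecutive (position, hash) items and keep
    the minimum-hash item of every window. Returns items sorted by position."""
    if not items:
        return []
    w = min(w, len(items))
    picked = set()
    for start in range(len(items) - w + 1):
        window = items[start:start + w]
        # Ties are broken by position so the choice is deterministic.
        picked.add(min(window, key=lambda ph: (ph[1], ph[0])))
    return sorted(picked)

def shimmer(seq, k=16, w=80, r=6, levels=2):
    """Return [level-0, level-1, ...] minimizer lists for one sequence.
    Level 0: minimizers over windows of `w` k-mers.
    Level i+1 (assumed rule): minimizers over windows of `r` level-i entries."""
    hashes = [(i, kmer_hash(seq[i:i + k])) for i in range(len(seq) - k + 1)]
    layers = [window_minima(hashes, w)]
    for _ in range(levels):
        layers.append(window_minima(layers[-1], r))
    return layers
```

Each level is a subset of the one below it, so higher levels give a sparser digest of the same read and a smaller index. The window scan here is the naive O(length x window) version, kept for readability.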

Building SHIMMER Along a Read

Mapping Neighboring Minimizers to Reads
- Two reads that overlap with each other are likely to share a co-linear set of SHIMMERs.
- Build hash-map F: (Minimizer Pair) → [Read1, Read2, …] for all reads that have the same neighboring minimizer pairs
[Figure: Read 1 and Read 2 with shared neighboring minimizer pairs; inconsistency caused by errors]
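A minimal sketch of the hash-map F, assuming each read has already been reduced to an ordered list of minimizer hash values. The function names and the toy integers are hypothetical, and strand handling (reverse-complement reads) is left out.

```python
from collections import defaultdict

def build_pair_index(read_minimizers):
    """F: (minimizer, next minimizer) -> list of read ids containing that pair.
    `read_minimizers` maps read id -> ordered list of minimizer hashes."""
    index = defaultdict(list)
    for read_id, mmers in read_minimizers.items():
        for left, right in zip(mmers, mmers[1:]):
            index[(left, right)].append(read_id)
    return index

def overlap_candidates(index):
    """Count shared neighboring pairs for every pair of reads that fall
    into at least one common bin."""
    counts = defaultdict(int)
    for reads in index.values():
        uniq = sorted(set(reads))
        for i, a in enumerate(uniq):
            for b in uniq[i + 1:]:
                counts[(a, b)] += 1
    return dict(counts)

reads = {
    "read1": [11, 42, 7, 93, 58],
    "read2": [7, 93, 58, 20],   # shares (7, 93) and (93, 58) with read1
    "read3": [64, 5, 31],       # shares nothing
}
candidates = overlap_candidates(build_pair_index(reads))
print(candidates)  # {('read1', 'read2'): 2}
```

Only reads that land in the same bin are ever compared, which is what removes the all-against-all step.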

Grouping Reads By Neighboring Minimizer Pairs
- A shared minimizer pair defines a group of reads that are likely to overlap with each other
- Confirm overlaps by base-to-base or minimizer-to-minimizer alignment
- Overlaps -> Assembly String Graph -> Contigs
- Complexity ~ O(nc^2) instead of O(n^2), n: number of reads, c: coverage
- We have c^2 ≪ n
- Number of reads grouped ~ sequence coverage
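To get a feel for the gap between O(n^2) and O(nc^2), here is a back-of-the-envelope count in Python. The 3 Gb genome size and 30x coverage come from the slides; the 15 kb mean read length is an assumption made only for this example, and the numbers are comparison counts, not run times.

```python
def comparison_counts(genome_size, coverage, mean_read_len):
    """Return (n, all-pairs comparisons ~ n^2, binned comparisons ~ n * c^2)."""
    n = genome_size * coverage // mean_read_len  # approximate number of reads
    return n, n * n, n * coverage * coverage

n, all_pairs, binned = comparison_counts(
    genome_size=3_000_000_000,  # ~3G bp human genome
    coverage=30,                # ~30x data
    mean_read_len=15_000,       # assumed read length, not from the slides
)
print(n)                    # 6000000
print(all_pairs)            # 36000000000000
print(binned)               # 5400000000
print(all_pairs // binned)  # 6666
```

With these inputs c^2 = 900 while n = 6,000,000, so the binned approach does thousands of times fewer candidate comparisons.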

~ 30x human data

Shimmer Index Used for Fast Reads to Draft Contig Alignment for Consensus
- Re-use the Shimmer Index for the reads
- Build the Shimmer Index for the draft contigs
- Shared neighboring minimizer pairs
- Super fast mapping for consensus
- Full genome consensus polishing with the “falcon-sense” DAG consensus module
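One way to picture read-to-contig mapping with shared neighboring minimizer pairs is a simple voting scheme, sketched below. This is an illustration under assumptions, not the Peregrine code: the function names are hypothetical, positions are counted in minimizer rank rather than bases, and strand is ignored.

```python
from collections import Counter

def index_contigs(contig_minimizers):
    """Map each neighboring minimizer pair to the (contig id, rank) where it occurs."""
    index = {}
    for contig_id, mmers in contig_minimizers.items():
        for pos, pair in enumerate(zip(mmers, mmers[1:])):
            index.setdefault(pair, []).append((contig_id, pos))
    return index

def place_read(read_mmers, contig_index):
    """Vote for (contig, offset); co-linear shared pairs agree on one offset.
    Returns (contig id, offset in minimizer rank, supporting pairs) or None."""
    votes = Counter()
    for i, pair in enumerate(zip(read_mmers, read_mmers[1:])):
        for contig_id, pos in contig_index.get(pair, []):
            votes[(contig_id, pos - i)] += 1
    if not votes:
        return None
    (contig_id, offset), hits = votes.most_common(1)[0]
    return contig_id, offset, hits

contigs = {"ctg1": [5, 9, 2, 7, 4, 8], "ctg2": [1, 3, 6]}
contig_index = index_contigs(contigs)
print(place_read([2, 7, 4, 8], contig_index))  # ('ctg1', 2, 3)
print(place_read([10, 11], contig_index))      # None
```

The placement only narrows down where a read belongs; a consensus step would still need an alignment of the read against that contig region.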

CPU Usage For A Couple Human Genomes With Different Parameter Sets

CPU hours for different genomes & parameter sets:

Run                       seqdb building  indexing  overlapping  assembly  consensus
hg002-16-80-6-2                    0.073     0.493        6.687     0.716      8.290
hg002-16-80-4-2                    0.080     0.511       11.880     0.800      7.325
hg002-16-80-18-1                   0.081     0.491        7.767     0.743      6.736
hg002-16-80-36-1                   0.075     0.494        3.646     0.641      7.023
hg002_sequel2-16-80-6-2            0.027     0.256        4.139     0.463      4.397
chm13-16-80-4-2                    0.067     0.443        6.822     0.630      7.453
chm13-16-80-3-2                    0.061     0.452        8.897     0.682      6.297
pgp1-16-80-4-2                     0.045     0.405        8.151     0.323      6.610
pgp1-16-80-3-2                     0.047     0.403       10.846     0.365      7.145

Wall clock time is from 1 to 2 hours using 24 cores on m5d.metal or r5d.12xlarge on AWS.
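As a quick sanity check on these numbers, the per-stage CPU hours can be summed and divided by the core count. The sketch below uses the hg002-16-80-6-2 and hg002_sequel2-16-80-6-2 values from the chart; the perfect-scaling estimate is an assumption that gives a lower bound on wall clock time, not a measurement.

```python
STAGES = ("seqdb building", "indexing", "overlapping", "assembly", "consensus")

# CPU hours per stage, in the order of STAGES, copied from the chart.
CPU_HOURS = {
    "hg002-16-80-6-2": (0.073, 0.493, 6.687, 0.716, 8.290),
    "hg002_sequel2-16-80-6-2": (0.027, 0.256, 4.139, 0.463, 4.397),
}

def total_cpu_hours(run):
    """Sum of all stages for one run."""
    return sum(CPU_HOURS[run])

def ideal_wall_minutes(run, cores=24):
    """Wall clock minutes if every stage scaled perfectly across `cores`
    (an idealization; real runs have serial and I/O-bound phases)."""
    return total_cpu_hours(run) / cores * 60

for run in CPU_HOURS:
    print(run, round(total_cpu_hours(run), 3), round(ideal_wall_minutes(run), 1))
```

The ideal figures come out well under the reported 1 to 2 hours, which is what one would expect if parts of the pipeline do not keep all 24 cores busy.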

Contiguity

Run                       N50 (bp)
hg002-16-80-6-2           27,848,727
hg002-16-80-4-2           33,364,927
hg002-16-80-18-1          26,459,768
hg002-16-80-36-1           6,746,698
hg002_sequel2-16-80-6-2   21,936,975
chm13-16-80-4-2           29,260,433
chm13-16-80-3-2           33,307,555
pgp1-16-80-4-2            27,316,706
pgp1-16-80-3-2            28,222,972
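For readers less familiar with the metric, N50 is the contig length L such that contigs of length L or longer contain at least half of the assembled bases. A minimal Python sketch of that standard definition (the function name and the toy contig lengths are illustrative):

```python
def n50(lengths):
    """Length of the contig at which the running total, taken from the
    longest contig down, first reaches half of the assembly size."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

In the toy example the total is 54; the three longest contigs (10 + 9 + 8 = 27) reach half of it, so N50 is 8.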

Accuracy Assessment
[Figure: Accuracy Evaluation for CHM13 Assembly. Concordance (Phred QV scale, axis 0 to 60) of draft contigs and polished contigs against Homo sapiens BAC clone VMRC59 sequences: AC270115.1, AC270117.1, AC270118.1, AC270119.1, AC270120.1, AC270122.1, AC270131.1, AC270132.1, AC270133.1, AC270134.1, AC270135.1, AC270136.1, AC270137.1, AC270145.1, AC270146.1, AC270238.1, AC275285.1, AC275291.1, AC275297.1, AC275298.1, AC275300.1, AC275301.1, AC275304.1, AC275305.1, AC278482.1, AC278741.1, AC278929.1, AC279018.1, AC279070.1]
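The Phred QV scale used for concordance is QV = -10 * log10(error rate). A minimal sketch of the conversion in Python, assuming concordance is measured as matching bases over aligned bases (the function names and example counts are illustrative, not taken from the evaluation):

```python
import math

def phred_qv(matches, aligned_length):
    """Phred-scaled concordance: -10 * log10(fraction of mismatching bases)."""
    errors = aligned_length - matches
    if errors == 0:
        return float("inf")  # a perfect alignment has no finite QV
    return -10 * math.log10(errors / aligned_length)

def qv_to_error_rate(qv):
    """Inverse conversion: expected errors per base at a given QV."""
    return 10 ** (-qv / 10)

print(phred_qv(99_999, 100_000))  # one error in 100 kb, about QV 50
print(qv_to_error_rate(40))       # about one error per 10 kb
```

On this scale every 10 QV points correspond to a tenfold drop in error rate.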

More Repetitive Genomes
Chanos chanos (milkfish) and Salmo trutta (brown trout)
- N50: 800 kb, Assembly Size: 2.3G, Assembly Time: 70 mins
- N50: 2.4 Mb, Assembly Size: 705M, Assembly Time: 44 mins

Executable with Docker

$ find /wd/pgp-1-data/ -name "*.fasta" | sort > pgp-1-seqdata.lst
$ docker run -it -v /wd:/wd cschin/peregrine:0.1.5.0 \
    asm /wd/pgp-1-seqdata.lst 24 24 24 24 24 24 24 24 24 \
    --with-consensus \
    --shimmer-r 3 --best_n_ovlp 8 \
    --output /wd/pgp-1-asm-r3-pg0.1.5.0

https://github.com/cschin/Peregrine/blob/master/README.md
Follow @infoecho, @BDSFoundation for updates

New “Short” Reads are Helpful to Correct Super Long Reads
- PacBio’s CCS is useful for correcting super long Oxford Nanopore reads
- Example on the right shows an 800k corrected read aligned at 99.8% accuracy
- Shimmer index will help to assemble such super long high accuracy reads super fast

Summary & Future Development
- As we get into the pan-genomics era for human and other larger genomes, relieving the computational burden of assembly is crucial for speeding up research.
- Peregrine’s overlap time complexity is O(nc^2) instead of O(n^2); it is 10 to 100 times faster than any existing end-to-end assembler on good quality data
- The consensus module is capable of generating high accuracy contigs (QV >45) without any slow signal level polishing step
- With current sequencing technologies, it is possible to assemble a 3G bp genome in 100 minutes, and it will become even faster in the future. We can focus more on solving biological problems than computational ones.
- Future work:
  - Diploid resolution, perhaps port FALCON-Unzip module to Peregrine
  - Support super long read assembly
  - Pan-genomics analysis using SHIMMER Index for fast whole genome alignments

Acknowledgement
- Mike Hunkapiller (PacBio), Paul Peluso (PacBio) for generating and providing free hg002 data
- Glennis Logsdon, Mitchell Vollger, Eichler Lab for CHM13 data
- Church Lab & Shilpa Garg for PGP-1 data
- Chris Dunn for PypeFlow update
- Arkarachai Fungtammasan (DNAnexus) for assembly evaluation
- Mike Schatz and Heng Li for discussion

Thanks For Your Attention

@BDSFoundation
