Assembling Human Genome in 100 Minutes

Jason Chin
May 23, 2019

How can we do super-fast genome assembly with a "SHIMMER" indexing scheme?

Abstract

De novo genome assembly is the most unbiased way to acquire comprehensive genomic information and to gain insight into new DNA sequences that may not exist in reference genomes. Many de novo human genomes have been published in the last couple of years, leveraging cheaper short-read and single-molecule long-read technologies. As the scale of sequencing work grows, the computational burden of generating assemblies persists. The most common long-read assembly framework, which uses the overlap-layout-consensus paradigm, requires all-to-all read comparisons. The computational complexity of this comparison step scales quadratically with the number of reads. Most methods still require hundreds to thousands of CPU hours, although various techniques have been developed to quickly reject non-overlapping pairs or to reduce the extra computation for repeats. The high computational requirement persists even for more accurate long reads (accuracy ~99% and length ~11 to 15 kb), which are achievable with current sequencing technologies.
We introduce the de novo assembler Peregrine, which uses a novel minimizer-based read indexing scheme. This removes the need for all-to-all read comparisons. Instead, read pairs with a high probability of overlapping are gathered in one step using the index and then compared. In our initial implementation, we can assemble 28x to 32x human PacBio CCS read datasets to high contiguity (N50 > 20 Mb) in less than 20 CPU hours and two wall-clock hours. The continued advance of sequencing technologies in read length and base accuracy, together with Peregrine, will enable routine generation of human de novo assemblies. This leads to a more comprehensive representation of genomic variation at population scale, beyond SNPs and small indels. We have also applied Peregrine successfully to non-mammalian genomes such as plants. Future implementations will enable the use of less accurate long reads, such as Oxford Nanopore reads and longer PacBio reads.

Transcript

  1. Jason Chin, Asif Khalak (Twitter: @infoecho, @AsifKhalak)
    Foundation of Biological Data Science
    Sequencing, Finishing and Analysis in the Future Meeting, May 23, 2019
    Assembling Human Genome in 100 Minutes

  2. The Dawn of Human Genome Assembly
    Supercomputer used for Celera Genomics’ first WGS human assembly in early 2000

  3. Fast Forward to 2014, The Dawn of Long Noisy Read Genome Assembly
    - Two overlap steps:
      - For error correction
      - For assembly graph construction
    - First human genome assembly done this way took 50+ CPU years.

  4. It Is Getting Better
    - After HGAP for long-read assembly, the community started to build overlappers designed to be more efficient for noisy long reads
      - Konstantin Berlin et al.: MHAP, locality-sensitive hashing -> Canu assembler
      - Gene Myers: daligner, cache coherence, high-performance distributed computing modules -> FALCON assembler
      - Myself: some earlier attempts that never saw the light of day
    - More to come after 2015
      - MECAT, Canu, miniasm2, wtdbg2, flye, shasta

  5. What Can We Do With a Bit More Accurate (>99%) and Longer (>10kb) Sequences for Assembly?
    - Current assemblers can assemble the better data faster (not surprising)
    - Most assemblers still need read-to-read comparison, which has a computing time complexity of ~O(n^2), where n is the number of reads
    - New approaches like wtdbg2 can do assembly significantly faster than others
    N50 = 12 Mb to 29 Mb depending on data quality / parameters
    CPU time ~ 200 to 3000 CPU hours

  6. What Can We Really Do? Get Rid of O(n^2) For OLC Assemblers!!
    - Genomes are mostly “linear”
    - Genome assembly ~ sorting reads in a linear coordinate along the chromosomes
    - No one uses an O(n^2) algorithm for sorting
    - Radix sort has complexity linear in the number of items
    - How can we make genome assembly more like radix sort than bubble sort?
    - A very simple idea: an indexing structure to quickly group reads that are highly likely to overlap with each other!!
    - We need some efficient way to “bin” the reads

  7. Minimizer Is An Awesome Data Structure
    We still need a very efficient way to compute very sparse minimizers to reduce the number of bins.

  8. One day I had this private discussion with Heng Li.

  9. Sparse & HIerarchical MiniMizER (SHIMMER) Indexing
    [Figure: sequence -> k-mers over moving windows -> hash values -> level-0 minimizers -> level-1 minimizers -> level-2 minimizers. Side labels: "Digest for longer sequence", "Smaller index size", "Larger index size".]
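
The layering named on this slide can be sketched in Python as follows. This is a minimal illustration, not Peregrine's actual implementation: it assumes that level-0 minimizers are taken over windows of `w` consecutive k-mers and that each higher level re-applies the same selection over windows of `r` consecutive minimizers from the level below. The function names, the BLAKE2b hash, and the toy parameter values are all choices made for this example.

```python
import hashlib
import random


def kmer_hash(kmer: str) -> int:
    """Stable 64-bit hash of a k-mer (BLAKE2b is an arbitrary choice for this sketch)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def window_minimizers(items, w):
    """Keep the (position, hash) item with the smallest hash in each window of w items."""
    picked = []
    for start in range(max(1, len(items) - w + 1)):
        window = items[start:start + w]
        if not window:
            break
        best = min(window, key=lambda item: (item[1], item[0]))
        if not picked or picked[-1] != best:  # neighboring windows often share a minimum
            picked.append(best)
    return picked


def shimmer_levels(seq, k, w, r, levels):
    """Return [level-0, level-1, ...] minimizer lists under the assumed scheme."""
    items = [(i, kmer_hash(seq[i:i + k])) for i in range(len(seq) - k + 1)]
    result = []
    window = w
    for _ in range(levels + 1):
        items = window_minimizers(items, window)
        result.append(items)
        window = r  # higher levels reduce the previous level's minimizers
    return result


# Toy read with made-up parameters, only to show the index shrinking per level.
rng = random.Random(0)
read = "".join(rng.choice("ACGT") for _ in range(2000))
levels = shimmer_levels(read, k=8, w=12, r=3, levels=2)
for depth, mmers in enumerate(levels):
    print(f"level-{depth}: {len(mmers)} minimizers")
```

Each level is a subset of the one below it, so higher levels summarize longer stretches of sequence with fewer entries, which matches the "smaller index size" label on the slide.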

  10. Building SHIMMER Along a Read

  11. Mapping Neighboring Minimizers to Reads
    - Two reads that overlap with each other are likely to share a co-linear set of SHIMMERs.
    - Build hash-map F: (Minimizer Pair) → [Read1, Read2, …] for all reads that have the same neighboring minimizer pairs
    [Figure: Read 1 and Read 2 with their shared neighboring minimizer pairs marked; one region is labeled "Inconsistency caused by errors".]
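
The hash-map F described on this slide could look roughly like the sketch below. It assumes each read has already been reduced to an ordered list of minimizer hash values; the function name `build_pair_index` and the toy integers are invented for illustration, and strand handling (reverse complements) is ignored.

```python
from collections import defaultdict


def build_pair_index(read_minimizers):
    """Map each neighboring minimizer pair to the reads that contain it."""
    index = defaultdict(list)
    for read_id, mmers in read_minimizers.items():
        for pair in zip(mmers, mmers[1:]):
            index[pair].append(read_id)
    return index


# Toy minimizer hash values in read order (made up for illustration).
reads = {
    "read1": [11, 42, 7, 93],
    "read2": [42, 7, 93, 5],
    "read3": [8, 61, 30],
}
pair_index = build_pair_index(reads)

# Pairs seen in more than one read are candidate overlap evidence.
shared = {pair: ids for pair, ids in pair_index.items() if len(ids) > 1}
print(shared)
```

Building the map is a single pass over all reads, so no read-to-read comparison is needed to find candidates.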

  12. View Slide

  13. Grouping Reads By Neighboring Minimizer Pairs
    - Shared minimizer pair -> a group of reads that are likely to overlap with each other
    - Number of reads grouped ~ sequence coverage
    - Confirm overlaps by base-to-base or minimizer-to-minimizer alignment
    - Overlaps -> assembly string graph -> contigs
    - Complexity ~ O(n c^2) instead of O(n^2), where n is the number of reads and c is the coverage; we have c^2 ≪ n
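
To get a feel for the gap between the two growth rates, here is a back-of-the-envelope comparison in Python. It ignores all constant factors and plugs in round numbers: a 3 Gbp genome and ~30x coverage as mentioned elsewhere in the deck, plus an assumed mean read length of 13 kb (a value picked from the ~11 to 15 kb range in the abstract).

```python
genome_size = 3_000_000_000  # 3 Gbp genome
coverage = 30                # ~30x data
read_length = 13_000         # assumed mean read length (not stated on the slide)

n_reads = genome_size * coverage // read_length

quadratic = n_reads ** 2          # all-to-all comparison grows like n^2
grouped = n_reads * coverage ** 2  # grouped comparison grows like n * c^2

print(f"reads (n):        {n_reads:,}")
print(f"n^2:              {quadratic:.2e}")
print(f"n * c^2:          {grouped:.2e}")
print(f"ratio n / c^2:    {quadratic / grouped:,.0f}")
```

Under these assumptions there are roughly seven million reads while c^2 is only 900, which is the sense in which c^2 ≪ n.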

  14. ~ 30x human data

  15. Shimmer Index Used for Fast Read-to-Draft-Contig Alignment for Consensus
    - Re-use the Shimmer index for the reads
    - Build the Shimmer index for the draft contigs
    - Shared neighboring minimizer pairs: super fast mapping for consensus
    - Full genome consensus polishing with the “falcon-sense” DAG consensus module
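
One way to picture mapping by shared neighboring minimizer pairs is the voting sketch below. It is a simplified illustration and not the method used in Peregrine: the helper names `index_contigs` and `place_read`, the vote-counting rule, and the toy minimizer values are all assumptions, and the actual base-level alignment and consensus steps are omitted.

```python
from collections import Counter, defaultdict


def index_contigs(contig_minimizers):
    """Map each neighboring minimizer pair to (contig, offset) locations."""
    index = defaultdict(list)
    for contig_id, mmers in contig_minimizers.items():
        for offset, pair in enumerate(zip(mmers, mmers[1:])):
            index[pair].append((contig_id, offset))
    return index


def place_read(read_mmers, contig_index):
    """Return (contig, shared pair count) for the best-supported contig, or None."""
    votes = Counter()
    for pair in zip(read_mmers, read_mmers[1:]):
        for contig_id, _offset in contig_index.get(pair, ()):
            votes[contig_id] += 1
    return votes.most_common(1)[0] if votes else None


# Toy minimizer lists (made up for illustration).
contigs = {"ctg1": [1, 2, 3, 4, 5, 6], "ctg2": [9, 8, 7, 6]}
contig_index = index_contigs(contigs)

print(place_read([3, 4, 5], contig_index))
print(place_read([100, 200], contig_index))
```

Lookups are dictionary hits per minimizer pair, so placing a read costs time proportional to the number of pairs it carries rather than to the size of the draft assembly.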

  16. CPU Usage For A Couple Human Genomes With Different Parameter Sets
    CPU hours for different genomes and parameter sets:

    Parameter set              consensus  assembly  overlapping  indexing  seqdb building
    hg002-16-80-6-2                8.290     0.716        6.687     0.493           0.073
    hg002-16-80-4-2                7.325     0.800       11.880     0.511           0.080
    hg002-16-80-18-1               6.736     0.743        7.767     0.491           0.081
    hg002-16-80-36-1               7.023     0.641        3.646     0.494           0.075
    hg002_sequel2-16-80-6-2        4.397     0.463        4.139     0.256           0.027
    chm13-16-80-4-2                7.453     0.630        6.822     0.443           0.067
    chm13-16-80-3-2                6.297     0.682        8.897     0.452           0.061
    pgp1-16-80-4-2                 6.610     0.323        8.151     0.405           0.045
    pgp1-16-80-3-2                 7.145     0.365       10.846     0.403           0.047

    Wall clock time is from 1 to 2 hours using 24 cores on m5d.metal or r5d.12xlarge on AWS.
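
The per-step CPU hours on this slide can be summed per run with a few lines of Python. The numbers are copied from the slide; the only addition is a lower bound on wall-clock time that assumes perfect scaling across the 24 cores, which real runs will not reach.

```python
steps = ["consensus", "assembly", "overlapping", "indexing", "seqdb building"]

# CPU hours per step, in the order of `steps`, copied from the slide.
cpu_hours = {
    "hg002-16-80-6-2": [8.290, 0.716, 6.687, 0.493, 0.073],
    "hg002-16-80-4-2": [7.325, 0.800, 11.880, 0.511, 0.080],
    "hg002-16-80-18-1": [6.736, 0.743, 7.767, 0.491, 0.081],
    "hg002-16-80-36-1": [7.023, 0.641, 3.646, 0.494, 0.075],
    "hg002_sequel2-16-80-6-2": [4.397, 0.463, 4.139, 0.256, 0.027],
    "chm13-16-80-4-2": [7.453, 0.630, 6.822, 0.443, 0.067],
    "chm13-16-80-3-2": [6.297, 0.682, 8.897, 0.452, 0.061],
    "pgp1-16-80-4-2": [6.610, 0.323, 8.151, 0.405, 0.045],
    "pgp1-16-80-3-2": [7.145, 0.365, 10.846, 0.403, 0.047],
}

cores = 24  # core count quoted on the slide
totals = {name: round(sum(hours), 3) for name, hours in cpu_hours.items()}

for name, total in totals.items():
    # total / cores is an idealized bound, assuming perfect parallel scaling
    print(f"{name:<25}{total:>8.3f} CPU h  (>= {total / cores:.2f} h wall clock)")
```

In every run, consensus and overlapping account for most of the CPU time, while indexing and sequence database building are comparatively small.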

  17. Contiguity
    N50 by parameter set (values paired with labels in the order they appear on the slide):

    Parameter set                N50 (bp)
    hg002-16-80-6-2            27,848,727
    hg002-16-80-4-2            33,364,927
    hg002-16-80-18-1           26,459,768
    hg002-16-80-36-1            6,746,698
    hg002_sequel2-16-80-6-2    21,936,975
    chm13-16-80-4-2            29,260,433
    chm13-16-80-3-2            33,307,555
    pgp1-16-80-4-2             27,316,706
    pgp1-16-80-3-2             28,222,972
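
For readers unfamiliar with the metric, N50 is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A small Python sketch of the standard definition follows; the contig lengths in the example are made up.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50() needs at least one contig length")


# Toy contig lengths (made up): total is 32, and the two longest contigs reach 16.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
```

A higher N50 means more of the genome sits in long contigs, which is why it serves as the contiguity measure on this slide.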

  18. Accuracy Assessment
    [Figure: "Accuracy Evaluation for CHM13 Assembly". Concordance (Phred QV scale, axis 0 to 60) of draft contigs and polished contigs against Homo sapiens BAC clone VMRC59 sequences.]
    BAC accessions on the axis: AC270115.1, AC270117.1, AC270118.1, AC270119.1, AC270120.1, AC270122.1, AC270131.1, AC270132.1, AC270133.1, AC270134.1, AC270135.1, AC270136.1, AC270137.1, AC270145.1, AC270146.1, AC270238.1, AC275285.1, AC275291.1, AC275297.1, AC275298.1, AC275300.1, AC275301.1, AC275304.1, AC275305.1, AC278482.1, AC278741.1, AC278929.1, AC279018.1, AC279070.1
  19. More Repetitive Genomes
    Chanos chanos (milkfish); Salmo trutta (brown trout)
    - N50: 800 kb; assembly size: 2.3G; assembly time: 70 mins
    - N50: 2.4 Mb; assembly size: 705M; assembly time: 44 mins
    ?

  20. Executable with Docker
    $ find /wd/pgp-1-data/ -name "*.fasta" | sort > pgp-1-seqdata.lst
    $ docker run -it -v /wd:/wd cschin/peregrine:0.1.5.0 \
    asm /wd/pgp-1-seqdata.lst 24 24 24 24 24 24 24 24 24 \
    --with-consensus \
    --shimmer-r 3 --best_n_ovlp 8 \
    --output /wd/pgp-1-asm-r3-pg0.1.5.0
    https://github.com/cschin/Peregrine/blob/master/README.md
    Follow @infoecho, @BDSFoundation for updates

  21. New “Short” Reads are Helpful to Correct Super Long Reads
    - PacBio’s CCS is useful for correcting super long Oxford Nanopore reads
    - Example on the right shows an 800k corrected read aligned at 99.8% accuracy
    - The Shimmer index will help to assemble such super long, high-accuracy reads super fast

  22. Summary & Future Development
    - As we get into the pan-genomics era for human and other larger genomes, relieving the computational burden of assembly is crucial for speeding up research.
    - Peregrine’s overlap time complexity is O(n c^2) instead of O(n^2), making it 10 to 100 times faster than any existing end-to-end assembler on good-quality data
    - The consensus module is capable of generating high-accuracy contigs (QV > 45) without any slow signal-level polishing step
    - With current sequencing technologies, it is possible to assemble a 3 Gbp genome in 100 minutes, and it will become even faster in the future. We can focus more on solving biological problems than computational ones.
    - Future work:
      - Diploid resolution, perhaps porting the FALCON-Unzip module to Peregrine
      - Support for super long read assembly
      - Pan-genomics analysis using the SHIMMER index for fast whole-genome alignments

  23. Acknowledgement
    - Mike Hunkapiller (PacBio), Paul Peluso (PacBio) for generating and providing free hg002 data
    - Glennis Logsdon, Mitchell Vollger, Eichler Lab for CHM13 data
    - Church Lab & Shilpa Garg for PGP-1 data
    - Chris Dunn for PypeFlow update
    - Arkarachai Fungtammasan (DNAnexus) for assembly evaluation
    - Mike Schatz and Heng Li for discussion
    Thanks For Your Attention

  24. @BDSFoundation
