HiCanu: Resolving repeats and haplotypes

Sergey Koren
January 14, 2020

A new approach for assembly of HiFi accurate long reads

Transcript

  1. Sergey Koren
    Staff Scientist, Genome Informatics Section, NHGRI
    Resolving repeats and haplotypes
    @sergekoren

  2. What are CCS reads
    [Figure: histogram of number of reads (0 to 105,000) by read quality (Q20 to Q60); 99.9% and 99.99% accuracy marked]
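
The quality axis on this slide uses Phred values, and the 99.9% and 99.99% marks correspond to Q30 and Q40 under the standard Phred definition Q = -10 * log10(error probability). A minimal sketch of that conversion follows; the function names are illustrative and do not come from the slides.

```python
import math

def phred_to_accuracy(q):
    """Convert a Phred quality value to per-base accuracy (0..1)."""
    return 1.0 - 10 ** (-q / 10)

def accuracy_to_phred(accuracy):
    """Convert per-base accuracy (0..1, strictly below 1) to a Phred quality value."""
    return -10 * math.log10(1.0 - accuracy)

if __name__ == "__main__":
    for q in (20, 30, 40, 50):
        print(f"Q{q}: {phred_to_accuracy(q) * 100:.3f}% accurate")
```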

  3. Highly accurate overlaps, below 1%
    [Figure: overlap identity (98.0 to 100.0%), Original vs RLE]

  4. <1% isn’t good enough
    [Figure: Peregrine assemblies, two panels (>0.001% diverged, >0.5% diverged); axes: read length (kbp, 5 to 100) and NG50 (Mbp, 0 to 84)]
    Chin et al. Human Genome Assembly in 100 Minutes. 2019
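
The plots on this slide report NG50. Under the usual definition, NG50 is the length of the contig at which the cumulative length of contigs, sorted longest first, reaches half of the estimated genome size. The sketch below assumes that definition; the function name and the toy lengths are illustrative, not values from the talk.

```python
def ng50(contig_lengths, genome_size):
    """Return NG50: the contig length at which the running total of
    contigs (longest first) reaches at least half of genome_size.
    Returns 0 if the assembly covers less than half the genome."""
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total * 2 >= genome_size:
            return length
    return 0

if __name__ == "__main__":
    # Toy example: five contigs against a genome size of 200.
    print(ng50([50, 40, 30, 20, 10], 200))
```

Unlike N50, which uses the total assembly length, NG50 uses the genome size, so assemblies of different total size stay comparable.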

  5. First trick: run-length encoding
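
The transcript keeps only the title of this slide. As a rough illustration of run-length encoding applied to DNA, the sketch below collapses each homopolymer run to a single base and records the run lengths separately. Whether and how HiCanu stores the run lengths is not stated here, so treat the function name and return shape as assumptions.

```python
from itertools import groupby

def run_length_encode(seq):
    """Collapse homopolymer runs: return (compressed sequence, run lengths)."""
    bases = []
    runs = []
    for base, group in groupby(seq):
        bases.append(base)
        runs.append(sum(1 for _ in group))
    return "".join(bases), runs

if __name__ == "__main__":
    # Two reads that differ only in homopolymer lengths.
    read_a = "GATTTACCA"
    read_b = "GATTACCCA"
    print(run_length_encode(read_a)[0] == run_length_encode(read_b)[0])
```

Two reads that differ only in the length of a homopolymer run compress to the same string, so that class of difference no longer lowers the measured overlap identity.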

  6. Second trick: fix remaining errors

  7. Last trick: ignore systematic errors

  8. Perfect overlaps
    [Figure: overlap identity (98.0 to 100.0%) for Original, RLE, and RLE + Correction]

  9. Near-optimal repeat resolution
    [Figure: Peregrine assemblies vs HiCanu assemblies, two panels (>0.001% diverged, >0.5% diverged); axes: read length (kbp, 5 to 100) and NG50 (Mbp, 0 to 84)]

  10. Errors vs reference
    [Figure: bar chart of structural errors vs GRCh38 by QUAST (0 to 600) for CHM13, HG0733, and NA12878; series: Peregrine, HiCanu, Nanopore, 10XG]
    Gurevich et al. QUAST: quality assessment tool for genome assemblies. 2013

  11. Segmental duplication BAC resolution
    [Figure: bar chart of the fraction of BAC bases correctly resolved (0 to 100%) for CHM13, HG0733, and NA12878; series: Peregrine, HiCanu, Nanopore, 10XG; annotations: 30-fold 20 kbp, 30-fold 100+ kbp]

  12. Remaining Collapsed Bases
    [Figure: bar chart of Mbp with elevated coverage (0 to 300) for CHM13 10kb and CHM13 20kb; series: Peregrine, HiCanu, Nanopore; bar labels as extracted: Q55, Q48, Q27, Q58, Q47, Q27]
    Vollger et al. Long-read sequence and assembly of segmental duplications. 2019

  13. HiFi and ONT UL are complementary
    [Figure: two chromosome panels (1 to 22, X); labels: 30X CCS, 80X UL]

  14. Haplotype separation improved too
    Canu HiFi
      Read N50: 10 kbp
      Total block BP: 4.17 Gbp
      Phase block NG50: 362 kbp
      Switch: 0.03%
    FALCON-Unzip CLR
      Read N50: 17 kbp
      Total block BP: 3.64 Gbp
      Phase block NG50: 229 kbp
      Switch: 0.15%
    Supernova 10XG
      Read N50: 95 kbp
      Total block BP: 4.64 Gbp
      Phase block NG50: 560 kbp
      Switch: 0.16%

  15. Conclusions
    • Applicable to any assembler
    • Approaching optimal repeat resolution
    • Your mileage will vary depending on repeat type
    • Centromeres still a challenge
    • Evidence of systematic error/coverage gaps
    • Human-level heterozygosity resolved
    • No polishing needed, consensus >Q50
    • Data collection is the bottleneck
    • 30 hr/cell, 5 days for a 3 Gbp mammalian genome
    • CCS consensus: ≈6k CPU hrs, ≈2 days on 128 cores
    • Canu assembly: ≈2k CPU hrs, ≈0.5 day on 128 cores (<8 hrs on a cluster)
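
The compute figures in the bullets can be sanity-checked by dividing CPU hours by core count. The sketch below assumes perfect parallel scaling, which real jobs rarely achieve, and the function name is illustrative.

```python
def wall_clock_hours(cpu_hours, cores):
    """Idealized wall-clock time, assuming work divides evenly across cores."""
    return cpu_hours / cores

if __name__ == "__main__":
    # CPU-hour estimates quoted on the slide, on 128 cores.
    for label, cpu_hours in (("CCS consensus", 6000), ("Canu assembly", 2000)):
        hours = wall_clock_hours(cpu_hours, 128)
        print(f"{label}: {hours:.1f} h = {hours / 24:.2f} days")
```

Under that assumption the CCS step comes to about 46.9 hours (roughly 2 days) and the assembly to about 15.6 hours (roughly 0.65 day), in line with the rounded figures quoted on the slide.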

  16. One more thing
    • Works on metagenomes: sheep rumen sample
    • 126 complete genomes (15% contaminated, 107 good)
    • Vs 61 for Canu 1.9 (5% contaminated, 58 good)
    • Vs 55 for Peregrine (11% contaminated, 49 good)
    • 41/44 contigs marked circular by Canu are complete and good
    • Vs 10/11 for Canu 1.9
    • More tweaks to come, lowering contamination rate
    Flye 2.5-g315122d: --meta --threads 56 --out-dir ccs_flye --genome-size 5m (1% error rate)
    Canu: genomeSize=3.1g 'batOptions=-eg 0.0 -sb 0.001 -dg 0 -db 3 -dr 0 -ca 2000 -cp 200'
    Canu 1.9: genomeSize=3.1g correctedErrorRate=0.015
    Peregrine: 24 24 24 24 24 24 24 24 24 --with-consensus --shimmer-r 3 --best_n_ovlp 8 --output asm
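
The "good" counts in the bullets are consistent with taking the complete genomes and removing the contaminated fraction. The check below assumes that reading, which the slide does not state explicitly; the function name is illustrative.

```python
def good_genomes(complete, contaminated_fraction):
    """Complete genomes left after removing the contaminated fraction,
    rounded to the nearest whole genome."""
    return round(complete * (1 - contaminated_fraction))

if __name__ == "__main__":
    # (complete genomes, contaminated fraction) as quoted on the slide.
    for name, complete, frac in (("HiCanu", 126, 0.15),
                                 ("Canu 1.9", 61, 0.05),
                                 ("Peregrine", 55, 0.11)):
        print(name, good_genomes(complete, frac))
```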

  17. Acknowledgements
    NHGRI
    • Sergey Nurk
    • Arang Rhie
    • Brian Walenz
    • Adam Phillippy
    • Evan E. Eichler
    • Mitchell Vollger
    • Glennis Logsdon
    • Robert Grothe
    • Jonas Korlach
    • Zev Kronenberg
    • Paul Peluso
    • David Rank
    • Kevin Fengler
    • Sung Bong Shin
    • Tim Smith
