HiCanu: Resolving repeats and haplotypes

Sergey Koren
January 14, 2020

A new approach for assembly of HiFi accurate long reads


Transcript

  1. What are CCS reads

    [Figure: histogram of the number of reads by read quality (Q20 to Q60), with 99.9% and 99.99% marked.]
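The quality axis on this slide uses Phred scores, where Q relates to per-base accuracy as 1 - 10^(-Q/10), so Q30 corresponds to 99.9% and Q40 to 99.99%. A minimal sketch of that conversion follows; the function names are illustrative and not taken from the talk.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score Q to per-base accuracy: 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a per-base accuracy in [0, 1) to a Phred quality score."""
    return -10.0 * math.log10(1.0 - accuracy)


if __name__ == "__main__":
    # Print the accuracy implied by a few of the Q values on the slide's axis.
    for q in (20, 30, 40, 50, 60):
        print(f"Q{q}: {phred_to_accuracy(q):.6%}")
```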
  2. <1% isn’t good enough

    [Figure: NG50 (Mbp) against read length (kbp) for Peregrine assemblies, with series labeled >0.001% diverged and >0.5% diverged.]
    Chin et al. Human Genome Assembly in 100 Minutes. 2019
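For readers unfamiliar with the NG50 axis: NG50 is the contig length at which contigs of that length or longer add up to at least half of the assumed genome size. The sketch below follows that standard definition; the function name and the choice to return 0 when the assembly covers less than half the genome are assumptions made for illustration.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly.

    Contigs are visited from longest to shortest; NG50 is the length of the
    contig at which the running total first reaches half of genome_size.
    Returns 0 if the contigs together cover less than half the genome.
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


if __name__ == "__main__":
    # Toy example: lengths in arbitrary units against a genome of size 300.
    print(ng50([80, 70, 50, 40, 30, 20], 300))
```

Unlike N50, which uses the total assembly length, NG50 uses a fixed genome size, so assemblies of different total length can be compared on the same scale.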
  3. Perfect overlaps

    [Figure: overlap identity (axis from 98.0 to 100.0) for Original, RLE, and RLE + Correction reads.]
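The slide's "RLE" label is read here as run-length encoding of homopolymer runs, which is an assumption about the abbreviation. Under that reading, each run of identical bases collapses to one base, so two reads that differ only in homopolymer length become identical. A minimal sketch, with illustrative function names:

```python
from itertools import groupby


def rle_encode(seq):
    """Collapse each homopolymer run to a single base.

    Returns the compressed sequence and the list of original run lengths.
    """
    bases = []
    run_lengths = []
    for base, run in groupby(seq):
        bases.append(base)
        run_lengths.append(sum(1 for _ in run))
    return "".join(bases), run_lengths


def rle_decode(compressed, run_lengths):
    """Rebuild the original sequence from the bases and their run lengths."""
    return "".join(base * n for base, n in zip(compressed, run_lengths))


if __name__ == "__main__":
    # Two reads that differ only in the length of a T homopolymer.
    read_a, read_b = "GATTTACA", "GATTACA"
    print(rle_encode(read_a)[0] == rle_encode(read_b)[0])
```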
  4. Near-optimal repeat resolution

    [Figure: NG50 (Mbp) against read length (kbp) for Peregrine assemblies and HiCanu assemblies, with series labeled >0.001% diverged and >0.5% diverged.]
  5. Errors vs reference

    [Figure: structural errors vs GRCh38 by Quast for CHM13, HG0733, and NA12878, comparing Peregrine, HiCanu, Nanopore, and 10XG.]
    Gurevich et al. QUAST: quality assessment tool for genome assemblies. 2013
  6. Segmental duplication BAC resolution

    [Figure: fraction of BAC bases correctly resolved (0% to 100%) for CHM13, HG0733, and NA12878, comparing Peregrine, HiCanu, Nanopore, and 10XG. Annotations: 30-fold 20 kbp; 30-fold 100+ kbp.]
  7. Remaining Collapsed Bases

    [Figure: Mbp with elevated coverage for CHM13 10kb and CHM13 20kb, comparing Peregrine, HiCanu, and Nanopore. Bar annotations, in slide order: Q55, Q48, Q27, Q58, Q47, Q27.]
    Vollger et al. Long-read sequence and assembly of segmental duplications. 2019
  8. HiFi and ONT UL are complementary

    [Figure: chromosomes 1 to 22 and X, shown for 30X CCS and 80X UL.]
  9. Haplotype separation improved too

    • Canu HiFi: read N50 10 kbp, total block BP 4.17 Gbp, phase block NG50 362 kbp, switch 0.03%
    • FALCON-Unzip CLR: read N50 17 kbp, total block BP 3.64 Gbp, phase block NG50 229 kbp, switch 0.15%
    • Supernova 10XG: read N50 95 kbp, total block BP 4.64 Gbp, phase block NG50 560 kbp, switch 0.16%
  10. Conclusions

    • Applicable to any assembler
    • Approaching optimal repeat resolution
    • Your mileage will vary depending on repeat type
    • Centromeres still a challenge
    • Evidence of systematic error/coverage gaps
    • Human-level heterozygosity resolved
    • No polishing needed, consensus >Q50
    • Data collection is the bottleneck
    • 30 hr/cell, 5 days for a 3g mammal
    • CCS consensus, ≈6k cpu hrs, ≈ 2 days on 128 cores
    • Canu asm, ≈2k cpu hrs, ≈ 0.5 day on 128 cores (<8 hrs on a cluster)
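The wall-clock figures above follow roughly from dividing CPU hours by core count. The sketch below does that arithmetic under the assumption of perfect parallel scaling, which real pipelines rarely achieve; the function name is illustrative.

```python
def wall_clock_days(cpu_hours: float, cores: int) -> float:
    """Estimate wall-clock days, assuming the work scales perfectly across cores."""
    return cpu_hours / cores / 24.0


if __name__ == "__main__":
    # CPU-hour figures quoted on the Conclusions slide, on 128 cores.
    print(f"CCS consensus: {wall_clock_days(6000, 128):.2f} days")
    print(f"Canu assembly: {wall_clock_days(2000, 128):.2f} days")
```

With these inputs the idealized estimate comes to just under 2 days for CCS consensus and about two thirds of a day for the Canu assembly, in line with the rounded values on the slide.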
  11. One more thing

    • Works on metagenomes: sheep rumen sample
    • 126 complete genomes (15% contaminated, 107 good)
    • Vs 61 for Canu 1.9 (5% contaminated, 58 good)
    • Vs 55 for Peregrine (11% contaminated, 49 good)
    • 41/44 circular marked by Canu complete and good
    • Vs 10/11 for Canu 1.9
    • More tweaks to come, lowering contamination rate

    Commands:
    Flye 2.5-g315122d: --meta --threads 56 --out-dir ccs_flye --genome-size 5m (1% error rate)
    Canu: genomeSize=3.1g 'batOptions=-eg 0.0 -sb 0.001 -dg 0 -db 3 -dr 0 -ca 2000 -cp 200'
    Canu 1.9: genomeSize=3.1g correctedErrorRate=0.015
    Peregrine: 24 24 24 24 24 24 24 24 24 --with-consensus --shimmer-r 3 --best_n_ovlp 8 --output asm
  12. Acknowledgements

    NHGRI
    • Sergey Nurk
    • Arang Rhie
    • Brian Walenz
    • Adam Phillippy
    • Evan E. Eichler
    • Mitchell Vollger
    • Glennis Logsdon
    • Robert Grothe
    • Jonas Korlach
    • Zev Kronenberg
    • Paul Peluso
    • David Rank
    • Kevin Fengler
    • Sung Bong Shin
    • Tim Smith