HiCanu: Resolving repeats and haplotypes

Sergey Koren
January 14, 2020

A new approach for assembly of HiFi accurate long reads

Transcript

  1. Resolving repeats and haplotypes
    Sergey Koren (@sergekoren), Staff Scientist, Genome Informatics Section, NHGRI
  2. What are CCS reads
    [Figure: histogram of number of reads (0 to 105,000) by read quality (Q20 to Q60), with 99.9% and 99.99% accuracy marked]
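
The Q values and the 99.9% / 99.99% marks on this slide are related by the standard Phred scale, Q = -10 * log10(error probability). The sketch below shows that conversion; the function names are illustrative and do not come from the talk.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Expected per-base accuracy for a Phred quality score q."""
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Phred quality score for a per-base accuracy strictly below 1.0."""
    return -10.0 * math.log10(1.0 - accuracy)


if __name__ == "__main__":
    # Q20, Q30 and Q40 correspond to 99%, 99.9% and 99.99% accuracy.
    for q in (20, 30, 40):
        print(f"Q{q}: {phred_to_accuracy(q):.4%}")
```

On this scale, each step of 10 in Q is a tenfold drop in the expected error rate.
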
  3. Highly accurate overlaps, below 1%
    [Figure: overlap identity (idy), 98.0% to 100.0%, for Original and RLE reads]
  4. <1% isn’t good enough
    [Figure: NG50 (Mbp) versus read length (kbp) for Peregrine assemblies, with series labeled >0.001% diverged and >0.5% diverged]
    Chin et al. Human Genome Assembly in 100 Minutes. 2019
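
NG50, the assembly metric plotted on this slide, is the length of the contig at which the contigs, taken from longest to shortest, first cover half of the estimated genome size. A minimal sketch of the standard definition follows; the function name and the toy numbers are illustrative, and returning 0 when the assembly covers less than half the genome is a convention assumed here.

```python
def ng50(contig_lengths, genome_size):
    """Return NG50: the length of the contig that brings the running total
    of contig lengths (longest first) to at least half of genome_size.

    Returns 0 if the contigs together cover less than half the genome.
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


if __name__ == "__main__":
    # Toy example: contigs of 50, 40, 30, 20 and 10 units, genome of 200 units.
    print(ng50([50, 40, 30, 20, 10], 200))
```

Unlike N50, which uses the total assembly length, NG50 uses the genome size, so assemblies of different total size can be compared on the same scale.
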
  5. First trick: run-length encoding
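
The slide names run-length encoding without detail. A minimal sketch of the idea, assuming it means collapsing each homopolymer run to a single base so that reads differing only in homopolymer length become identical; the function names and sequences are illustrative and are not taken from HiCanu's code.

```python
from itertools import groupby


def run_length_encode(seq: str):
    """Return (base, run length) pairs for each run of identical bases."""
    return [(base, len(list(run))) for base, run in groupby(seq)]


def homopolymer_compress(seq: str) -> str:
    """Collapse every run of identical bases to a single base."""
    return "".join(base for base, _ in groupby(seq))


if __name__ == "__main__":
    read_a = "GATTTACAAAG"
    read_b = "GATTACAAAAG"  # same as read_a except for homopolymer lengths
    print(run_length_encode(read_a))
    print(homopolymer_compress(read_a) == homopolymer_compress(read_b))
```

After compression, an overlap computed between the two toy reads no longer counts the homopolymer-length differences as mismatches.
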

  6. Second trick: fix remaining errors

  7. Last trick: ignore systematic errors

  8. Perfect overlaps
    [Figure: overlap identity (idy), 98.0% to 100.0%, for Original, RLE, and RLE + Correction reads]
  9. Near-optimal repeat resolution
    [Figure: NG50 (Mbp) versus read length (kbp) for Peregrine assemblies and HiCanu assemblies, with series labeled >0.001% diverged and >0.5% diverged]
  10. Errors vs reference
    [Figure: structural errors vs GRCh38 by QUAST (0 to 600) for CHM13, HG0733, and NA12878; series: Peregrine, HiCanu, Nanopore, 10XG]
    Gurevich et al. QUAST: quality assessment tool for genome assemblies. 2013
  11. Segmental duplication BAC resolution
    [Figure: fraction of BAC bases correctly resolved (0% to 100%) for CHM13, HG0733, and NA12878; series: Peregrine, HiCanu, Nanopore, 10XG; annotations: 30-fold 20 kbp, 30-fold 100+ kbp]
  12. Remaining Collapsed Bases
    [Figure: Mbp with elevated coverage (0 to 300) for CHM13 10kb and CHM13 20kb; series: Peregrine, HiCanu, Nanopore; bar labels: Q55, Q48, Q27, Q58, Q47, Q27]
    Vollger et al. Long-read sequence and assembly of segmental duplications. 2019
  13. HiFi and ONT UL are complementary
    [Figure: two panels covering chromosomes 1 to 22 and X, labeled 30X CCS and 80X UL]
  14. Haplotype separation improved too
    • Canu HiFi: read N50 10 kbp, total block bp 4.17 Gbp, phase block NG50 362 kbp, switch 0.03%
    • FALCON-Unzip CLR: read N50 17 kbp, total block bp 3.64 Gbp, phase block NG50 229 kbp, switch 0.15%
    • Supernova 10XG: read N50 95 kbp, total block bp 4.64 Gbp, phase block NG50 560 kbp, switch 0.16%
  15. Conclusions
    • Applicable to any assembler
    • Approaching optimal repeat resolution
    • Your mileage will vary depending on repeat type
    • Centromeres still a challenge
    • Evidence of systematic error/coverage gaps
    • Human-level heterozygosity resolved
    • No polishing needed, consensus >Q50
    • Data collection is the bottleneck
    • 30 hr/cell, 5 days for a 3 Gbp mammal
    • CCS consensus, ≈6k CPU hrs, ≈2 days on 128 cores
    • Canu asm, ≈2k CPU hrs, ≈0.5 day on 128 cores (<8 hrs on a cluster)
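
The wall-clock figures on this slide follow from dividing CPU hours by core count. A rough sketch of that arithmetic, assuming perfectly even scaling across cores (an idealization; the function name is illustrative):

```python
def wall_clock_days(cpu_hours: float, cores: int) -> float:
    """Estimate wall-clock days, assuming the work splits evenly over cores."""
    return cpu_hours / cores / 24.0


if __name__ == "__main__":
    # CPU-hour figures quoted on the slide, both on 128 cores.
    print(f"CCS consensus: {wall_clock_days(6000, 128):.2f} days")
    print(f"Canu assembly: {wall_clock_days(2000, 128):.2f} days")
```

Under that assumption the estimates land close to the rounded values on the slide; real runs will differ with I/O, scheduling, and serial stages.
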
  16. One more thing
    • Works on metagenomes: sheep rumen sample
    • 126 complete genomes (15% contaminated, 107 good)
    • Vs 61 for Canu 1.9 (5% contaminated, 58 good)
    • Vs 55 for Peregrine (11% contaminated, 49 good)
    • 41/44 contigs marked circular by Canu are complete and good
    • Vs 10/11 for Canu 1.9
    • More tweaks to come, lowering contamination rate
    Parameters used:
    Flye 2.5-g315122d: --meta --threads 56 --out-dir ccs_flye --genome-size 5m (1% error rate)
    Canu: genomeSize=3.1g 'batOptions=-eg 0.0 -sb 0.001 -dg 0 -db 3 -dr 0 -ca 2000 -cp 200'
    Canu 1.9: genomeSize=3.1g correctedErrorRate=0.015
    Peregrine: 24 24 24 24 24 24 24 24 24 --with-consensus --shimmer-r 3 --best_n_ovlp 8 --output asm
  17. Acknowledgements
    NHGRI • Sergey Nurk • Arang Rhie • Brian Walenz • Adam Phillippy • Evan E. Eichler • Mitchell Vollger • Glennis Logsdon • Robert Grothe • Jonas Korlach • Zev Kronenberg • Paul Peluso • David Rank • Kevin Fengler • Sung Bong Shin • Tim Smith