Scrubbing to remove artifacts
✦ Repeat/Haplotype separation based on heterogeneity
✦ Repeat detection and modeling

HiFi CCS protocol:
~3x loss in throughput and cost over raw
But each insert wrapped ~8x ⟹ ~0.2% error rate

Which is better?
- 15Kbp reads at 99.8%
- 50Kbp reads at 90%
- some combination of both?

• Longer or more accurate reads (CCS)
• Better Algorithms
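To make the "which is better?" trade-off concrete, here is a minimal sketch that compares the two read types by expected errors per read and by the chance that a short window is error-free. It assumes errors are independent and uniformly distributed along a read, which real reads do not strictly satisfy. The 100bp window size and the function names are illustrative choices, not from the slides.

```python
# Rough comparison of the two read types named on the slide.
# Assumption: errors are independent and uniform along the read.

def expected_errors(read_len_bp, accuracy):
    """Expected number of erroneous bases in one read."""
    return read_len_bp * (1.0 - accuracy)

def prob_error_free(window_bp, accuracy):
    """Probability that a window of window_bp bases has no error at all."""
    return accuracy ** window_bp

READ_TYPES = [
    ("15Kbp at 99.8%", 15_000, 0.998),
    ("50Kbp at 90%", 50_000, 0.90),
]

WINDOW_BP = 100  # illustrative window size (assumption)

for name, length_bp, accuracy in READ_TYPES:
    errs = expected_errors(length_bp, accuracy)
    clean = prob_error_free(WINDOW_BP, accuracy)
    print(f"{name}: ~{errs:.0f} expected errors per read, "
          f"P({WINDOW_BP}bp window error-free) = {clean:.3g}")
```

Under this model the 15Kbp read carries about 30 expected errors and the 50Kbp read about 5,000. That is why small differences between repeat copies or haplotypes remain visible only in the accurate reads.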
… microhet’s could get you through
(should be easy(er))

All the power of long reads has thus far been due to this
⟹ longer is better

This has not been done
Requires ability to id. microhet’s
⟹ more accurate is better
Chimers
No Adaptamers
No Low Q dropouts
99.999% uniformly
Haplotype Phased

Perfect PB reads with Scrubbing
Solves:
Artifacts
Haplotypes
Low Copy Repeats (≤ 5)

Scrubber
Long Read Assembler

Task is easier, but still necessary:
99.8% average
0.5% of reads are chimers
0.02% of reads have no adaptamer
15% of reads have a low Q “panel”: 100bp with 5% or more error
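The low Q "panel" criterion above (100bp with 5% or more error) can be sketched as a sliding-window scan. This is only an illustration of the definition, not the scrubber's actual method. It assumes a per-base 0/1 error flag is already available, for example from an alignment pile or from quality values. The function name `low_q_panels` and its parameters are hypothetical.

```python
# Sliding-window scan for low-quality "panels" in one read.
# Assumption: errors[i] is 1 if base i is judged erroneous, else 0.
# How those flags are derived is outside this sketch.

def low_q_panels(errors, window=100, min_errors=5):
    """Return merged half-open [start, end) intervals covered by at least
    one window of `window` bases containing >= min_errors flagged bases
    (5 of 100 corresponds to the 5% threshold)."""
    n = len(errors)
    if n < window:
        return []
    panels = []
    count = sum(errors[:window])  # errors in the first window
    for start in range(n - window + 1):
        if start > 0:
            # Slide by one base: add the entering base, drop the leaving one.
            count += errors[start + window - 1] - errors[start - 1]
        if count >= min_errors:
            end = start + window
            if panels and start <= panels[-1][1]:
                panels[-1][1] = end  # overlaps or touches the previous panel
            else:
                panels.append([start, end])
    return [tuple(p) for p in panels]

if __name__ == "__main__":
    read = [0] * 1000
    for pos in (400, 410, 420, 430, 440):
        read[pos] = 1  # five flagged bases within 41bp
    print(low_q_panels(read))
```

Merging overlapping windows reports one interval per dropout rather than one per window position, which is the more convenient form if the interval is later trimmed or patched.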
ever 10 mers (at random)

Compute Time Reductions
• 99.999% Sensitivity (Alignment between ≥ 1000bp with 0.5% error in each read) (R. Durbin) (and uses ≤ 8Gbp memory)
• Can use 1Gbp blocks vs. 1/4Gbp (vs 2000+ for raw reads)
• 90 CPU hrs for 30X HG002 (on this laptop)
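As a rough illustration of why larger blocks matter, the sketch below counts how many blocks a 30X data set splits into at each block size, and how many block-against-block comparisons an all-against-all overlap search would then need. Two assumptions are not from the slides: a genome size of about 3.1Gbp for HG002, and that every block is compared against every block including itself. The function name `block_stats` is hypothetical.

```python
# Block counts for an all-against-all overlap search.
# Assumptions (not from the slides): genome size of about 3.1 Gbp, and
# every block is compared against every block, itself included.

GENOME_BP = 3_100_000_000  # assumed approximate human genome size
COVERAGE = 30              # 30X, as on the slide

def block_stats(total_bp, block_bp):
    """Return (number of blocks, number of unordered block pairs)."""
    blocks = -(-total_bp // block_bp)   # integer ceiling division
    pairs = blocks * (blocks + 1) // 2  # includes block-vs-itself
    return blocks, pairs

total_bp = GENOME_BP * COVERAGE

for label, block_bp in [("1Gbp", 1_000_000_000), ("1/4Gbp", 250_000_000)]:
    blocks, pairs = block_stats(total_bp, block_bp)
    print(f"{label} blocks: {blocks} blocks, {pairs} block comparisons")
```

Under these assumptions the 1Gbp blocks give 93 blocks and 4,371 comparisons, against 372 blocks and 69,378 comparisons for 1/4Gbp blocks. The total work per comparison grows with block size, so the saving is mainly in job count and per-job overhead.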