Using PacBio Circular Consensus Sequencing (CCS) for Highly Accurate Assemblies

Gene Myers
Chair of Systems Biology
MPI for Cell Biology and Genetics, Dresden, DE
MPI-CBG, CSBD

HQ Genomics Today & Tomorrow

Assume a 1 Gbp genome.
• 60X PacBio long reads
• Scaffolding technologies: 50X Illumina in 10X read clouds, Bionano restriction maps, 50X Illumina in Hi-C read pairs
• Costs shown: 10K EU and 5K EU (2017); 1.2K EU and 3.5K EU (2019/2020)
• PB + Bionano on the 6 Bats Project: 10 Mbp contig N50, 100 Mbp scaffold N50
• But favor PB + Hi-C
• PB only or PB + 1 would be a significant savings
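The contig and scaffold N50 figures quoted for the bat genomes follow the usual definition: the length L such that pieces of length L or longer cover at least half of the assembled bases. A minimal sketch of that calculation, assuming a plain list of lengths (the function name `n50` and the toy numbers are illustrative, not taken from the talk):

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths (in bp)."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    # Walk from the longest piece down until half of all bases are covered.
    for length in ordered:
        running += length
        if running >= half:
            return length


# Toy example: total is 150, so the walk stops once 75 bases are covered.
toy_contigs = [50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 40
```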

HQ Genomes Tomorrow: at least 2 ways to improve:
• Longer or more accurate reads (CCS)
• Better algorithms
  ✦ Scrubbing to remove artifacts
  ✦ Repeat/haplotype separation based on heterogeneity
  ✦ Repeat detection and modeling

HiFi CCS protocol: ~3x loss in throughput and cost over raw, but each insert is wrapped ~8x ⟹ ~0.2% error rate.

Which is better?
- 15 Kbp reads at 99.8%
- 50 Kbp reads at 90%
- some combination of both?
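Why wrapping an insert several times lowers the error rate can be illustrated with a deliberately simple toy model. It assumes each pass gets a given base wrong independently with the same probability, that all wrong passes agree with each other (a pessimistic choice), and that the consensus is a plain majority vote with ties broken by a coin flip. Real CCS consensus is not computed this way and real errors are mostly indels, so the ~0.2% figure on the slide is an observed value, not something this sketch derives. The function name is hypothetical.

```python
from math import comb


def majority_vote_error(passes, per_pass_error):
    """Chance that a majority vote over `passes` independent passes is wrong.

    Toy model: every pass errs with probability `per_pass_error`, all wrong
    passes agree, and an exact tie is resolved by a fair coin flip.
    """
    p = per_pass_error
    total = 0.0
    for wrong in range(passes + 1):
        prob = comb(passes, wrong) * p**wrong * (1 - p) ** (passes - wrong)
        if 2 * wrong > passes:
            total += prob        # wrong calls outvote correct ones
        elif 2 * wrong == passes:
            total += prob / 2    # tie: wrong half of the time
    return total


# Compare a single 10%-error pass with a consensus over 8 such passes.
for n in (1, 3, 8):
    print(n, majority_vote_error(n, 0.10))
```

Under these assumptions the consensus error falls quickly with the number of passes, which is the qualitative point behind trading read length and throughput for accuracy.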

Conceptual Effect of Read Accuracy on Assembly

String Graph: The “Reality” is a “Hairball”
How do you get through?

String Graph: The “Reality”
• 10% error, 30% alignment threshold ⟹ 10%-repeats entangle
• 0.5% error, 2% alignment threshold ⟹ only <1%-repeats entangle
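One way to read the numbers on this slide is as simple arithmetic: two reads that each carry error rate e differ from one another by roughly 2e, so whatever slack remains between 2e and the alignment threshold is the amount of true divergence between repeat copies that the aligner cannot tell apart from noise. This interpretation, and the function names below, are assumptions rather than something stated in the talk.

```python
def pairwise_difference(read_error):
    """Approximate difference between two reads of the same locus.

    Assumes errors in the two reads are independent and simply add.
    """
    return 2 * read_error


def entangling_repeat_divergence(read_error, alignment_threshold):
    """Largest repeat divergence that still passes the alignment threshold."""
    slack = alignment_threshold - pairwise_difference(read_error)
    return max(0.0, slack)


# Raw reads: 10% error, 30% threshold.
print(entangling_repeat_divergence(0.10, 0.30))
# CCS reads: 0.5% error, 2% threshold.
print(entangling_repeat_divergence(0.005, 0.02))
```

With the slide's values the slack comes out near 10% for raw reads and near 1% for CCS reads, which matches the "10%-repeats" and "<1%-repeats" labels.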

Solving Repeats
• Small repeats: spanning reads suffice. All the power of long reads has thus far been due to this ⟹ longer is better.
• Large repeats: microhets could get you through (should be easy(er)). This has not been done. It requires the ability to identify microhets ⟹ more accurate is better.
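To make the microhet idea concrete, here is a minimal sketch of how candidate microheterogeneity columns might be picked out of a pile of reads that has already been aligned column by column. It assumes equal-length rows and a simple rule: a column is a candidate if its second most common symbol is seen at least `min_minor_count` times. The function name, the rule, and the threshold are all illustrative; the talk does not describe a specific method.

```python
from collections import Counter


def microhet_columns(pile, min_minor_count=2):
    """Return column indices where a minority symbol recurs across reads.

    `pile` is a list of equal-length strings, one row per aligned read.
    A difference seen in only one read is treated as a likely read error.
    """
    columns = []
    for index, column in enumerate(zip(*pile)):
        top_two = Counter(column).most_common(2)
        if len(top_two) == 2 and top_two[1][1] >= min_minor_count:
            columns.append(index)
    return columns


pile = [
    "ACGTAC",
    "ACGTAC",
    "ACCTAC",  # shares a C at column 2 with the next read
    "ACCTAC",
    "ACGTAA",  # lone difference at column 5
]
print(microhet_columns(pile))  # [2]
```

The lower the per-read error rate, the fewer lone differences there are to confuse with recurrent ones, which is the sense in which more accurate reads help here.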

Preliminary Work

PB reads: chimers, adaptamers, low-Q dropouts; 90% accuracy on average.
Perfect reads: no chimers, no adaptamers, no low-Q dropouts; 99.999% accurate uniformly; haplotype phased.
Pipeline: PB reads ⟶ Scrubber ⟶ perfect reads ⟶ Long Read Assembler
PB reads with scrubbing solves: artifacts, haplotypes, low copy repeats (≤ 5).

Task is easier, but still necessary:
• 99.8% average accuracy
• .5% of reads are chimers
• .02% of reads have no adaptamer
• 15% of reads have a low-Q “panel”: 100 bp with 5% or more error
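The low-Q "panel" criterion (100 bp with 5% or more error) can be sketched as a fixed-window scan. The sketch assumes a per-base 0/1 error flag is already available for each read, for example from comparison against overlapping reads; how those flags are obtained is outside its scope, and the function and parameter names are hypothetical rather than the scrubber's actual interface.

```python
def low_quality_panels(error_flags, panel=100, min_error_rate=0.05):
    """Return (start, end) intervals of panels at or above the error cutoff.

    `error_flags` holds one 0/1 value per base (1 = base judged erroneous).
    The read is cut into consecutive, non-overlapping panels of `panel` bases;
    the final panel may be shorter.
    """
    flagged = []
    for start in range(0, len(error_flags), panel):
        window = error_flags[start:start + panel]
        if sum(window) / len(window) >= min_error_rate:
            flagged.append((start, start + len(window)))
    return flagged


# A 300 bp toy read whose middle panel carries 6 errors in 100 bases.
read_flags = [0] * 100 + [1] * 6 + [0] * 94 + [0] * 100
print(low_quality_panels(read_flags))  # [(100, 200)]
```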

Daligner:
• Switch from 14-mers to 40-mers
• Take 1 out of every 10 mers (at random) (R. Durbin)
• 99.999% sensitivity (alignment between ≥ 1000 bp with .5% error in each read)

Compute Time Reductions
• Can use 1 Gbp blocks vs. 1/4 Gbp (and uses ≤ 8Gbp memory)
• 90 CPU hrs for 30X HG002 (vs 2000+ for raw reads) (on this laptop)
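The k-mer sampling idea can be sketched as follows. For two reads to seed an alignment from a sampled k-mer, both must make the same keep-or-drop decision for that k-mer, so this sketch decides by hashing the k-mer's content rather than by position. That choice, the use of CRC32 as the hash, and the function name are assumptions made for illustration; they are not a description of how Daligner implements its sampling.

```python
import zlib


def sampled_kmers(seq, k=40, keep_one_in=10):
    """Return the set of k-mers of `seq` kept by content-based sampling.

    A k-mer is kept when its CRC32 checksum is divisible by `keep_one_in`,
    so roughly one k-mer in `keep_one_in` survives and the same k-mer is
    always kept or always dropped, whichever read it occurs in.
    """
    kept = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if zlib.crc32(kmer.encode("ascii")) % keep_one_in == 0:
            kept.add(kmer)
    return kept


# Two reads sharing a 200 bp overlap of a made-up sequence.
genome = "".join("ACGT"[(i * i * 7 + i * 3) % 4] for i in range(400))
read_a, read_b = genome[:300], genome[100:]
shared_seeds = sampled_kmers(read_a) & sampled_kmers(read_b)
print(len(shared_seeds))
```

Every sampled k-mer of the error-free overlap shows up in both reads' samples, so seeds survive sampling as long as at least one kept k-mer in the overlap is free of errors in both reads.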

Concluding Remarks

• Accelerates assembly compute time
• HiFi reads likely to improve diploid assembly
• Likely to be quite effective at haplotype phasing / repeat separation
• Better CCS algorithms are needed

Acknowledgements
PacBio: Paul Peluso, James Drake, Kevin Corcoran, Jonas Korlach, Mike Hunkapiller
Dresden-Concept Genome Center
  CRTD TU-D: Andreas Dahl
  MPI-CBG: Sylke Winkler, Martin Pippel, German Tischler
