Using PacBio Circular Consensus Sequencing (CCS) for Highly Accurate Assemblies

Slide 1

Slide 1 text

Using PacBio Circular Consensus Sequencing (CCS) for Highly Accurate Assemblies
Gene Myers
Chair of Systems Biology
MPI for Cell Biology and Genetics, Dresden, DE
MPI-CBG, CSBD

Slide 2

Slide 2 text

HQ Genomics Today & Tomorrow

Slide 3

Slide 3 text

HQ Genomics
Assume 1Gbp Genome

60X PacBio long reads
Scaffolding Technologies:
• 50X Illumina in 10X read clouds
• Bionano restriction maps
• 50X Illumina in Hi-C read pairs

Costs shown: 10K EU, 5K EU (2017); 1.2K EU, 3.5K EU (2019/2020)

PB + Bionano: 6 Bats Project: 10Mbp Contig N50, 100Mbp Scaffold N50
But favor PB + Hi-C
PB only or PB + 1 would be a significant savings

Slide 4

Slide 4 text

HQ Genomes Tomorrow:

At least 2 ways to improve:
• Longer or more accurate reads (CCS)
• Better Algorithms
  ✦ Scrubbing to remove artifacts
  ✦ Repeat/Haplotype separation based on heterogeneity
  ✦ Repeat detection and modeling

HiFi CCS protocol:
~3x loss in throughput and cost over raw
But each insert wrapped ~8x ⟹ ~0.2% error rate

Which is better?
- 15Kbp reads at 99.8%
- 50Kbp reads at 90%
- some combination of both?
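To make the "which is better?" comparison concrete, the sketch below works out the expected number of erroneous bases per read and the mean spacing between errors for the two read types on the slide. It assumes errors are independent and uniformly spread along a read, which is a simplification; the function names are illustrative and not from the talk.

```python
def expected_errors(read_length_bp: int, accuracy: float) -> float:
    """Expected number of erroneous bases in one read (uniform-error assumption)."""
    return read_length_bp * (1.0 - accuracy)


def mean_error_spacing(accuracy: float) -> float:
    """Average number of bases between consecutive errors."""
    return 1.0 / (1.0 - accuracy)


# The two options listed on the slide.
hifi_errors = expected_errors(15_000, 0.998)  # roughly 30 errors per read
raw_errors = expected_errors(50_000, 0.90)    # roughly 5000 errors per read

hifi_spacing = mean_error_spacing(0.998)      # roughly one error every 500 bp
raw_spacing = mean_error_spacing(0.90)        # roughly one error every 10 bp

print(f"HiFi: {hifi_errors:.0f} errors/read, one per {hifi_spacing:.0f} bp")
print(f"Raw:  {raw_errors:.0f} errors/read, one per {raw_spacing:.0f} bp")
```

Under this simple model the longer raw read spans more of the genome but carries far more errors per read, which is the trade-off the slide asks about.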

Slide 5

Slide 5 text

Conceptual Effect of Read Accuracy on Assembly

Slide 6

Slide 6 text

String Graph: The “Reality”
“Hairball”: How do you get through?

Slide 7

Slide 7 text

String Graph: The “Reality”
10% error, 30% alignment threshold ⟹ 10%-repeats entangle
0.5% error, 2% alignment threshold ⟹ only <1%-repeats entangle
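One way to read the numbers on this slide: two reads of the same locus each carry their own errors, so a true overlap differs by roughly twice the per-read error rate, and whatever slack remains under the alignment threshold is divergence that repeat copies can "hide" in. The sketch below encodes that reading; the additive model and the function name are assumptions made for illustration, not something stated in the talk.

```python
def entangling_divergence(read_error: float, align_threshold: float) -> float:
    """Largest repeat-copy divergence that still passes the alignment threshold.

    Assumes the differences between two reads are roughly additive:
    2 * read_error from sequencing noise plus the true divergence
    between the two repeat copies.
    """
    pairwise_noise = 2.0 * read_error
    return max(0.0, align_threshold - pairwise_noise)


# Raw reads: 10% error, 30% threshold.
raw_limit = entangling_divergence(0.10, 0.30)    # about 0.10, i.e. 10%-repeats entangle

# CCS reads: 0.5% error, 2% threshold.
hifi_limit = entangling_divergence(0.005, 0.02)  # about 0.01, i.e. only <1%-repeats entangle

print(f"raw reads:  repeats up to {raw_limit:.1%} divergent can entangle")
print(f"HiFi reads: repeats up to {hifi_limit:.1%} divergent can entangle")
```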

Slide 8

Slide 8 text

Solving Repeats

Small repeats: spanning reads suffice
All the power of long reads has thus far been due to this ⟹ longer is better

Large repeats: microhets could get you through (should be easy(er))
This has not been done
Requires ability to identify microhets ⟹ more accurate is better
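As a rough illustration of what "identifying microhets" could involve, the sketch below scans a pile of reads that have already been aligned column by column and reports columns where at least two different bases are each supported by several reads. The input format, the `min_support` cutoff, and the function names are all hypothetical; a real implementation would work from alignments with indels rather than equal-length strings.

```python
from collections import Counter


def microhet_columns(pile, min_support=3):
    """Return column indices where two or more bases each have >= min_support reads."""
    columns = []
    for i in range(len(pile[0])):
        counts = Counter(read[i] for read in pile)
        well_supported = [base for base, n in counts.items() if n >= min_support]
        if len(well_supported) >= 2:
            columns.append(i)
    return columns


def split_by_column(pile, column):
    """Group read indices by the base they show at one informative column."""
    groups = {}
    for index, read in enumerate(pile):
        groups.setdefault(read[column], []).append(index)
    return groups


# Toy pile: reads 0-2 come from one repeat copy, reads 3-5 from another.
# Read 2 also carries a sporadic error at column 1, which is not flagged
# because only one read supports it.
pile = [
    "ACGTACGT",
    "ACGTACGT",
    "AGGTACGT",
    "ACGTTCGT",
    "ACGTTCGT",
    "ACGTTCGT",
]

informative = microhet_columns(pile)         # [4]
groups = split_by_column(pile, informative[0])
print(informative, groups)
```

The lower the read error rate, the easier it is to tell a consistent difference between copies from sporadic noise, which is the point behind "more accurate is better".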

Slide 9

Slide 9 text

Preliminary Work

Slide 10

Slide 10 text

Perfect PB reads with Scrubbing

Reads: Chimers, Adaptamers, Low Q dropouts; 90% average
Scrubber
Reads: No Chimers, No Adaptamers, No Low Q dropouts; 99.999% uniformly; Haplotype Phased
Long Read Assembler

Solves: Artifacts, Haplotypes, Low Copy Repeats (≤ 5)

Task is easier, but still necessary:
• 99.8% average
• 0.5% of reads are chimers
• 0.02% of reads have no adaptamer
• 15% of reads have a low Q “panel”: 100bp with 5% or more error
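The low Q "panel" criterion (100bp with 5% or more error) can be sketched as a sliding-window scan. The example below assumes per-base error flags for a read are already available, for instance inferred from how other reads align to it; that input, the function name, and the merging of overlapping windows are illustrative assumptions, not a description of the scrubber used in the talk.

```python
def low_quality_panels(is_error, window=100, min_errors=5):
    """Find stretches where some `window`-bp window holds >= min_errors errors.

    `is_error` is a sequence of booleans, one per base (True = erroneous base).
    Returns merged half-open intervals (start, end) covering all such windows.
    """
    n = len(is_error)
    if n < window:
        return []
    panels = []
    count = sum(is_error[:window])  # errors in the first window
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window by one base: add the new base, drop the old one.
            count += is_error[start + window - 1] - is_error[start - 1]
        if count >= min_errors:
            end = start + window
            if panels and start <= panels[-1][1]:
                panels[-1] = (panels[-1][0], end)  # extend the previous panel
            else:
                panels.append((start, end))
    return panels


# Toy read of 1000 bases: a cluster of 5 errors near position 400,
# plus one isolated error at position 900 that should not be flagged.
flags = [False] * 1000
for position in (400, 410, 420, 430, 440, 900):
    flags[position] = True

panels = low_quality_panels(flags)
print(panels)
```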

Slide 11

Slide 11 text

Daligner: Compute Time Reductions

Switch from 14-mers to 40-mers
Take 1 out of every 10 mers (at random) (R. Durbin)

• 99.999% Sensitivity (alignment between ≥1000bp with 0.5% error in each read)
• Can use 1Gbp blocks vs. 1/4Gbp (and uses ≤ 8Gbp memory)
• 90 CPU hrs for 30X HG002 (vs 2000+ for raw reads) (on this laptop)
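The idea of keeping only 1 in 10 k-mers can be sketched as follows. The slide says the k-mers are taken "at random"; the version below instead decides from a hash of the k-mer itself, so that two overlapping reads keep the same k-mers and can still seed an alignment. That hash-based choice, the function name, and the toy data are assumptions for illustration and are not taken from daligner.

```python
import hashlib
import random


def sampled_kmers(seq, k=40, keep_one_in=10):
    """Return the set of k-mers of `seq` whose hash falls in the kept 1/keep_one_in."""
    kept = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        if int.from_bytes(digest, "big") % keep_one_in == 0:
            kept.add(kmer)
    return kept


# Toy genome and two error-free reads that overlap by 2000 bp.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
read_a = genome[0:3000]
read_b = genome[1000:4000]

seeds_a = sampled_kmers(read_a)
seeds_b = sampled_kmers(read_b)
shared = seeds_a & seeds_b  # seeds that both reads kept

total_kmers = len(genome) - 40 + 1
kept_fraction = len(sampled_kmers(genome)) / total_kmers
print(f"kept {kept_fraction:.1%} of k-mers; {len(shared)} shared seeds in the overlap")
```

Because the keep/drop decision depends only on the k-mer, every sampled 40-mer inside the overlap appears in both reads' seed sets. With real reads, sequencing errors destroy some of those seeds, which is where read accuracy comes in.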

Slide 12

Slide 12 text

Concluding Remarks

Slide 13

Slide 13 text

• Accelerates assembly compute time
• HiFi reads likely to improve diploid assembly
• Likely to be quite effective at haplotype phasing / repeat separation
• Better CCS algorithms are needed

Slide 14

Slide 14 text

Acknowledgements

PacBio: Paul Peluso, James Drake, Kevin Corcoran, Jonas Korlach, Mike Hunkapiller

Dresden-Concept Genome Center
CRTD TU-D: Andreas Dahl
MPI-CBG: Sylke Winkler, Martin Pippel, German Tischler