Detecting and Phasing Small Variants with Highly Accurate Long Reads

We summarize the challenges of small variant detection with highly accurate (>=99%) long reads and present workflows built on an existing tool (GATK) and a newer one (DeepVariant with a model trained on CCS reads). Presented at the PacBio SMRT Informatics Workshop in San Diego.

William Rowell

January 16, 2019
Transcript

  1. For Research Use Only. Not for use in diagnostic procedures.

    © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved.
    Detecting and Phasing Small Variants with Highly Accurate Long Reads
    William Rowell, Senior Scientist, Bioinformatics Applications
    PacBio SMRT Informatics Developers Conference, January 16, 2019
    @nothingclever
  2. AGENDA

    - Differences between highly accurate long reads and short reads
    - Calling variants with existing tools
    - Training new tools on long reads
    - Making use of phase information to improve variant calls
  4. CCS READS HAVE A DIFFERENT ERROR PROFILE FROM SHORT READS

  6. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS STARTS WITH A FAMILIAR WORKFLOW

    CCS Reads → Align with minimap2 → Detect variants with GATK HaplotypeCaller → Hard filter variants with GATK VariantFiltration → Diploid variant calls
  7. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW

    CCS Reads → Align with minimap2 (pbmm2) → Detect variants with GATK HaplotypeCaller → Hard filter variants with GATK VariantFiltration → Diploid variant calls
    Align: pbmm2 --preset CCS
  8. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW

    CCS Reads → Align with minimap2 (pbmm2) → Detect variants with GATK HaplotypeCaller → Hard filter variants with GATK VariantFiltration → Diploid variant calls
    Align: pbmm2 --preset CCS
    Detect variants: GATK HaplotypeCaller --pcr-indel-model AGGRESSIVE
  9. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW

    CCS Reads → Align with minimap2 (pbmm2) → Detect variants with GATK HaplotypeCaller → Hard filter variants with GATK VariantFiltration → Diploid variant calls
    - Strand Bias tests ❌
    - Mapping Quality tests ❌
    - Read position tests ❌
    - Variant Quality tests ✅
  10. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW

    CCS Reads → Align with minimap2 (pbmm2) → Detect variants with GATK HaplotypeCaller → Hard filter variants with GATK VariantFiltration → Diploid variant calls
  12. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW

    CCS Reads → Align with minimap2 (pbmm2) → Detect variants with GATK HaplotypeCaller → Hard filter variants with GATK VariantFiltration → Diploid variant calls
    Align: pbmm2 --preset CCS
    Detect variants: GATK HaplotypeCaller --pcr-indel-model AGGRESSIVE
    Hard filters:
    - SNV → QD >= 2.0
    - 1 bp Indels → QD >= 5.0
    - >1 bp Indels → QD >= 2.0

              Precision  Recall   F1
    SNVs      99.468%    99.559%  99.513%
    Indels    78.977%    81.248%  80.097%
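The QD thresholds on slide 12 can be illustrated with a minimal Python sketch. This is not the GATK VariantFiltration implementation; the `Variant` record and both function names are hypothetical, and the sketch assumes biallelic records where equal-length alleles are SNVs.

```python
from typing import NamedTuple

# Minimum QD (QualByDepth) per variant class, taken from the slide.
QD_MIN = {"snv": 2.0, "indel_1bp": 5.0, "indel_long": 2.0}


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not a GATK or VCF library type)."""
    ref: str
    alt: str
    qd: float


def variant_class(v: Variant) -> str:
    """Classify a biallelic variant by the length difference of its alleles."""
    size = abs(len(v.ref) - len(v.alt))
    if size == 0:
        return "snv"  # assumption: equal-length alleles are SNVs (MNVs ignored)
    return "indel_1bp" if size == 1 else "indel_long"


def passes_hard_filter(v: Variant) -> bool:
    """Return True when the variant's QD meets the threshold for its class."""
    return v.qd >= QD_MIN[variant_class(v)]


if __name__ == "__main__":
    for v in (Variant("A", "G", 2.0), Variant("A", "AT", 4.9), Variant("ATT", "A", 2.5)):
        print(v, variant_class(v), passes_hard_filter(v))
```

The stricter threshold applies only to 1 bp indels, the class the slide singles out with a higher QD cutoff.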
  13. GOOGLE’S DEEPVARIANT CAN BE TRAINED ON NEW DATA TYPES

    CCS Reads (chr. 1-19) + GIAB Truth Set → DeepVariant CNN training → CCS model
    Nature Biotechnology volume 36, pages 983–987 (2018)
  14. GOOGLE’S DEEPVARIANT CAN BE TRAINED ON NEW DATA TYPES

    CCS Reads (chr. 1-19) + GIAB Truth Set → DeepVariant CNN training → CCS model
    Nature Biotechnology volume 36, pages 983–987 (2018)
    CCS Reads → DeepVariant + CCS model → Diploid variant calls
  15. GOOGLE’S DEEPVARIANT CAN BE TRAINED ON NEW DATA TYPES

    CCS Reads (chr. 1-19) + GIAB Truth Set → DeepVariant CNN training → CCS model
    Nature Biotechnology volume 36, pages 983–987 (2018)
    CCS Reads → DeepVariant + CCS model → Diploid variant calls

    autosomes
              Precision  Recall   F1
    SNVs      99.914%    99.959%  99.936%
    Indels    96.901%    95.980%  96.438%
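The F1 column in these tables is the harmonic mean of precision and recall. A short Python sketch follows, using the autosome values from slide 15; the function name is illustrative. Because the slide reports rounded percentages, recomputed values may differ from the reported F1 in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# DeepVariant + CCS model, autosomes (values in percent, from the slide).
snv_f1 = f1_score(99.914, 99.959)
indel_f1 = f1_score(96.901, 95.980)

if __name__ == "__main__":
    print(f"SNV F1 = {snv_f1:.3f}%  indel F1 = {indel_f1:.3f}%")
```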
  16. GOOGLE’S DEEPVARIANT CAN BE TRAINED ON NEW DATA TYPES

    CCS Reads (chr. 1-19) + GIAB Truth Set → DeepVariant CNN training → CCS model
    Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology 36, 983–987 (2018)
    CCS Reads → DeepVariant + CCS model → Diploid variant calls

    autosomes
              Precision  Recall   F1
    SNVs      99.914%    99.959%  99.936%
    Indels    96.901%    95.980%  96.438%

    chromosome 20
              Precision  Recall   F1
    SNVs      99.807%    99.904%  99.855%
    Indels    95.387%    94.501%  94.942%
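The slide trains on chr. 1-19 and reports chromosome 20 separately, which suggests chromosome 20 serves as held-out evaluation data. The Python sketch below shows that kind of split under that assumption. The `chr`-prefixed contig names and the `split_for` helper are hypothetical, and the slide does not say how the remaining chromosomes were used, so they are labeled "unused" here.

```python
# chr1..chr19 are used for training, per the slide.
TRAIN_CHROMS = {f"chr{i}" for i in range(1, 20)}
# Assumption: chr20 is held out and used only for evaluation.
HOLDOUT_CHROMS = {"chr20"}


def split_for(contig: str) -> str:
    """Return which split a contig belongs to: 'train', 'holdout', or 'unused'."""
    if contig in TRAIN_CHROMS:
        return "train"
    if contig in HOLDOUT_CHROMS:
        return "holdout"
    return "unused"


if __name__ == "__main__":
    for contig in ("chr1", "chr19", "chr20", "chr21", "chrX"):
        print(contig, split_for(contig))
```

Keeping the evaluation chromosome out of training is what makes the chromosome 20 table a fairer estimate of performance on unseen data than the autosome-wide table.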
  17. BUT WE’RE STILL LEAVING INFORMATION ON THE TABLE…

  19. PHASE INFORMATION CAN IMPROVE PRECISION AND RECALL

  20. INCORPORATING PHASE INFORMATION LEADS TO IMPROVEMENTS IN VARIANT RECALL AND PRECISION

    Incorporate phase information and re-genotype variant positions with WhatsHap

    Workflow             Variants  Precision  Recall   F1
    GATK HC              SNVs      99.468%    99.559%  99.513%
    GATK HC              Indels    78.977%    81.248%  80.097%
    GATK HC + WhatsHap   SNVs      99.693%    99.792%  99.742%
    GATK HC + WhatsHap   Indels    81.102%    83.818%  82.438%

    Ebler, J. et al. Haplotype-aware genotyping from noisy long reads. bioRxiv doi: 10.1101/293944
  21. INCORPORATING PHASE INFORMATION LEADS TO MODEST INCREASES IN VARIANT RECALL AND PRECISION

    Workflow                              Variants  Precision  Recall   F1
    GATK HC                               SNVs      99.468%    99.559%  99.513%
    GATK HC                               Indels    78.977%    81.248%  80.097%
    GATK HC + WhatsHap                    SNVs      99.693%    99.792%  99.742%
    GATK HC + WhatsHap                    Indels    81.102%    83.818%  82.438%
    DeepVariant CCS                       SNVs      99.914%    99.959%  99.936%
    DeepVariant CCS                       Indels    96.901%    95.980%  96.438%
    DeepVariant CCS + haplotype sorting   SNVs      99.904%    99.963%  99.934%
    DeepVariant CCS + haplotype sorting   Indels    97.835%    97.141%  97.486%

    GATK HC + WhatsHap: Incorporate phase information and re-genotype variant positions with WhatsHap
    DeepVariant CCS + haplotype sorting: Tag reads with phase information from trio data and sort alignments by haplotype https://goo.gl/4cnMeC

    Ebler, J. et al. Haplotype-aware genotyping from noisy long reads. bioRxiv doi: 10.1101/293944
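The "sort alignments by haplotype" step can be pictured as reordering reads so that those with the same haplotype label sit together before they reach the variant caller. The Python sketch below only illustrates that idea and is not the procedure used in the talk. The `Alignment` record is hypothetical, and placing untagged reads first is an arbitrary assumption.

```python
from typing import List, NamedTuple, Optional


class Alignment(NamedTuple):
    """Hypothetical minimal alignment record (not a pysam or htslib type)."""
    name: str
    start: int
    haplotype: Optional[int]  # 1 or 2 when phased, None when untagged


def sort_by_haplotype(alignments: List[Alignment]) -> List[Alignment]:
    """Group reads by haplotype label, then order by start position in each group."""
    def key(a: Alignment):
        # Assumption: untagged reads are placed first (label 0).
        label = a.haplotype if a.haplotype is not None else 0
        return (label, a.start)
    return sorted(alignments, key=key)


reads = [
    Alignment("r1", 100, 2),
    Alignment("r2", 50, None),
    Alignment("r3", 75, 1),
    Alignment("r4", 60, 2),
]
ordered = sort_by_haplotype(reads)

if __name__ == "__main__":
    for a in ordered:
        print(a.name, a.haplotype, a.start)
```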
  22. CONCLUSIONS

    - GATK HaplotypeCaller (optimized for short reads) can be used to detect SNVs with high recall and precision, but has trouble discriminating between biological indels and sequencing errors.
    - DeepVariant, when trained on long reads, can be used to detect both SNVs and indels with high recall and precision.
    - Both workflows can be improved by providing long-distance phasing information, but there’s still work to be done in this area.
  23. ACKNOWLEDGEMENTS

    Google AI Genomics: Alexey Kolesnikov, Pi-Chuan Chang, Andrew Carroll, Mark DePristo
    Saarland University/Max Planck Institute for Informatics: Jana Ebler, Tobias Marschall
    PacBio: Yufeng Qian, Richard Hall, Aaron Wenger, Paul Peluso, David Rank, Mike Hunkapiller
    DNANexus: Jason Chin
    NIST: Nathan Olson, Justin Zook
  24. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. FEMTO Pulse and Fragment Analyzer are trademarks of Advanced Analytical Technologies. All other trademarks are the sole property of their respective owners. www.pacb.com