Slide 1

Slide 1 text

Detecting and Phasing Small Variants with Highly Accurate Long Reads
William Rowell, Senior Scientist, Bioinformatics Applications
PacBio SMRT Informatics Developers Conference, January 16, 2019
@nothingclever
For Research Use Only. Not for use in diagnostic procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved.

Slide 2

Slide 2 text

AGENDA
- Differences between highly accurate long reads and short reads
- Calling variants with existing tools
- Training new tools on long reads
- Making use of phase information to improve variant calls

Slide 4

Slide 4 text

CCS READS HAVE A DIFFERENT ERROR PROFILE FROM SHORT READS

Slide 6

Slide 6 text

SMALL VARIANT DETECTION WITH HIGH FIDELITY READS STARTS WITH A FAMILIAR WORKFLOW
- CCS Reads
- Align with minimap2
- Detect variants with GATK HaplotypeCaller
- Hard filter variants with GATK VariantFiltration
- Diploid variant calls

Slide 7

Slide 7 text

SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW
- CCS Reads
- Align with minimap2 (pbmm2): pbmm2 --preset CCS
- Detect variants with GATK HaplotypeCaller
- Hard filter variants with GATK VariantFiltration
- Diploid variant calls

Slide 8

Slide 8 text

SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW
- CCS Reads
- Align with minimap2 (pbmm2): pbmm2 --preset CCS
- Detect variants with GATK HaplotypeCaller: --pcr-indel-model AGGRESSIVE
- Hard filter variants with GATK VariantFiltration
- Diploid variant calls

Slide 9

Slide 9 text

SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW
- CCS Reads
- Align with minimap2 (pbmm2)
- Detect variants with GATK HaplotypeCaller
- Hard filter variants with GATK VariantFiltration
  - Strand Bias tests ❌
  - Mapping Quality tests ❌
  - Read position tests ❌
  - Variant Quality tests ✅
- Diploid variant calls

Slide 12

Slide 12 text

SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW
- CCS Reads
- Align with minimap2 (pbmm2): pbmm2 --preset CCS
- Detect variants with GATK HaplotypeCaller: --pcr-indel-model AGGRESSIVE
- Hard filter variants with GATK VariantFiltration
  - SNV → QD >= 2.0
  - 1 bp Indels → QD >= 5.0
  - >1 bp Indels → QD >= 2.0
- Diploid variant calls

Variants  Precision  Recall   F1
SNVs      99.468%    99.559%  99.513%
Indels    78.977%    81.248%  80.097%
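The type-specific QD thresholds on this slide can be sketched as a small decision rule. This is only an illustration of the logic, not the GATK VariantFiltration invocation used in the talk; the function names are hypothetical, and treating same-length substitutions like SNVs is an assumption the slide does not address.

```python
def qd_threshold(ref: str, alt: str) -> float:
    """Minimum QD required for a variant, using the thresholds from the slide."""
    indel_length = abs(len(ref) - len(alt))
    if indel_length == 0:
        # SNV. Assumption: any same-length substitution is handled like an SNV.
        return 2.0
    if indel_length == 1:
        # 1 bp indels get the stricter threshold.
        return 5.0
    # Indels longer than 1 bp.
    return 2.0


def passes_hard_filter(ref: str, alt: str, qd: float) -> bool:
    """True if the variant's QD meets the threshold for its type."""
    return qd >= qd_threshold(ref, alt)
```

For example, `passes_hard_filter("A", "AT", 4.9)` rejects a 1 bp insertion that an SNV with the same QD would survive, which reflects the stricter treatment of 1 bp indels.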

Slide 13

Slide 13 text

GOOGLE’S DEEPVARIANT CAN BE TRAINED ON NEW DATA TYPES
CCS Reads (chr. 1-19) + GIAB Truth Set → DeepVariant CNN training → CCS model
Nature Biotechnology volume 36, pages 983–987 (2018)

Slide 14

Slide 14 text

GOOGLE’S DEEPVARIANT CAN BE TRAINED ON NEW DATA TYPES
CCS Reads (chr. 1-19) + GIAB Truth Set → DeepVariant CNN training → CCS model
CCS Reads → DeepVariant + CCS model → Diploid variant calls
Nature Biotechnology volume 36, pages 983–987 (2018)

Slide 15

Slide 15 text

GOOGLE’S DEEPVARIANT CAN BE TRAINED ON NEW DATA TYPES
CCS Reads (chr. 1-19) + GIAB Truth Set → DeepVariant CNN training → CCS model
CCS Reads → DeepVariant + CCS model → Diploid variant calls
Nature Biotechnology volume 36, pages 983–987 (2018)

Regions    Variants  Precision  Recall   F1
autosomes  SNVs      99.914%    99.959%  99.936%
autosomes  Indels    96.901%    95.980%  96.438%

Slide 16

Slide 16 text

GOOGLE’S DEEPVARIANT CAN BE TRAINED ON NEW DATA TYPES
CCS Reads (chr. 1-19) + GIAB Truth Set → DeepVariant CNN training → CCS model
CCS Reads → DeepVariant + CCS model → Diploid variant calls
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology 36, 983–987 (2018)

Regions        Variants  Precision  Recall   F1
autosomes      SNVs      99.914%    99.959%  99.936%
autosomes      Indels    96.901%    95.980%  96.438%
chromosome 20  SNVs      99.807%    99.904%  99.855%
chromosome 20  Indels    95.387%    94.501%  94.942%
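The slide reports results both on the autosomes and on chromosome 20, which lies outside the chr. 1-19 training set. A minimal sketch of that kind of split follows; the "chrN" contig naming and the helper name are assumptions for illustration, and the slides do not say how contigs other than chr. 1-19 and chromosome 20 were used.

```python
# Assumption: contigs are named "chr1" ... "chr22"; the slides only say
# "chr. 1-19" for training and "chromosome 20" for the separate evaluation.
TRAIN_CONTIGS = {f"chr{i}" for i in range(1, 20)}
HOLDOUT_CONTIGS = {"chr20"}

# The held-out evaluation is only meaningful if the two sets do not overlap.
assert TRAIN_CONTIGS.isdisjoint(HOLDOUT_CONTIGS)


def contig_role(contig: str) -> str:
    """Classify a contig as 'train', 'holdout', or 'other' (hypothetical helper)."""
    if contig in TRAIN_CONTIGS:
        return "train"
    if contig in HOLDOUT_CONTIGS:
        return "holdout"
    # Not described in the slides (for example chr21, chr22, chrX).
    return "other"
```

Read this way, the chromosome 20 rows are the ones that measure performance on data the model did not see during training.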

Slide 17

Slide 17 text

BUT WE’RE STILL LEAVING INFORMATION ON THE TABLE…

Slide 19

Slide 19 text

PHASE INFORMATION CAN IMPROVE PRECISION AND RECALL

Slide 20

Slide 20 text

INCORPORATING PHASE INFORMATION LEADS TO IMPROVEMENTS IN VARIANT RECALL AND PRECISION
Incorporate phase information and re-genotype variant positions with WhatsHap

Workflow            Variants  Precision  Recall   F1
GATK HC             SNVs      99.468%    99.559%  99.513%
GATK HC             Indels    78.977%    81.248%  80.097%
GATK HC + WhatsHap  SNVs      99.693%    99.792%  99.742%
GATK HC + WhatsHap  Indels    81.102%    83.818%  82.438%

Ebler, J. et al. Haplotype-aware genotyping from noisy long reads bioRxiv doi: 10.1101/293944
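The F1 column appears to be the harmonic mean of precision and recall. The sketch below recomputes it from the slide's precision and recall values and derives the F1 gain from adding WhatsHap; the dictionary layout and function names are illustrative choices, and small differences from the reported F1 are expected because the inputs are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (percent in, percent out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from the slide.
REPORTED = {
    ("GATK HC", "SNVs"): (99.468, 99.559, 99.513),
    ("GATK HC", "Indels"): (78.977, 81.248, 80.097),
    ("GATK HC + WhatsHap", "SNVs"): (99.693, 99.792, 99.742),
    ("GATK HC + WhatsHap", "Indels"): (81.102, 83.818, 82.438),
}


def f1_gain(variant_type: str) -> float:
    """Recomputed F1 gain, in percentage points, from adding WhatsHap."""
    base = f1_score(*REPORTED[("GATK HC", variant_type)][:2])
    phased = f1_score(*REPORTED[("GATK HC + WhatsHap", variant_type)][:2])
    return phased - base


if __name__ == "__main__":
    for variant_type in ("SNVs", "Indels"):
        print(f"{variant_type}: F1 gain {f1_gain(variant_type):.3f} points")
```

The gain is larger for indels than for SNVs, although indel F1 remains well below SNV F1 in this workflow.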

Slide 21

Slide 21 text

INCORPORATING PHASE INFORMATION LEADS TO MODEST INCREASES IN VARIANT RECALL AND PRECISION

Workflow                             Variants  Precision  Recall   F1
GATK HC                              SNVs      99.468%    99.559%  99.513%
GATK HC                              Indels    78.977%    81.248%  80.097%
GATK HC + WhatsHap                   SNVs      99.693%    99.792%  99.742%
GATK HC + WhatsHap                   Indels    81.102%    83.818%  82.438%
DeepVariant CCS                      SNVs      99.914%    99.959%  99.936%
DeepVariant CCS                      Indels    96.901%    95.980%  96.438%
DeepVariant CCS + haplotype sorting  SNVs      99.904%    99.963%  99.934%
DeepVariant CCS + haplotype sorting  Indels    97.835%    97.141%  97.486%

- GATK HC + WhatsHap: Incorporate phase information and re-genotype variant positions with WhatsHap
- DeepVariant CCS + haplotype sorting: Tag reads with phase information from trio data and sort alignments by haplotype https://goo.gl/4cnMeC

Ebler, J. et al. Haplotype-aware genotyping from noisy long reads bioRxiv doi: 10.1101/293944
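The "sort alignments by haplotype" step can be pictured as a simple reordering of tagged reads. The sketch below is not the pipeline from the talk: the `Alignment` record, its field names, and the choice to place unphased reads last are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Alignment:
    """Hypothetical minimal alignment record."""
    name: str
    start: int
    haplotype: Optional[int]  # 1 or 2 if phased, None if unphased


def sort_by_haplotype(alignments: List[Alignment]) -> List[Alignment]:
    """Group reads by haplotype, then order by start position within each group.

    Assumption: unphased reads are placed after both haplotypes.
    """
    def key(aln: Alignment):
        unphased = aln.haplotype is None
        return (unphased, aln.haplotype or 0, aln.start)

    return sorted(alignments, key=key)


if __name__ == "__main__":
    reads = [
        Alignment("r1", 100, 2),
        Alignment("r2", 50, None),
        Alignment("r3", 200, 1),
    ]
    for aln in sort_by_haplotype(reads):
        print(aln.name, aln.haplotype, aln.start)
```

Grouping reads this way keeps the evidence for each haplotype contiguous, which is presumably what makes the phase information visible to a caller that looks at read pileups.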

Slide 22

Slide 22 text

CONCLUSIONS
- GATK HaplotypeCaller (optimized for short reads) can be used to detect SNVs with high recall and precision, but has trouble discriminating between biological indels and sequencing errors.
- DeepVariant, when trained on long reads, can be used to detect both SNVs and indels with high recall and precision.
- Both workflows can be improved by providing long-distance phasing information, but there’s still work to be done in this area.

Slide 23

Slide 23 text

ACKNOWLEDGEMENTS
- Google AI Genomics: Alexey Kolesnikov, Pi-Chuan Chang, Andrew Carroll, Mark DePristo
- Saarland University/Max Planck Institute for Informatics: Jana Ebler, Tobias Marschall
- PacBio: Yufeng Qian, Richard Hall, Aaron Wenger, Paul Peluso, David Rank, Mike Hunkapiller
- DNAnexus: Jason Chin
- NIST: Nathan Olson, Justin Zook

Slide 24

Slide 24 text

Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. FEMTO Pulse and Fragment Analyzer are trademarks of Advanced Analytical Technologies. All other trademarks are the sole property of their respective owners. www.pacb.com