Slide 1

Slide 1 text

Evaluation of HG002 v4 draft benchmark against GATK calls on PacBio HiFi reads
William Rowell, Sr Bioinformatics Scientist
2019-10-15
@nothingclever

For Research Use Only. Not for use in diagnostic procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved.

Slide 2

Slide 2 text

AGENDA
- PacBio Sequencing Modes: Long reads (CLR) vs HiFi
- HiFi datasets available through GIAB
- Detecting variants in HiFi reads with GATK HaplotypeCaller
- Evaluation of v4 draft benchmark

Slide 3

Slide 3 text

TWO MODES OF PACBIO SMRT SEQUENCING
Continuous Long Read Sequencing (CLR): long reads >20 kb, 90% accuracy
[Diagram: Long Read 1 ... Long Read n, consensus sequence]

Slide 4

Slide 4 text

TWO MODES OF PACBIO SMRT SEQUENCING
Continuous Long Read Sequencing (CLR): long reads >20 kb, 90% accuracy
Circular Consensus Sequencing (CCS): HiFi reads ≤20 kb, >99% accuracy
[Diagram: CLR panel with Long Read 1 ... Long Read n and a consensus sequence; CCS panel with Subread 1 ... Subread n and a HiFi read]

Slide 5

Slide 5 text

HIFI READS MAP THROUGH DIFFICULT REGIONS
[Figure: short-read and PacBio HiFi alignments at STRC]
STRC is a congenital deafness gene that requires long reads to cover all exons.
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37, 1155–1162 (2019).

Slide 6

Slide 6 text

PACBIO HIFI DATASETS FOR GIAB SAMPLES
Each dataset sequenced to approximately 30-fold coverage

Sample  Insert length  Platform          Reads (SRA)             Alignments
HG002   10 kb          Sequel System     https://bit.ly/2OCLeA2  https://bit.ly/2OCLeA2
HG002   15 kb          Sequel System     PRJNA520771             https://bit.ly/2p1ISA8
HG002   11 kb          Sequel II System  PRJNA527278             https://bit.ly/2VqdJm1
HG001   11 kb          Sequel II System  PRJNA540705             https://bit.ly/2AWtVSM
HG005   11 kb          Sequel II System  PRJNA540706             https://bit.ly/2ogGbuI

Slide 7

Slide 7 text

DETECTING VARIANTS IN HIFI READS WITH GATK HAPLOTYPECALLER
Workflow: HiFi reads → pbmm2 (SMRT Link Mapping) → HaplotypeCaller → VariantFiltration (GATK4) → variant calls (vcf)
DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491–498 (2011).
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37, 1155–1162 (2019).
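The workflow on this slide can be sketched as three command lines. The snippet below only builds the argument lists and prints them; it does not run any tool. The helper name `build_pipeline`, the file names, the `--preset CCS` and `--sort` options for pbmm2, and the `QD < 2.0` filter are illustrative assumptions, not settings stated in the talk.

```python
import shlex
from typing import List


def build_pipeline(ref: str, hifi_reads: str, prefix: str) -> List[List[str]]:
    """Return mapping, calling, and filtering commands as argument lists.

    All paths and option values are placeholders (assumptions), not the
    parameters used for the results in this presentation.
    """
    aligned = f"{prefix}.aligned.bam"
    raw_vcf = f"{prefix}.raw.vcf.gz"
    filtered_vcf = f"{prefix}.filtered.vcf.gz"
    return [
        # Step 1: map HiFi reads to the reference with pbmm2
        ["pbmm2", "align", "--preset", "CCS", "--sort", ref, hifi_reads, aligned],
        # Step 2: call variants with GATK4 HaplotypeCaller
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", aligned, "-O", raw_vcf],
        # Step 3: mark low-confidence calls with VariantFiltration
        # (the filter name and expression are assumed examples)
        ["gatk", "VariantFiltration", "-R", ref, "-V", raw_vcf, "-O", filtered_vcf,
         "--filter-name", "LowQD", "--filter-expression", "QD < 2.0"],
    ]


if __name__ == "__main__":
    for cmd in build_pipeline("ref.fasta", "hifi.ccs.bam", "HG002"):
        print(shlex.join(cmd))
```

Keeping the commands as lists makes it easy to review them before passing each one to `subprocess.run`, and each step reads the output of the previous one.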

Slide 8

Slide 8 text

DETECTING VARIANTS IN HIFI READS WITH GATK HAPLOTYPECALLER
- High SNP Recall and Precision
- Lower Indel Recall and Precision, due to 1 bp indel errors

Slide 9

Slide 9 text

DETECTING VARIANTS IN HIFI READS WITH GATK HAPLOTYPECALLER
- High SNP Recall and Precision
- Lower Indel Recall and Precision, due to 1 bp indel errors
- HaplotypeCaller optimized for error mode of short reads
[Chart: read error types (Indel, Mismatch); labels: 96.6% PacBio HiFi, 99.1% Short reads]

Slide 10

Slide 10 text

DETECTING VARIANTS IN HIFI READS WITH GATK HAPLOTYPECALLER
- High SNP Recall and Precision
- Lower Indel Recall and Precision, due to 1 bp indel errors
- HaplotypeCaller optimized for error mode of short reads
- We recommend using a caller that can adapt to the error mode of long reads, such as DeepVariant (see Pi-Chuan Chang’s lightning talk)

Slide 11

Slide 11 text

V4 DRAFT BENCHMARK INCREASES PRECISION AND TRUE POSITIVE VARIANTS
References shown: GRCh38, hs37d5 (one v3.3.2 vs v4 draft comparison per reference)

Benchmark  Type    Recall  Precision  TP
v3.3.2     SNVs    99.8%   99.6%      3,042,089
v3.3.2     Indels  86.3%   92.4%      401,306
v4 draft   SNVs    99.7%   99.8%      3,314,633
v4 draft   Indels  86.1%   92.8%      444,945
v4 draft vs v3.3.2: + ~272k TP SNPs, + ~44k TP INDELs

Benchmark  Type    Recall  Precision  TP
v3.3.2     SNVs    99.8%   99.5%      3,022,502
v3.3.2     Indels  83.8%   92.5%      398,726
v4 draft   SNVs    99.7%   99.7%      3,306,764
v4 draft   Indels  85.9%   92.7%      444,342
v4 draft vs v3.3.2: + ~284k TP SNPs, + ~46k TP INDELs
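The TP gains quoted on this slide follow directly from the TP counts, and the recall and precision columns allow a rough estimate of the FN and FP counts behind them. The sketch below is illustrative only: the labels `comparison_1` and `comparison_2` and the function names are hypothetical, and because the percentages on the slide are rounded to one decimal place, the implied FN and FP counts are approximations.

```python
from typing import Dict, Tuple


def implied_errors(tp: int, recall: float, precision: float) -> Tuple[int, int]:
    """Estimate (FN, FP) from TP using recall = TP/(TP+FN), precision = TP/(TP+FP)."""
    fn = tp / recall - tp
    fp = tp / precision - tp
    return round(fn), round(fp)


def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)


def tp_gain(old_tp: int, new_tp: int) -> int:
    """Additional true positives under the newer benchmark."""
    return new_tp - old_tp


# TP counts copied from the slide as (v3.3.2, v4 draft) pairs.
# The comparison labels are hypothetical placeholders for the two references.
TP: Dict[str, Dict[str, Tuple[int, int]]] = {
    "comparison_1": {"SNV": (3_042_089, 3_314_633), "indel": (401_306, 444_945)},
    "comparison_2": {"SNV": (3_022_502, 3_306_764), "indel": (398_726, 444_342)},
}

if __name__ == "__main__":
    for name, kinds in TP.items():
        for kind, (old, new) in kinds.items():
            print(f"{name} {kind}: +{tp_gain(old, new):,} TP")
    # Approximate FN/FP for one v4 draft SNV row (99.7% recall, 99.8% precision)
    print(implied_errors(3_314_633, 0.997, 0.998))
    print(round(f1_score(0.997, 0.998), 4))
```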

Slide 12

Slide 12 text

MANUAL CURATION OF FP AND FN
General themes:
- GATK misses or makes incorrect indel calls in homopolymer stretches
- GATK false positives due to mis-mapped LINE elements and segmental duplications
- GATK false negatives due to low coverage depth or mapping quality
[Chart: Manually Curated Discordant Variants. Stacked bars (0% to 100%) for Putative FN and Putative FP; categories: Benchmark Correct, GATK Callset Correct, Unsure; data labels: 15, 19, 2, 3, 1]
Opportunities to improve variant calling:
- incorrect indel calls in homopolymer stretches (FP + FN)
- mis-mapped LINE elements and segmental duplications (FP)
- low mapping quality (FN)
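Since homopolymer stretches come up as a recurring source of indel errors, one simple way to triage discordant indels is to check whether they fall inside a homopolymer run in the reference. The sketch below is a minimal illustration, not the curation method used in this work; the function names and the default minimum run length of 5 bp are assumptions.

```python
import re
from typing import List, Tuple


def homopolymer_runs(seq: str, min_len: int = 5) -> List[Tuple[int, int, str]]:
    """Return (start, end, base) for single-base runs of at least min_len.

    Coordinates are 0-based with an exclusive end. min_len must be >= 1.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # One base followed by at least (min_len - 1) copies of the same base
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq.upper())]


def in_homopolymer(pos: int, runs: List[Tuple[int, int, str]]) -> bool:
    """True if the 0-based position lies inside any of the given runs."""
    return any(start <= pos < end for start, end, _ in runs)


if __name__ == "__main__":
    reference = "ACG" + "T" * 6 + "AC" + "A" * 5 + "G"
    runs = homopolymer_runs(reference)
    print(runs)
    print(in_homopolymer(4, runs), in_homopolymer(1, runs))
```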

Slide 13

Slide 13 text

FN IN CALLSET - UNSURE ABOUT BENCHMARK
Benchmark: homozygous T➔A
[Figure: alignment tracks for Illumina, PacBio HiFi, ONT, and 10X; genotype labels: A/A, A/TA, T/A, A/TA]

Slide 14

Slide 14 text

FP IN CALLSET - UNSURE ABOUT BENCHMARK
Benchmark: no call
[Figure: alignment tracks for Illumina, PacBio HiFi, ONT, and 10X; track labels: no coverage, C/T, C/T, C/T (odd allele frequency)]

Slide 15

Slide 15 text

FP IN CALLSET - UNSURE ABOUT BENCHMARK (CONT’D)
[Figure: alignment tracks for Illumina, PacBio HiFi, ONT, and 10X]

Slide 16

Slide 16 text

FP + FN IN CALLSET - BENCHMARK INCORRECT FOR STR CONTRACTION
Benchmark: GGAG⨯9 deletion
[Figure: alignment tracks for Illumina, PacBio HiFi, ONT, and 10X; track labels: low coverage, GGAG⨯2 deletion, ~GGAG⨯2 deletion, GGAG⨯2 deletion]

Slide 17

Slide 17 text

CONCLUSIONS
- v4 draft benchmark satisfies GIAB goal for GATK calls on HiFi reads:
  - 75% of putative FN and 95% of putative FP are clearly errors in the GATK callset
- Suggestions for improving the benchmark:
  - Exclude regions with SNV disagreements between long/linked read datasets or odd SNV frequencies (2:1, 3:1) in long/linked read datasets
  - Require support from long reads for indels in repetitive regions with low short read coverage
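The 75% and 95% figures can be reproduced from the manual curation counts under one reading of the chart's data labels. This is an assumption, since the labels are not tied to specific bars in the slide text: 15 of 20 curated putative FN and 19 of 20 curated putative FP were judged "Benchmark Correct", with the remaining sites (2 + 3 and 1) falling in the other two categories. The dictionary layout and function name below are hypothetical.

```python
from typing import Dict

# Assumed split of the chart's data labels (15, 19, 2, 3, 1) across the two bars.
# "other" combines the "GATK Callset Correct" and "Unsure" categories.
CURATED: Dict[str, Dict[str, int]] = {
    "Putative FN": {"benchmark_correct": 15, "other": 2 + 3},
    "Putative FP": {"benchmark_correct": 19, "other": 1},
}


def fraction_benchmark_correct(counts: Dict[str, int]) -> float:
    """Share of curated sites where the benchmark was judged correct."""
    total = counts["benchmark_correct"] + counts["other"]
    return counts["benchmark_correct"] / total


if __name__ == "__main__":
    for label, counts in CURATED.items():
        print(f"{label}: {fraction_benchmark_correct(counts):.0%} benchmark correct")
```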

Slide 18

Slide 18 text

Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. FEMTO Pulse and Fragment Analyzer are trademarks of Agilent Technologies Inc. All other trademarks are the sole property of their respective owners.
www.pacb.com
Poster 1866/W Booth 1020