AGENDA
- PacBio sequencing modes: continuous long reads (CLR) vs. HiFi
- HiFi datasets available through GIAB
- Detecting variants in HiFi reads with GATK HaplotypeCaller
- Evaluation of the v4 draft benchmark
TWO MODES OF PACBIO SMRT SEQUENCING
Continuous Long Read Sequencing (CLR)
[Figure labels: consensus sequence; Long Read 1 ... Long Read n]
Long reads: >20 kb, 90% accuracy
HIFI READS MAP THROUGH DIFFICULT REGIONS
[Figure: short-read vs. PacBio HiFi alignments across STRC]
STRC is a congenital deafness gene that requires long reads to cover all exons.
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37, 1155–1162 (2019).
PACBIO HIFI DATASETS FOR GIAB SAMPLES
Each dataset sequenced to approximately 30-fold coverage

Sample  Insert length  Platform          Reads (SRA)             Alignments
HG002   10 kb          Sequel System     https://bit.ly/2OCLeA2  https://bit.ly/2OCLeA2
HG002   15 kb          Sequel System     PRJNA520771             https://bit.ly/2p1ISA8
HG002   11 kb          Sequel II System  PRJNA527278             https://bit.ly/2VqdJm1
HG001   11 kb          Sequel II System  PRJNA540705             https://bit.ly/2AWtVSM
HG005   11 kb          Sequel II System  PRJNA540706             https://bit.ly/2ogGbuI
DETECTING VARIANTS IN HIFI READS WITH GATK HAPLOTYPECALLER
Workflow: HiFi reads -> pbmm2 (SMRT Link Mapping) -> HaplotypeCaller -> VariantFiltration (GATK4) -> variant calls (VCF)
References:
DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491–498 (2011).
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37, 1155–1162 (2019).
- High SNP recall and precision
- Lower indel recall and precision, due to 1 bp indel errors
- HaplotypeCaller is optimized for the error mode of short reads
  [Figure labels: Indel, Mismatch; 96.6% PacBio HiFi; 99.1% Short reads]
- We recommend using a caller that can adapt to the error mode of long reads, such as DeepVariant (see Pi-Chuan Chang’s lightning talk)
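The workflow on this slide (pbmm2 mapping, then GATK4 HaplotypeCaller and VariantFiltration) can be sketched as three command lines. This is a minimal illustration, not the exact configuration used for the results shown: the file names, the `hifi_gatk_commands` helper, and the `QD < 2.0` filter are assumptions, and prerequisites such as the reference index and sequence dictionary are omitted.

```python
import shlex


def hifi_gatk_commands(ref, reads, prefix):
    """Return argv lists for mapping, calling, and filtering (hypothetical helper)."""
    aligned = f"{prefix}.aligned.bam"
    raw_vcf = f"{prefix}.raw.vcf.gz"
    filtered_vcf = f"{prefix}.filtered.vcf.gz"
    return [
        # 1. Map HiFi reads with pbmm2 using its CCS preset; write a sorted BAM.
        ["pbmm2", "align", ref, reads, aligned, "--preset", "CCS", "--sort"],
        # 2. Call variants with GATK4 HaplotypeCaller (default settings here).
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", aligned, "-O", raw_vcf],
        # 3. Mark low-confidence calls. The filter name and QD threshold are
        #    assumptions chosen for illustration only.
        ["gatk", "VariantFiltration", "-R", ref, "-V", raw_vcf,
         "-O", filtered_vcf,
         "--filter-name", "LowQD", "--filter-expression", "QD < 2.0"],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # without pbmm2 or GATK installed.
    for cmd in hifi_gatk_commands("ref.fa", "hifi_reads.bam", "HG002"):
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools and input files are in place.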
MANUAL CURATION OF FP AND FN
General themes:
- GATK misses or makes incorrect indel calls in homopolymer stretches
- GATK false positives due to mis-mapped LINE elements and segmental duplications
- GATK false negatives due to low coverage depth or mapping quality
[Figure: "Manually Curated Discordant Variants", stacked bars (0% to 100%) for Putative FN and Putative FP; categories: Benchmark Correct, GATK Callset Correct, Unsure; data labels: 15, 19, 2, 3, 1]
Opportunities to improve variant calling:
- Incorrect indel calls in homopolymer stretches (FP + FN)
- Mis-mapped LINE elements and segmental duplications (FP)
- Low mapping quality (FN)
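Since homopolymer stretches are the main source of indel errors noted above, one way to triage discordant indels is to check whether they fall inside a homopolymer run in the reference. The sketch below is an illustration only: the `min_len=5` threshold and both function names are assumptions, not part of the curation procedure described here.

```python
def homopolymer_runs(seq, min_len=5):
    """Return (start, end, base) for each run of a single base of at least
    min_len bases. Coordinates are 0-based with an exclusive end."""
    seq = seq.upper()
    runs = []
    start = 0
    for i in range(1, len(seq) + 1):
        # A run ends at the end of the sequence or when the base changes.
        if i == len(seq) or seq[i] != seq[start]:
            if i - start >= min_len:
                runs.append((start, i, seq[start]))
            start = i
    return runs


def in_homopolymer(pos, runs):
    """True if 0-based position pos lies inside any of the given runs."""
    return any(start <= pos < end for start, end, _ in runs)


runs = homopolymer_runs("ACGTTTTTTAC")
print(runs)                     # [(3, 9, 'T')]
print(in_homopolymer(5, runs))  # True
print(in_homopolymer(1, runs))  # False
```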
CONCLUSIONS
- The v4 draft benchmark satisfies the GIAB goal for GATK calls on HiFi reads:
  - 75% of putative FN and 95% of putative FP are clearly errors in the GATK callset
- Suggestions for improving the benchmark:
  - Exclude regions with SNV disagreements between long/linked read datasets or odd SNV frequencies (2:1, 3:1) in long/linked read datasets
  - Require support from long reads for indels in repetitive regions with low short read coverage
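The 75% and 95% figures can be checked against the data labels on the manual curation chart (15, 19, 2, 3, 1). The grouping used below is an assumption, since the slides do not state which label belongs to which bar: 15, 2, and 3 are taken as the putative FN bar and 19 and 1 as the putative FP bar, with the first number in each group being the sites where the benchmark was judged correct.

```python
def fraction_benchmark_correct(benchmark_correct, other):
    """Share of curated sites where the benchmark, not the callset, was right."""
    return benchmark_correct / (benchmark_correct + other)


# Assumed grouping of the chart's data labels (not stated in the slides).
putative_fn = fraction_benchmark_correct(15, 2 + 3)
putative_fp = fraction_benchmark_correct(19, 1)

print(f"putative FN: {putative_fn:.0%}")  # putative FN: 75%
print(f"putative FP: {putative_fp:.0%}")  # putative FP: 95%
```

Under this grouping each category has 20 curated sites, which reproduces both percentages.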