Evaluation of HG002 v4 draft benchmark against GATK calls on PacBio HiFi reads

Lightning talk.

Summary of the manual curation of the new benchmark.

William Rowell

October 15, 2019

Transcript

  1. Evaluation of HG002 v4 draft benchmark against GATK calls on PacBio HiFi reads

    William Rowell, Sr Bioinformatics Scientist, 2019-10-15, @nothingclever

    For Research Use Only. Not for use in diagnostic procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved.
  2. AGENDA

    - PacBio Sequencing Modes: Long reads (CLR) vs HiFi
    - HiFi datasets available through GIAB
    - Detecting variants in HiFi reads with GATK HaplotypeCaller
    - Evaluation of v4 draft benchmark
  3. TWO MODES OF PACBIO SMRT SEQUENCING

    Continuous Long Read Sequencing (CLR): long reads >20 kb, 90% accuracy. Figure labels: Long Read 1 ... Long Read n; consensus sequence.
  4. TWO MODES OF PACBIO SMRT SEQUENCING

    Continuous Long Read Sequencing (CLR): long reads >20 kb, 90% accuracy. Figure labels: Long Read 1 ... Long Read n; consensus sequence.

    Circular Consensus Sequencing (CCS): HiFi reads ≤20 kb, >99% accuracy. Figure labels: Subread 1 ... Subread n; HiFi read.
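The two accuracy figures on this slide are easier to compare on the Phred scale, Q = -10·log10(1 - accuracy). The sketch below applies that standard formula; it assumes the slide's percentages are per-base accuracies and treats ">99%" as its lower bound of exactly 99%, neither of which the slide states.

```python
import math


def phred(accuracy: float) -> float:
    """Convert an accuracy in [0, 1) to a Phred-scaled quality value."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


# Accuracies as quoted on the slide; 0.99 is a lower bound for HiFi reads.
for label, accuracy in [("CLR long reads", 0.90), ("HiFi reads (lower bound)", 0.99)]:
    print(f"{label}: {accuracy:.0%} accuracy ~ Q{phred(accuracy):.0f}")
```

Each additional 10 Phred units corresponds to a tenfold drop in error rate, so the step from 90% to 99% is one order of magnitude fewer errors.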
  5. HIFI READS MAP THROUGH DIFFICULT REGIONS

    Figure tracks at STRC: Short reads; PacBio HiFi. STRC is a congenital deafness gene that requires long reads to cover all exons.

    Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37, 1155–1162 (2019).
  6. PACBIO HIFI DATASETS FOR GIAB SAMPLES

    Each dataset sequenced to approximately 30-fold coverage.

    | Sample | Insert length | Platform | Reads (SRA) | Alignments |
    |--------|---------------|----------|-------------|------------|
    | HG002 | 10 kb | Sequel System | https://bit.ly/2OCLeA2 | https://bit.ly/2OCLeA2 |
    | HG002 | 15 kb | Sequel System | PRJNA520771 | https://bit.ly/2p1ISA8 |
    | HG002 | 11 kb | Sequel II System | PRJNA527278 | https://bit.ly/2VqdJm1 |
    | HG001 | 11 kb | Sequel II System | PRJNA540705 | https://bit.ly/2AWtVSM |
    | HG005 | 11 kb | Sequel II System | PRJNA540706 | https://bit.ly/2ogGbuI |
  7. DETECTING VARIANTS IN HIFI READS WITH GATK HAPLOTYPECALLER

    Workflow: HiFi reads → pbmm2 (SMRT Link Mapping) → HaplotypeCaller → VariantFiltration (GATK4) → variant calls (vcf)

    DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491–498 (2011).

    Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37, 1155–1162 (2019).
  8. DETECTING VARIANTS IN HIFI READS WITH GATK HAPLOTYPECALLER

    Workflow: HiFi reads → pbmm2 (SMRT Link Mapping) → HaplotypeCaller → VariantFiltration (GATK4) → variant calls (vcf)

    - High SNP Recall and Precision
    - Lower Indel Recall and Precision, due to 1 bp indel errors

    DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491–498 (2011).

    Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37, 1155–1162 (2019).
  9. DETECTING VARIANTS IN HIFI READS WITH GATK HAPLOTYPECALLER

    Workflow: HiFi reads → pbmm2 (SMRT Link Mapping) → HaplotypeCaller → VariantFiltration (GATK4) → variant calls (vcf)

    - High SNP Recall and Precision
    - Lower Indel Recall and Precision, due to 1 bp indel errors
    - HaplotypeCaller optimized for error mode of short reads

    Figure (error types; legend: Indel, Mismatch): 96.6% PacBio HiFi; 99.1% Short reads.

    DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491–498 (2011).

    Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37, 1155–1162 (2019).
  10. DETECTING VARIANTS IN HIFI READS WITH GATK HAPLOTYPECALLER

    Workflow: HiFi reads → pbmm2 (SMRT Link Mapping) → HaplotypeCaller → VariantFiltration (GATK4) → variant calls (vcf)

    - High SNP Recall and Precision
    - Lower Indel Recall and Precision, due to 1 bp indel errors
    - HaplotypeCaller optimized for error mode of short reads
    - We recommend using a caller that can adapt to the error mode of long reads, such as DeepVariant (see Pi-Chuan Chang’s lightning talk)

    DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491–498 (2011).

    Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37, 1155–1162 (2019).
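The workflow on these slides (pbmm2 mapping, then GATK4 HaplotypeCaller and VariantFiltration) could be driven roughly as sketched below. This is an illustration only: the slides do not give the actual command lines, so the file names, the `CCS` preset choice, and the single `QD < 2.0` filter are assumptions, not the speaker's settings.

```python
import shlex


def build_commands(reference, hifi_reads, prefix):
    """Return the three workflow steps as argument lists (nothing is executed)."""
    # Hypothetical output names derived from a sample prefix.
    aligned_bam = f"{prefix}.aligned.bam"
    raw_vcf = f"{prefix}.raw.vcf.gz"
    filtered_vcf = f"{prefix}.filtered.vcf.gz"
    return [
        # Map HiFi reads; the preset and --sort are assumed, not taken from the talk.
        ["pbmm2", "align", reference, hifi_reads, aligned_bam,
         "--preset", "CCS", "--sort"],
        # Call variants with GATK4 HaplotypeCaller using default settings.
        ["gatk", "HaplotypeCaller", "-R", reference, "-I", aligned_bam,
         "-O", raw_vcf],
        # Apply one hard filter; the expression and name are placeholders.
        ["gatk", "VariantFiltration", "-R", reference, "-V", raw_vcf,
         "-O", filtered_vcf,
         "--filter-name", "LowQD", "--filter-expression", "QD < 2.0"],
    ]


for command in build_commands("reference.fasta", "hifi_reads.bam", "HG002"):
    print(shlex.join(command))
```

Building the commands as lists keeps quoting explicit and lets each step be passed to `subprocess.run` unchanged. GATK additionally expects the reference to have an index and a sequence dictionary.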
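  11. V4 DRAFT BENCHMARK INCREASES PRECISION AND TRUE POSITIVE VARIANTS

    Reference genomes shown: GRCh38, hs37d5. Benchmarks compared: v3.3.2, v4 draft.

    | Benchmark | Type | Recall | Precision | TP |
    |-----------|------|--------|-----------|----|
    | v4 draft | SNVs | 99.7% | 99.8% | 3,314,633 |
    | v4 draft | Indels | 86.1% | 92.8% | 444,945 |
    | v3.3.2 | SNVs | 99.8% | 99.6% | 3,042,089 |
    | v3.3.2 | Indels | 86.3% | 92.4% | 401,306 |

    + ~272k TP SNPs, + ~44k TP INDELs

    | Benchmark | Type | Recall | Precision | TP |
    |-----------|------|--------|-----------|----|
    | v4 draft | SNVs | 99.7% | 99.7% | 3,306,764 |
    | v4 draft | Indels | 85.9% | 92.7% | 444,342 |
    | v3.3.2 | SNVs | 99.8% | 99.5% | 3,022,502 |
    | v3.3.2 | Indels | 83.8% | 92.5% | 398,726 |

    + ~284k TP SNPs, + ~46k TP INDELs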
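The "+ ~272k" and "+ ~284k" figures on this slide can be recovered by subtracting true-positive counts between pairs of its four result tables. The sketch below does that arithmetic; the pairing and the version labels are inferred from which differences reproduce the quoted gains, and the neutral names `pair_1` and `pair_2` are placeholders because the transcript does not make clear which reference genome each pair belongs to.

```python
# True-positive counts copied from the slide; grouping into pairs is an inference.
tp_counts = {
    "pair_1": {
        "v4 draft": {"SNVs": 3_314_633, "Indels": 444_945},
        "v3.3.2": {"SNVs": 3_042_089, "Indels": 401_306},
    },
    "pair_2": {
        "v4 draft": {"SNVs": 3_306_764, "Indels": 444_342},
        "v3.3.2": {"SNVs": 3_022_502, "Indels": 398_726},
    },
}


def tp_gain(pair):
    """Return the v4 draft minus v3.3.2 true-positive difference per variant type."""
    new, old = pair["v4 draft"], pair["v3.3.2"]
    return {variant_type: new[variant_type] - old[variant_type] for variant_type in new}


for name, pair in tp_counts.items():
    gains = tp_gain(pair)
    print(f"{name}: +{gains['SNVs']:,} TP SNVs, +{gains['Indels']:,} TP indels")
```

The differences round to the gains quoted on the slide, which is what supports reading the larger counts as the v4 draft results.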
  12. MANUAL CURATION OF FP AND FN

    General themes:

    - GATK misses or makes incorrect indel calls in homopolymer stretches
    - GATK false positives due to mis-mapped LINE elements and segmental duplications
    - GATK false negatives due to low coverage depth or mapping quality

    Chart: Manually Curated Discordant Variants, 0% to 100%, for Putative FN and Putative FP. Categories: Benchmark Correct, GATK Callset Correct, Unsure. Data labels: 15, 19, 2, 3, 1.

    Opportunities to improve variant calling:

    - incorrect indel calls in homopolymer stretches (FP + FN)
    - mis-mapped LINE elements and segmental duplications (FP)
    - low mapping quality (FN)
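The chart's data labels can be checked against the talk's conclusion that 75% of putative FN and 95% of putative FP are clearly errors in the GATK callset. The sketch below assumes that 15 and 19 are the "Benchmark Correct" counts for putative FN and putative FP, and that 20 sites were curated in each group (15 + 2 + 3 and 19 + 1); the transcript states neither assumption explicitly.

```python
ASSUMED_SITES_PER_GROUP = 20  # inferred from the chart labels, not stated in the talk


def fraction_benchmark_correct(benchmark_correct: int, total_curated: int) -> float:
    """Share of curated discordant sites where the benchmark was judged correct."""
    if total_curated <= 0:
        raise ValueError("total_curated must be positive")
    return benchmark_correct / total_curated


putative_fn = fraction_benchmark_correct(15, ASSUMED_SITES_PER_GROUP)
putative_fp = fraction_benchmark_correct(19, ASSUMED_SITES_PER_GROUP)
print(f"Putative FN, benchmark correct: {putative_fn:.0%}")
print(f"Putative FP, benchmark correct: {putative_fp:.0%}")
```

Under those assumptions the fractions match the percentages quoted in the conclusions.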
  13. FN IN CALLSET - UNSURE ABOUT BENCHMARK

    Benchmark: homozygous T➔A. Track labels: Illumina, PacBio HiFi, ONT, 10X. Genotype annotations: A/A, A/TA, T/A, A/TA.
  14. FP IN CALLSET - UNSURE ABOUT BENCHMARK

    Benchmark: no call. Track labels: Illumina, PacBio HiFi, ONT, 10X. Annotations: no coverage; C/T; C/T; C/T (odd allele frequency).
  15. FP IN CALLSET - UNSURE ABOUT BENCHMARK (CONT’D)

    Track labels: Illumina, PacBio HiFi, ONT, 10X.
  16. FP + FN IN CALLSET - BENCHMARK INCORRECT FOR STR CONTRACTION

    Benchmark: GGAG⨯9 deletion. Track labels: Illumina, PacBio HiFi, ONT, 10X. Annotations: low coverage; GGAG⨯2 deletion; ~GGAG⨯2 deletion; GGAG⨯2 deletion.
  17. CONCLUSIONS

    - v4 draft benchmark satisfies GIAB goal for GATK calls on HiFi reads:
      - 75% of putative FN and 95% of putative FP are clearly errors in the GATK callset
    - Suggestions for improving the benchmark:
      - Exclude regions with SNV disagreements between long/linked read datasets or odd SNV frequencies (2:1, 3:1) in long/linked read datasets
      - Require support from long reads for indels in repetitive regions with low short read coverage
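The suggestion to exclude sites with odd SNV frequencies (2:1, 3:1) could be prototyped as a simple allele-balance check on read counts. The sketch below is one hypothetical way to do it: the function name, the 0.1 tolerance, and the restriction to sites expected to be heterozygous are all assumptions, not part of the talk.

```python
def allele_balance_flag(ref_reads: int, alt_reads: int, tolerance: float = 0.1) -> str:
    """Classify a putative heterozygous SNV site by its alternate-allele fraction.

    A clean heterozygous site is expected near a 1:1 ratio (fraction 0.5).
    The tolerance is an illustrative threshold, not a recommended value.
    """
    total = ref_reads + alt_reads
    if total == 0:
        return "no coverage"
    alt_fraction = alt_reads / total
    if abs(alt_fraction - 0.5) <= tolerance:
        return "balanced"
    return "odd"


# Ratios written as ref:alt read counts.
for ref_reads, alt_reads in [(15, 15), (20, 10), (30, 10), (0, 0)]:
    print(f"{ref_reads}:{alt_reads} -> {allele_balance_flag(ref_reads, alt_reads)}")
```

With this tolerance, 2:1 and 3:1 ratios are both flagged while a 1:1 ratio passes. A real filter would also need to account for coverage depth, since small read counts produce skewed ratios by chance.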
  18. For Research Use Only. Not for use in diagnostic procedures.

    © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. FEMTO Pulse and Fragment Analyzer are trademarks of Agilent Technologies Inc. All other trademarks are the sole property of their respective owners. www.pacb.com Poster 1866/W Booth 1020