Evaluation of HG002 v4 draft benchmark against GATK calls on PacBio HiFi reads

William Rowell
October 15, 2019

Lightning talk.

Summary of the manual curation of the new benchmark.

Transcript

  1. Evaluation of HG002 v4 draft benchmark against GATK calls on PacBio HiFi reads
    William Rowell, Sr Bioinformatics Scientist, 2019-10-15
    @nothingclever
    For Research Use Only. Not for use in diagnostic procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved.

  2. AGENDA
    -PacBio Sequencing Modes: Long reads (CLR) vs HiFi
    -HiFi datasets available through GIAB
    -Detecting variants in HiFi reads with GATK HaplotypeCaller
    -Evaluation of v4 draft benchmark

  3. TWO MODES OF PACBIO SMRT SEQUENCING
    Continuous Long Read Sequencing (CLR)
    [Figure: long reads 1 through n combined into a consensus sequence]
    Long reads >20 kb, 90% accuracy

  4. TWO MODES OF PACBIO SMRT SEQUENCING
    Continuous Long Read Sequencing (CLR)
    [Figure: long reads 1 through n combined into a consensus sequence]
    Long reads >20 kb, 90% accuracy
    Circular Consensus Sequencing (CCS)
    [Figure: subreads 1 through n combined into a single HiFi read]
    HiFi reads ≤20 kb, >99% accuracy
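As a rough illustration of why a consensus built from several subreads can be more accurate than any single subread, the sketch below takes a per-position majority vote. It is a toy under stated assumptions: the subreads are taken to be already aligned and of equal length, and `majority_consensus` is a hypothetical helper, not the algorithm used by PacBio's CCS software.

```python
from collections import Counter


def majority_consensus(subreads):
    """Return the per-column majority base of pre-aligned, equal-length subreads.

    Toy model only: real circular consensus calling works on unaligned
    subreads and uses a statistical model rather than a simple vote.
    """
    if not subreads:
        raise ValueError("need at least one subread")
    length = len(subreads[0])
    if any(len(s) != length for s in subreads):
        raise ValueError("this toy example assumes equal-length, pre-aligned subreads")
    # zip(*subreads) walks the subreads column by column.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*subreads))


# Hypothetical subreads of the same molecule, each with at most one error.
subreads = ["ACGTAC", "ACGTTC", "ACCTAC", "ACGTAC", "AGGTAC"]
consensus = majority_consensus(subreads)
print(consensus)
```

Each error appears in only one subread, so the vote at every column recovers the shared base.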

  5. HIFI READS MAP THROUGH DIFFICULT REGIONS
    [Figure: read alignments at STRC, comparing short reads with PacBio HiFi reads]
    STRC is a congenital deafness gene that requires long reads to cover all exons.
    Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37, 1155–1162 (2019).

  6. PACBIO HIFI DATASETS FOR GIAB SAMPLES
    Each dataset sequenced to approximately 30-fold coverage
    Sample  Insert length  Platform          Reads (SRA)             Alignments
    HG002   10 kb          Sequel System     https://bit.ly/2OCLeA2  https://bit.ly/2OCLeA2
    HG002   15 kb          Sequel System     PRJNA520771             https://bit.ly/2p1ISA8
    HG002   11 kb          Sequel II System  PRJNA527278             https://bit.ly/2VqdJm1
    HG001   11 kb          Sequel II System  PRJNA540705             https://bit.ly/2AWtVSM
    HG005   11 kb          Sequel II System  PRJNA540706             https://bit.ly/2ogGbuI

  7. DETECTING VARIANTS IN HIFI READS WITH GATK HAPLOTYPECALLER
    [Workflow: HiFi reads → pbmm2 (mapping, SMRT Link) → HaplotypeCaller → VariantFiltration (GATK4) → variant calls (vcf)]
    DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491–498 (2011).
    Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37, 1155–1162 (2019).

  8. DETECTING VARIANTS IN HIFI READS WITH GATK HAPLOTYPECALLER
    [Workflow: HiFi reads → pbmm2 (mapping, SMRT Link) → HaplotypeCaller → VariantFiltration (GATK4) → variant calls (vcf)]
    -High SNP Recall and Precision
    -Lower Indel Recall and Precision, due to 1bp indel errors

  9. DETECTING VARIANTS IN HIFI READS WITH GATK HAPLOTYPECALLER
    [Workflow: HiFi reads → pbmm2 (mapping, SMRT Link) → HaplotypeCaller → VariantFiltration (GATK4) → variant calls (vcf)]
    -High SNP Recall and Precision
    -Lower Indel Recall and Precision, due to 1bp indel errors
    -HaplotypeCaller optimized for error mode of short reads
    [Figure: read error composition (Indel vs. Mismatch) for PacBio HiFi and short reads; data labels: 96.6% (PacBio HiFi), 99.1% (Short reads)]

  10. DETECTING VARIANTS IN HIFI READS WITH GATK HAPLOTYPECALLER
    [Workflow: HiFi reads → pbmm2 (mapping, SMRT Link) → HaplotypeCaller → VariantFiltration (GATK4) → variant calls (vcf)]
    -High SNP Recall and Precision
    -Lower Indel Recall and Precision, due to 1bp indel errors
    -HaplotypeCaller optimized for error mode of short reads
    -We recommend using a caller that can adapt to the error mode of long reads, such as DeepVariant (see Pi-Chuan Chang’s lightning talk)
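The mapping and calling workflow named on these slides (pbmm2, then GATK4 HaplotypeCaller, then VariantFiltration) could be assembled roughly as below. This is a minimal sketch, not the command lines used for the talk: the file names, the sample name, and the `LowQD` filter with its `QD < 2.0` threshold are placeholders chosen for illustration, and `hifi_gatk_commands` is a hypothetical helper. The commands are only built and printed, not executed.

```python
import shlex


def hifi_gatk_commands(reference, hifi_reads, prefix, sample="HG002"):
    """Build (but do not run) one command per workflow step.

    Assumes the reference FASTA already has the .fai index and sequence
    dictionary that GATK expects. All file names are placeholders.
    """
    aligned = f"{prefix}.aligned.bam"
    raw_vcf = f"{prefix}.raw.vcf.gz"
    filtered_vcf = f"{prefix}.filtered.vcf.gz"
    return [
        # Map HiFi reads with the CCS preset and write a sorted BAM.
        ["pbmm2", "align", reference, hifi_reads, aligned,
         "--preset", "CCS", "--sort", "--sample", sample],
        # Call variants on the alignments.
        ["gatk", "HaplotypeCaller", "-R", reference, "-I", aligned, "-O", raw_vcf],
        # Placeholder hard filter; the filters used in the talk are not stated.
        ["gatk", "VariantFiltration", "-R", reference, "-V", raw_vcf, "-O", filtered_vcf,
         "--filter-name", "LowQD", "--filter-expression", "QD < 2.0"],
    ]


commands = hifi_gatk_commands("reference.fasta", "hifi_reads.bam", "HG002")
for command in commands:
    print(shlex.join(command))
```

Keeping each step as an argument list makes it easy to hand the same commands to `subprocess.run` once the placeholder paths are replaced.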

  11. V4 DRAFT BENCHMARK INCREASES PRECISION AND TRUE POSITIVE VARIANTS
    Reference genomes shown on the slide: GRCh38, hs37d5

    Benchmark  Type    Recall  Precision  TP
    v3.3.2     SNVs    99.8%   99.6%      3,042,089
    v3.3.2     Indels  86.3%   92.4%      401,306
    v4 draft   SNVs    99.7%   99.8%      3,314,633
    v4 draft   Indels  86.1%   92.8%      444,945
    v4 draft vs. v3.3.2: + ~272k TP SNPs, + ~44k TP INDELs

    Benchmark  Type    Recall  Precision  TP
    v3.3.2     SNVs    99.8%   99.5%      3,022,502
    v3.3.2     Indels  83.8%   92.5%      398,726
    v4 draft   SNVs    99.7%   99.7%      3,306,764
    v4 draft   Indels  85.9%   92.7%      444,342
    v4 draft vs. v3.3.2: + ~284k TP SNPs, + ~46k TP INDELs
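The "+ ~272k" and "+ ~284k" annotations on this slide can be checked by subtracting true-positive counts. The sketch below uses the TP values from the slide; pairing each v4 draft table with a v3.3.2 table is an inference from those annotations, and the labels `comparison_1` and `comparison_2` are hypothetical names, since the transcript does not show which pair belongs to which reference genome.

```python
# TP counts copied from the slide. The pairing of tables is inferred from
# the "+ ~272k / + ~44k" and "+ ~284k / + ~46k" annotations.
tp_counts = {
    "comparison_1": {
        "v3.3.2": {"SNV": 3_042_089, "indel": 401_306},
        "v4 draft": {"SNV": 3_314_633, "indel": 444_945},
    },
    "comparison_2": {
        "v3.3.2": {"SNV": 3_022_502, "indel": 398_726},
        "v4 draft": {"SNV": 3_306_764, "indel": 444_342},
    },
}


def tp_gain(tables, variant_type):
    """True positives gained when scoring against v4 draft instead of v3.3.2."""
    return tables["v4 draft"][variant_type] - tables["v3.3.2"][variant_type]


gains = {
    name: {variant_type: tp_gain(tables, variant_type) for variant_type in ("SNV", "indel")}
    for name, tables in tp_counts.items()
}
for name, gain in gains.items():
    print(f"{name}: +{gain['SNV']:,} TP SNVs, +{gain['indel']:,} TP indels")
```

Each difference lands within 1,000 of the rounded figure on the slide.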

  12. MANUAL CURATION OF FP AND FN
    General themes:
    -GATK misses or makes incorrect indel calls in homopolymer stretches
    -GATK false positives due to mis-mapped LINE elements and segmental duplications
    -GATK false negatives due to low coverage depth or mapping quality
    [Figure: stacked bar chart "Manually Curated Discordant Variants" (0% to 100%), with bars for Putative FN and Putative FP split into Benchmark Correct, GATK Callset Correct, and Unsure; data labels: 15, 19, 2, 3, 1]
    Opportunities to improve variant calling:
    -incorrect indel calls in homopolymer stretches (FP + FN)
    -mis-mapped LINE elements and segmental duplications (FP)
    -low mapping quality (FN)
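The chart's data labels (15, 19, 2, 3, 1) can be related to the percentages quoted on the conclusions slide (75% of putative FN and 95% of putative FP). The grouping below is an assumption, not something the transcript states: it reads 15 and 19 as the "Benchmark Correct" counts for FN and FP and splits the remaining labels so that each bar totals 20. It is a sketch of one grouping that reproduces the quoted percentages.

```python
# ASSUMPTION: how the chart labels 15, 19, 2, 3, 1 map onto the two bars.
# "other" lumps together "GATK Callset Correct" and "Unsure", because the
# transcript does not show which remaining label belongs to which category.
putative_fn = {"benchmark_correct": 15, "other": 2 + 3}
putative_fp = {"benchmark_correct": 19, "other": 1}


def percent_benchmark_correct(counts):
    """Share of curated sites where the benchmark, not the callset, was right."""
    return 100 * counts["benchmark_correct"] / sum(counts.values())


fn_percent = percent_benchmark_correct(putative_fn)
fp_percent = percent_benchmark_correct(putative_fp)
print(f"putative FN: {fn_percent:.0f}% benchmark correct")
print(f"putative FP: {fp_percent:.0f}% benchmark correct")
```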

  13. FN IN CALLSET - UNSURE ABOUT BENCHMARK
    Benchmark - homozygous T➔A
    [Figure: read pileups for Illumina, PacBio HiFi, ONT, and 10X; track annotations, in slide order: A/A, A/TA, T/A, A/TA]

  14. FP IN CALLSET - UNSURE ABOUT BENCHMARK
    Benchmark - no call
    [Figure: read pileups for Illumina, PacBio HiFi, ONT, and 10X; track annotations, in slide order: no coverage, C/T, C/T, C/T (odd allele frequency)]

  15. FP IN CALLSET - UNSURE ABOUT BENCHMARK (CONT’D)
    [Figure: read pileups for Illumina, PacBio HiFi, ONT, and 10X]

  16. FP + FN IN CALLSET - BENCHMARK INCORRECT FOR STR CONTRACTION
    Benchmark - GGAG⨯9 deletion
    [Figure: read pileups for Illumina, PacBio HiFi, ONT, and 10X; track annotations, in slide order: low coverage, GGAG⨯2 deletion, ~GGAG⨯2 deletion, GGAG⨯2 deletion]

  17. CONCLUSIONS
    -v4 draft benchmark satisfies GIAB goal for GATK calls on HiFi reads:
      -75% of putative FN and 95% of putative FP are clearly errors in the GATK callset
    -Suggestions for improving the benchmark:
      -Exclude regions with SNV disagreements between long/linked read datasets or odd SNV frequencies (2:1, 3:1) in long/linked read datasets
      -Require support from long reads for indels in repetitive regions with low short read coverage
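The suggestion about odd SNV frequencies can be made concrete with a simple allele-fraction check. The sketch below flags a heterozygous site when the alternate-allele fraction is far from the expected 0.5, as it would be at 2:1 or 3:1 read ratios. The function names and the 0.1 tolerance are hypothetical choices for illustration; a production filter would more likely use a statistical test that accounts for read depth.

```python
def alt_allele_fraction(ref_reads, alt_reads):
    """Fraction of reads at a site that support the alternate allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / total


def has_odd_het_fraction(ref_reads, alt_reads, tolerance=0.1):
    """True if a heterozygous site deviates from the expected 1:1 ratio.

    The tolerance is an arbitrary illustrative threshold, not a value
    taken from the talk.
    """
    return abs(alt_allele_fraction(ref_reads, alt_reads) - 0.5) > tolerance


# Hypothetical read counts (ref, alt) at ratios of 1:1, 2:1, and 3:1.
examples = {"1:1": (15, 15), "2:1": (20, 10), "3:1": (30, 10)}
flags = {label: has_odd_het_fraction(ref, alt) for label, (ref, alt) in examples.items()}
print(flags)
```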

  18. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. FEMTO Pulse and Fragment Analyzer are trademarks of Agilent Technologies Inc. All other trademarks are the sole property of their respective owners.
    www.pacb.com
    Poster 1866/W
    Booth 1020