
Detecting and Phasing Small Variants with Highly Accurate Long Reads

William Rowell
January 16, 2019

We summarize the challenges of small variant detection with highly accurate (>=99%) long reads and present workflows built on existing tools (GATK) and new tools (DeepVariant with a model trained on CCS reads). Presented at the PacBio SMRT Informatics Workshop in San Diego.

Transcript

  1. For Research Use Only. Not for use in diagnostic procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved.
    Detecting and Phasing Small Variants with Highly Accurate Long Reads
    William Rowell, Senior Scientist, Bioinformatics Applications, PacBio
    SMRT Informatics Developers Conference, January 16, 2019 @nothingclever


  2. AGENDA
    -Differences between highly accurate long reads and short reads
    -Calling variants with existing tools
    -Training new tools on long reads
    -Making use of phase information to improve variant calls


  3. AGENDA
    -Differences between highly accurate long reads and short reads
    -Calling variants with existing tools
    -Training new tools on long reads
    -Making use of phase information to improve variant calls


  4. CCS READS HAVE A DIFFERENT ERROR PROFILE FROM SHORT READS


  5. CCS READS HAVE A DIFFERENT ERROR PROFILE FROM SHORT READS


  6. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS STARTS WITH A FAMILIAR WORKFLOW
    CCS Reads
    Align with minimap2
    Detect variants with GATK HaplotypeCaller
    Hard filter variants with GATK VariantFiltration
    Diploid variant calls


  7. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW
    CCS Reads
    Align with minimap2 (pbmm2)
        pbmm2 --preset CCS
    Detect variants with GATK HaplotypeCaller
    Hard filter variants with GATK VariantFiltration
    Diploid variant calls


  8. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW
    CCS Reads
    Align with minimap2 (pbmm2)
        pbmm2 --preset CCS
    Detect variants with GATK HaplotypeCaller
        --pcr-indel-model AGGRESSIVE
    Hard filter variants with GATK VariantFiltration
    Diploid variant calls
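
The first two steps of this workflow can be sketched as command lines. This is a minimal sketch, not the presenter's actual pipeline: only `--preset CCS` and `--pcr-indel-model AGGRESSIVE` come from the slide. The file names are hypothetical placeholders, and the `pbmm2 align` subcommand, `--sort`, and the GATK `-R`/`-I`/`-O` arguments are added here to make the commands complete. The code only builds the argument lists and prints them; it does not run anything.

```python
import shlex


def build_commands(ref, reads, prefix):
    """Return the align and call commands as argument lists (nothing is executed)."""
    aligned = f"{prefix}.aligned.bam"   # hypothetical output name
    raw_vcf = f"{prefix}.raw.vcf.gz"    # hypothetical output name
    # Alignment step: the CCS preset is the option shown on the slide.
    align = ["pbmm2", "align", ref, reads, aligned, "--preset", "CCS", "--sort"]
    # Calling step: --pcr-indel-model is a HaplotypeCaller argument, not a pbmm2 one.
    call = ["gatk", "HaplotypeCaller", "-R", ref, "-I", aligned, "-O", raw_vcf,
            "--pcr-indel-model", "AGGRESSIVE"]
    return [align, call]


for cmd in build_commands("ref.fasta", "sample.ccs.bam", "sample"):
    print(shlex.join(cmd))
```

Keeping the two flags on separate commands makes it explicit that the preset belongs to the aligner and the indel model belongs to the variant caller.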


  9. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW
    CCS Reads
    Align with minimap2 (pbmm2)
    Detect variants with GATK HaplotypeCaller
    Hard filter variants with GATK VariantFiltration
    Diploid variant calls
    Strand Bias tests ❌
    Mapping Quality tests ❌
    Read position tests ❌
    Variant Quality tests ✅


  10. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW
    CCS Reads
    Align with minimap2 (pbmm2)
    Detect variants with GATK HaplotypeCaller
    Hard filter variants with GATK VariantFiltration
    Diploid variant calls


  11. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW
    CCS Reads
    Align with minimap2 (pbmm2)
    Detect variants with GATK HaplotypeCaller
    Hard filter variants with GATK VariantFiltration
    Diploid variant calls


  12. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW
    CCS Reads
    Align with minimap2 (pbmm2)
        pbmm2 --preset CCS
    Detect variants with GATK HaplotypeCaller
        --pcr-indel-model AGGRESSIVE
    Hard filter variants with GATK VariantFiltration
        SNV → QD >= 2.0
        1 bp Indels → QD >= 5.0
        >1 bp Indels → QD >= 2.0
    Diploid variant calls
            Precision  Recall   F1
    SNVs    99.468%    99.559%  99.513%
    Indels  78.977%    81.248%  80.097%
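
The size-dependent QD thresholds on this slide can be sketched as a small filter function. This is an illustration under assumptions, not the presenter's filtering code: variants are classified only by the length difference between the reference and alternate alleles, equal-length alleles are treated as SNVs, and the function and dictionary names are hypothetical. QD is the quality-by-depth annotation that GATK attaches to each call.

```python
def variant_class(ref: str, alt: str) -> str:
    """Classify a call as 'snv', 'indel_1bp', or 'indel_long' by allele length difference."""
    size = abs(len(ref) - len(alt))
    if size == 0:
        return "snv"  # assumption: equal-length alleles are handled as SNVs
    return "indel_1bp" if size == 1 else "indel_long"


# Minimum QD required to keep a call, per variant class (values from the slide).
MIN_QD = {"snv": 2.0, "indel_1bp": 5.0, "indel_long": 2.0}


def passes_qd(ref: str, alt: str, qd: float) -> bool:
    """Return True if the call meets the QD threshold for its class."""
    return qd >= MIN_QD[variant_class(ref, alt)]


# The same QD value passes as an SNV or a longer indel but fails as a 1 bp indel.
for ref, alt in [("A", "G"), ("AT", "A"), ("ATTT", "A")]:
    print(ref, alt, variant_class(ref, alt), passes_qd(ref, alt, qd=3.1))
```

The stricter threshold for 1 bp indels reflects the conclusion later in the deck that GATK HaplotypeCaller has trouble separating biological indels from sequencing errors.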


  13. GOOGLE’S DEEPVARIANT CAN BE TRAINED ON NEW DATA TYPES
    CCS Reads (chr. 1-19) + GIAB Truth Set
    DeepVariant CNN training
    CCS model
    Nature Biotechnology volume 36, pages 983–987 (2018)


  14. GOOGLE’S DEEPVARIANT CAN BE TRAINED ON NEW DATA TYPES
    CCS Reads (chr. 1-19) + GIAB Truth Set
    DeepVariant CNN training
    CCS model
    Nature Biotechnology volume 36, pages 983–987 (2018)
    CCS Reads
    DeepVariant + CCS model
    Diploid variant calls


  15. GOOGLE’S DEEPVARIANT CAN BE TRAINED ON NEW DATA TYPES
    CCS Reads (chr. 1-19) + GIAB Truth Set
    DeepVariant CNN training
    CCS model
    Nature Biotechnology volume 36, pages 983–987 (2018)
    CCS Reads
    DeepVariant + CCS model
    Diploid variant calls
    autosomes
              Precision  Recall   F1
      SNVs    99.914%    99.959%  99.936%
      Indels  96.901%    95.980%  96.438%
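
The F1 column in these tables is the harmonic mean of precision and recall. The sketch below recomputes it from the autosome values on this slide as a consistency check. The function name is hypothetical, and because the inputs are already rounded to three decimals, the recomputed values can differ from the reported ones in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from the autosome table.
rows = {
    "SNVs": (99.914, 99.959, 99.936),
    "Indels": (96.901, 95.980, 96.438),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.3f}%, reported = {reported:.3f}%")
```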


  16. GOOGLE’S DEEPVARIANT CAN BE TRAINED ON NEW DATA TYPES
    CCS Reads (chr. 1-19) + GIAB Truth Set
    DeepVariant CNN training
    CCS model
    Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology 36, 983–987 (2018)
    CCS Reads
    DeepVariant + CCS model
    Diploid variant calls
    autosomes
              Precision  Recall   F1
      SNVs    99.914%    99.959%  99.936%
      Indels  96.901%    95.980%  96.438%
    chromosome 20
              Precision  Recall   F1
      SNVs    99.807%    99.904%  99.855%
      Indels  95.387%    94.501%  94.942%
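
The slide reports results for the autosomes and, separately, for chromosome 20, which lies outside the chr. 1-19 range used for training. Restricting an evaluation to chromosomes the model did not see can be sketched as below. This is an assumption-laden illustration: the `chrN` naming, the tuple layout of a call, and the helper names are all hypothetical, and the slide does not say how chromosomes 21 and 22 were used.

```python
# Chromosomes used for training, per the slide ("chr. 1-19").
TRAINING_CHROMS = {f"chr{i}" for i in range(1, 20)}
# All autosomes, chr1 through chr22.
AUTOSOMES = {f"chr{i}" for i in range(1, 23)}


def held_out_autosomes():
    """Autosomes that are not in the training set, in numeric order."""
    return sorted(AUTOSOMES - TRAINING_CHROMS, key=lambda c: int(c[3:]))


def select_calls(calls, chroms):
    """Keep only calls whose chromosome (first tuple field) is in `chroms`."""
    return [call for call in calls if call[0] in chroms]


# Hypothetical calls as (chromosome, position) tuples.
calls = [("chr1", 12345), ("chr20", 67890), ("chrX", 11111)]
print(held_out_autosomes())
print(select_calls(calls, {"chr20"}))
```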


  17. BUT WE’RE STILL LEAVING INFORMATION ON THE TABLE…


  18. BUT WE’RE STILL LEAVING INFORMATION ON THE TABLE…


  19. PHASE INFORMATION CAN IMPROVE PRECISION AND RECALL


  20. INCORPORATING PHASE INFORMATION LEADS TO IMPROVEMENTS IN VARIANT RECALL AND PRECISION
    GATK HC
              Precision  Recall   F1
      SNVs    99.468%    99.559%  99.513%
      Indels  78.977%    81.248%  80.097%
    Incorporate phase information and re-genotype variant positions with WhatsHap
    GATK HC + WhatsHap
              Precision  Recall   F1
      SNVs    99.693%    99.792%  99.742%
      Indels  81.102%    83.818%  82.438%
    Ebler, J. et al. Haplotype-aware genotyping from noisy long reads. bioRxiv, doi: 10.1101/293944
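
To put the GATK HC and GATK HC + WhatsHap tables side by side, the sketch below computes the absolute F1 gain and the share of the remaining gap to 100% that re-genotyping closes. The "gap closed" measure is a convenience introduced here for illustration, not a metric from the slides, and the function names are hypothetical.

```python
# F1 values in percent, copied from the two tables on this slide.
GATK_HC = {"SNVs": 99.513, "Indels": 80.097}
GATK_HC_WHATSHAP = {"SNVs": 99.742, "Indels": 82.438}


def f1_gain(before: float, after: float) -> float:
    """Absolute change in F1, in percentage points."""
    return after - before


def gap_closed(before: float, after: float) -> float:
    """Fraction of the distance between `before` and 100% that is recovered."""
    return (after - before) / (100.0 - before)


for variant_type in GATK_HC:
    before = GATK_HC[variant_type]
    after = GATK_HC_WHATSHAP[variant_type]
    print(f"{variant_type}: +{f1_gain(before, after):.3f} points, "
          f"{gap_closed(before, after):.1%} of the gap closed")
```

Read this way, the indel gain is larger in absolute points, while the SNV gain is larger relative to how little room was left to improve.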


  21. INCORPORATING PHASE INFORMATION LEADS TO MODEST INCREASES IN VARIANT RECALL AND PRECISION
    GATK HC
              Precision  Recall   F1
      SNVs    99.468%    99.559%  99.513%
      Indels  78.977%    81.248%  80.097%
    Incorporate phase information and re-genotype variant positions with WhatsHap
    GATK HC + WhatsHap
              Precision  Recall   F1
      SNVs    99.693%    99.792%  99.742%
      Indels  81.102%    83.818%  82.438%
    DeepVariant CCS
              Precision  Recall   F1
      SNVs    99.914%    99.959%  99.936%
      Indels  96.901%    95.980%  96.438%
    Tag reads with phase information from trio data and sort alignments by haplotype
    https://goo.gl/4cnMeC
    DeepVariant CCS + haplotype sorting
              Precision  Recall   F1
      SNVs    99.904%    99.963%  99.934%
      Indels  97.835%    97.141%  97.486%
    Ebler, J. et al. Haplotype-aware genotyping from noisy long reads. bioRxiv, doi: 10.1101/293944
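
The haplotype-sorting step described on this slide (tag reads with phase information, then sort alignments by haplotype) can be sketched on plain records. This is a minimal illustration, not the presenter's implementation: the `Read` type, its fields, and the choice to place untagged reads last are assumptions, and the sketch does not read or write BAM files.

```python
from typing import List, NamedTuple, Optional


class Read(NamedTuple):
    name: str
    start: int                 # alignment start position
    haplotype: Optional[int]   # 1 or 2 if the read was phased, None otherwise


def sort_by_haplotype(reads: List[Read]) -> List[Read]:
    """Order reads by haplotype (1, then 2, then untagged), then by start position."""
    def key(read: Read):
        # Untagged reads sort after tagged ones; ties are broken by position.
        return (read.haplotype is None, read.haplotype or 0, read.start)
    return sorted(reads, key=key)


reads = [
    Read("a", 100, 2),
    Read("b", 50, None),
    Read("c", 75, 1),
    Read("d", 60, 2),
]
for read in sort_by_haplotype(reads):
    print(read.name, read.haplotype, read.start)
```

Grouping reads this way places reads from the same haplotype next to each other. How the variant caller consumes that ordering is not shown in the slides.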


  22. CONCLUSIONS
    -GATK HaplotypeCaller (optimized for short reads) can be used to detect SNVs with high recall and precision, but has trouble discriminating between biological indels and sequencing errors.
    -DeepVariant, when trained on long reads, can be used to detect both SNVs and indels with high recall and precision.
    -Both workflows can be improved by providing long-distance phasing information, but there’s still work to be done in this area.


  23. ACKNOWLEDGEMENTS
    Google AI Genomics: Alexey Kolesnikov, Pi-Chuan Chang, Andrew Carroll, Mark DePristo
    Saarland University/Max Planck Institute for Informatics: Jana Ebler, Tobias Marschall
    PacBio: Yufeng Qian, Richard Hall, Aaron Wenger, Paul Peluso, David Rank, Mike Hunkapiller
    DNAnexus: Jason Chin
    NIST: Nathan Olson, Justin Zook


  24. For Research Use Only. Not for use in diagnostic procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. FEMTO Pulse and Fragment Analyzer are trademarks of Advanced Analytical Technologies. All other trademarks are the sole property of their respective owners.
    www.pacb.com
