Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Other GEMINI tools

Aaron Quinlan
August 19, 2015

Other GEMINI tools

Aaron Quinlan

August 19, 2015
Tweet

More Decks by Aaron Quinlan

Other Decks in Science

Transcript

  1. GEMINI (cont.)
    Aaron Quinlan and Brent Pedersen
    University of Utah
    !
    !
    !
    !
    !
    quinlanlab.org

    View Slide

  2. Analysis options
    15 min: finish up recessive and ROH
    5 min: annotate tool
    5 min: variant impacts
    10 min: Mendel violations
    10: faster, more flexible loading (VCFANNO)
    10: gene.iobio

    View Slide

  3. Finding canonical recessive
    variants with GEMINI

    View Slide

  4. The autosomal_recessive tool.
    http://gemini.readthedocs.org/en/latest/content/tools.html#autosomal-recessive-find-variants-meeting-an-autosomal-recessive-model

    View Slide

  5. The autosomal_recessive tool.
    Default behavior:

    View Slide

  6. The autosomal_recessive tool.
    The -­‐-­‐min-­‐kindreds  option:
    This specifies the number of families required to have a variant in
    the same gene in order for it to be reported. For example, we may
    only be interested in candidates where at least 2 families have a
    variant in that gene.

    View Slide

  7. The autosomal_recessive tool.
    -­‐-­‐filter  for variants with potential functional consequence:

    View Slide

  8. The autosomal_recessive tool: other options
    The —gt-­‐pl-­‐max  option:
    In order to eliminate less confident genotypes, it is possible to
    enforce a maximum PL value for each sample. On this scale, lower
    values indicate more confidence that the called genotype is
    correct. 10 is a reasonable value:
    What is the “PL”?
    https://samtools.github.io/hts-specs/VCFv4.2.pdf
    What is a “Phred scaled” genotype likelihood?

    View Slide

  9. The autosomal_recessive tool: other options
    What is a “Phred scaled” genotype likelihood? Example calculation based on the GATK HaplotypeCaller
    http://gatkforums.broadinstitute.org/discussion/5913/math-notes-how-pl-is-calculated-in-haplotypecaller

    View Slide

  10. Runs of homozygosity
    Method 1: intersecting with
    previously known regions

    View Slide

  11. bgzip homoz_region.bed
    tabix -p bed homoz_region.bed.gz
    Intersect with observed homozygosity region(s)
    (Example commands)
    1. Tabix a BED file with the observed homozygosity regions
    gemini annotate -f homoz_region.bed.gz \
    –c homoz_region \
    -t boolean \
    AR.db
    2. Use the annotate tool to flag variants that overlap these regions.
    gemini autosomal_recessive AR.db --columns "chrom, start, end,
    ref, alt, filter, qual, gene, impact, aaf_esp_ea, aaf_1kg_eur”
    -–filter "filter is NULL and aaf_esp_ea < 0.1 and
    (impact_severity = 'HIGH' or impact_severity = 'MED') and
    homoz_region ==1”
    3. Filter variants for those that overlap these regions.

    View Slide

  12. Runs of homozygosity
    Method 2: search for runs of
    homozygosity

    View Slide

  13. Intersect with observed homozygosity region(s)
    gemini roh AR.db
    Run the roh tool to search for candidate runs of homozygosity.

    View Slide

  14. Intersect with observed homozygosity region(s)
    gemini roh AR.db | sort -k7nr
    Run the roh tool to search for candidate runs of homozygosity.
    chrom start end sample num_of_snps density_per_kb run_length_in_bp!
    chr2 92134506 95353861 S101 151 0.0475 3219355!
    chr2 92188685 95359306 S101 75 0.0243 3170621!
    chr2 92189814 95353861 S106 68 0.0221 3164047!
    chr2 92189814 95329940 S138 50 0.0166 3140126!
    chr2 92262183 95365848 S106 57 0.019 3103665!
    chr2 92264378 95357391 S138 39 0.0133 3093013!
    chr2 233336080 234631638 S138 2583 1.9953 1295558!
    chr2 184922511 185985238 S101 1983 1.8678 1062727!
    chr2 119980657 121036333 S106 1921 1.8216 1055676!
    chr2 184994526 186030052 S101 1956 1.8908 1035526!
    chr2 136161952 137038729 S106 1529 1.7462 876777!
    chr2 119952406 120805007 S106 1601 1.8801 852601!
    chr2 135168500 136012610 S106 1509 1.7901 844110!
    chr2 136266716 137041366 S106 1372 1.7737 774650!
    chr2 134764013 135515261 S101 1559 2.0779 751248!
    chr2 134821134 135536838 S101 1477 2.0665 715704!
    chr2 184390618 185105579 S138 1453 2.0351 714961!
    chr2 187780557 188481382 S101 1279 1.8278 700825!
    chr2 135162250 135837013 S106 1268 1.8821 674763

    View Slide

  15. Caveats when screening for runs of homozygosity
    1. Difficult with exome data. Density of markers.
    2. Shorter runs of homozygosity happen often by chance.
    3. Density of homozygotes is important.
    http://www.nature.com/nature/journal/v449/n7164/extref/nature06258-s1.pdf

    View Slide

  16. The GEMINI annotate tool

    View Slide

  17. Goal: extend a GEMINI database with custom annotations
    http://gemini.readthedocs.org/en/latest/content/tools.html#annotate-adding-your-own-custom-annotations

    View Slide

  18. Goal: extend a GEMINI database with custom annotations
    http://gemini.readthedocs.org/en/latest/content/tools.html#annotate-adding-your-own-custom-annotations

    View Slide

  19. Goal: extend a GEMINI database with custom annotations
    http://gemini.readthedocs.org/en/latest/content/tools.html#annotate-adding-your-own-custom-annotations

    View Slide

  20. Goal: extend a GEMINI database with custom annotations
    http://gemini.readthedocs.org/en/latest/content/tools.html#annotate-adding-your-own-custom-annotations

    View Slide

  21. Goal: extend a GEMINI database with custom annotations
    http://gemini.readthedocs.org/en/latest/content/tools.html#annotate-adding-your-own-custom-annotations

    View Slide

  22. The variant_impacts table

    View Slide

  23. variant_impacts tracks the functional impact on every transcript
    …whereas the variants table stores the most deleterious impact

    View Slide

  24. The mendelian_error tool

    View Slide

  25. mendelian_error
    $  gemini  mendel_errors  -­‐-­‐columns  "chrom,start,end"  test.mendel.db  -­‐-­‐gt-­‐pl-­‐max  1  
    chrom              start      end            family_members              family_genotypes              violation                              violation_prob  
    chr1                10670      10671        dad,mom,child                G/G,G/G,G/C                        plausible  de  novo              0.962  
    chr1                28493      28494        dad,mom,child                T/C,T/T,C/C                        loss  of  heterozygosity    0.660  
    chr1                28627      28628        dad,mom,child                C/C,C/C,C/T                        plausible  de  novo              0.989  
    chr1                267558    267560      dad,mom,child                C/C,C/C,CT/C                      plausible  de  novo              0.896  
    chr1                537969    537970      dad,mom,child                C/C,C/C,C/T                        plausible  de  novo              0.928  
    chr1                547518    547519      dad,mom,child                G/G,G/G,G/T                        plausible  de  novo              1.000  
    chr1                589081    589086      dad,mom,child                G/G,GAGAA/GAGAA,G/G        uniparental  disomy            0.940  
    chr1                749688    749689      dad,mom,child                T/T,T/T,G/G                        implausible  de  novo          0.959  
    chr1                788944    788945      dad,mom,child                C/C,G/G,G/G                        uniparental  disomy            0.914  
    chr1                1004248  1004249    dad,mom,child                G/G,G/G,G/C                        plausible  de  novo              1.000

    View Slide

  26. Speeding up database loading
    with vcfanno

    View Slide

  27. GEMINI current
    • recommended, vetted annotations
    • fast loading
    • any organism supported by VEP / SnpEff
    • custom annotations treated same as vetted

    View Slide

  28. GEMINI future
    • most annotations are fixed
    • can add some custom annotations after load
    • loading is slow
    • stuck to single genome version

    View Slide

  29. vcfanno https://github.com/brentp/vcfanno

    View Slide

  30. Configuration
    [[annotation]]  
    file="ALL.wgs.phase3_shapeit2_mvncall_integrated_v5a.
    20130502.sites.tidy.vcf.gz"  
    fields=["AF",  "AMR_AF",  "EUR_AF",  ...]  
    names=["in_1kg_flag",  "aaf_1kg_amr_float",  
    "aaf_1kg_eas_float",  ...]  
    ops=["flag",  "max",  "max",  ...]  
    !
    Specify an annotation, file, which (VCF INFO) fields to pull, and how to report
    them.
    !
    We’ll include a vetted file like this for gemini for human, but users can modify it
    and/or create their own for other organisms.
    !
    Possible to create custom database with only columns of interest.
    https://github.com/brentp/vcfanno/blob/master/example/gem.conf

    View Slide

  31. vcfanno loading performance
    • 10 million ExAC variants annotated with 34
    annotations from 11 distinct files in ~30
    minutes
    • current gemini takes at least 40X longer to
    load the same number of variants.

    View Slide

  32. Modularize functionality
    • pedigree (determine modes of inheritance)
    • inheritance models (rules for autosomal rec/
    dom, de novo)
    • effect parsing/prioritizing
    • normalize between SnpEff / VEP
    • prioritize by impact (missense over
    synonymous)
    separating functionality improves code reuse, eases testing, and simplifies code
    maintenance

    View Slide

  33. gemini query performance improvements with bcolz

    View Slide

  34. gene.iobio.io
    Gabor Marth’s Laboratory
    University of Utah
    !

    View Slide

  35. gene.iobio.io

    View Slide