Other GEMINI tools

91f1e43339bdc1bd3690295bfaeeb17e?s=47 Aaron Quinlan
August 19, 2015

Other GEMINI tools

91f1e43339bdc1bd3690295bfaeeb17e?s=128

Aaron Quinlan

August 19, 2015
Tweet

Transcript

  1. GEMINI (cont.) Aaron Quinlan and Brent Pedersen University of Utah

    ! ! ! ! ! quinlanlab.org
  2. Analysis options 15 min: finish up recessive and ROH 5

    min: annotate tool 5 min: variant impacts 10 min: Mendel violations 10: faster, more flexible loading (VCFANNO) 10: gene.iobio
  3. Finding canonical recessive variants with GEMINI

  4. The autosomal_recessive tool. http://gemini.readthedocs.org/en/latest/content/tools.html#autosomal-recessive-find-variants-meeting-an-autosomal-recessive-model

  5. The autosomal_recessive tool. Default behavior:

  6. The autosomal_recessive tool. The -­‐-­‐min-­‐kindreds  option: This specifies the number

    of families required to have a variant in the same gene in order for it to be reported. For example, we may only be interested in candidates where at least 2 families have a variant in that gene.
  7. The autosomal_recessive tool. -­‐-­‐filter  for variants with potential functional consequence:

  8. The autosomal_recessive tool: other options The —gt-­‐pl-­‐max  option: In order

    to eliminate less confident genotypes, it is possible to enforce a maximum PL value for each sample. On this scale, lower values indicate more confidence that the called genotype is correct. 10 is a reasonable value: What is the “PL”? https://samtools.github.io/hts-specs/VCFv4.2.pdf What is a “Phred scaled” genotype likelihood?
  9. The autosomal_recessive tool: other options What is a “Phred scaled”

    genotype likelihood? Example calculation based on the GATK HaplotypeCaller http://gatkforums.broadinstitute.org/discussion/5913/math-notes-how-pl-is-calculated-in-haplotypecaller
  10. Runs of homozygosity Method 1: intersecting with previously known regions

  11. bgzip homoz_region.bed tabix -p bed homoz_region.bed.gz Intersect with observed homozygosity

    region(s) (Example commands) 1. Tabix a BED file with the observed homozygosity regions gemini annotate -f homoz_region.bed.gz \ –c homoz_region \ -t boolean \ AR.db 2. Use the annotate tool to flag variants that overlap these regions. gemini autosomal_recessive AR.db --columns "chrom, start, end, ref, alt, filter, qual, gene, impact, aaf_esp_ea, aaf_1kg_eur” -–filter "filter is NULL and aaf_esp_ea < 0.1 and (impact_severity = 'HIGH' or impact_severity = 'MED') and homoz_region ==1” 3. Filter variants for those that overlap these regions.
  12. Runs of homozygosity Method 2: search for runs of homozygosity

  13. Intersect with observed homozygosity region(s) gemini roh AR.db Run the

    roh tool to search for candidate runs of homozygosity.
  14. Intersect with observed homozygosity region(s) gemini roh AR.db | sort

    -k7nr Run the roh tool to search for candidate runs of homozygosity. chrom start end sample num_of_snps density_per_kb run_length_in_bp! chr2 92134506 95353861 S101 151 0.0475 3219355! chr2 92188685 95359306 S101 75 0.0243 3170621! chr2 92189814 95353861 S106 68 0.0221 3164047! chr2 92189814 95329940 S138 50 0.0166 3140126! chr2 92262183 95365848 S106 57 0.019 3103665! chr2 92264378 95357391 S138 39 0.0133 3093013! chr2 233336080 234631638 S138 2583 1.9953 1295558! chr2 184922511 185985238 S101 1983 1.8678 1062727! chr2 119980657 121036333 S106 1921 1.8216 1055676! chr2 184994526 186030052 S101 1956 1.8908 1035526! chr2 136161952 137038729 S106 1529 1.7462 876777! chr2 119952406 120805007 S106 1601 1.8801 852601! chr2 135168500 136012610 S106 1509 1.7901 844110! chr2 136266716 137041366 S106 1372 1.7737 774650! chr2 134764013 135515261 S101 1559 2.0779 751248! chr2 134821134 135536838 S101 1477 2.0665 715704! chr2 184390618 185105579 S138 1453 2.0351 714961! chr2 187780557 188481382 S101 1279 1.8278 700825! chr2 135162250 135837013 S106 1268 1.8821 674763
  15. Caveats when screening for runs of homozygosity 1. Difficult with

    exome data. Density of markers. 2. Shorter runs of homozygosity happen often by chance. 3. Density of homozygotes is important. http://www.nature.com/nature/journal/v449/n7164/extref/nature06258-s1.pdf
  16. The GEMINI annotate tool

  17. Goal: extend a GEMINI database with custom annotations http://gemini.readthedocs.org/en/latest/content/tools.html#annotate-adding-your-own-custom-annotations

  18. Goal: extend a GEMINI database with custom annotations http://gemini.readthedocs.org/en/latest/content/tools.html#annotate-adding-your-own-custom-annotations

  19. Goal: extend a GEMINI database with custom annotations http://gemini.readthedocs.org/en/latest/content/tools.html#annotate-adding-your-own-custom-annotations

  20. Goal: extend a GEMINI database with custom annotations http://gemini.readthedocs.org/en/latest/content/tools.html#annotate-adding-your-own-custom-annotations

  21. Goal: extend a GEMINI database with custom annotations http://gemini.readthedocs.org/en/latest/content/tools.html#annotate-adding-your-own-custom-annotations

  22. The variant_impacts table

  23. variant_impacts tracks the functional impact on every transcript …whereas the

    variants table stores the most deleterious impact
  24. The mendelian_error tool

  25. mendelian_error $  gemini  mendel_errors  -­‐-­‐columns  "chrom,start,end"  test.mendel.db  -­‐-­‐gt-­‐pl-­‐max  1  

    chrom              start      end            family_members              family_genotypes              violation                              violation_prob   chr1                10670      10671        dad,mom,child                G/G,G/G,G/C                        plausible  de  novo              0.962   chr1                28493      28494        dad,mom,child                T/C,T/T,C/C                        loss  of  heterozygosity    0.660   chr1                28627      28628        dad,mom,child                C/C,C/C,C/T                        plausible  de  novo              0.989   chr1                267558    267560      dad,mom,child                C/C,C/C,CT/C                      plausible  de  novo              0.896   chr1                537969    537970      dad,mom,child                C/C,C/C,C/T                        plausible  de  novo              0.928   chr1                547518    547519      dad,mom,child                G/G,G/G,G/T                        plausible  de  novo              1.000   chr1                589081    589086      dad,mom,child                G/G,GAGAA/GAGAA,G/G        uniparental  disomy            0.940   chr1                749688    749689      dad,mom,child                T/T,T/T,G/G                        implausible  de  novo          0.959   chr1                788944    788945      dad,mom,child                C/C,G/G,G/G                        uniparental  disomy            0.914   chr1                1004248  1004249    dad,mom,child                G/G,G/G,G/C                        plausible  de  novo              1.000
  26. Speeding up database loading with vcfanno

  27. GEMINI current • recommended, vetted annotations • fast loading •

    any organism supported by VEP / SnpEff • custom annotations treated same as vetted
  28. GEMINI future • most annotations are fixed • can add

    some custom annotations after load • loading is slow • stuck to single genome version
  29. vcfanno https://github.com/brentp/vcfanno

  30. Configuration [[annotation]]   file="ALL.wgs.phase3_shapeit2_mvncall_integrated_v5a. 20130502.sites.tidy.vcf.gz"   fields=["AF",  "AMR_AF",  "EUR_AF",  ...]

      names=["in_1kg_flag",  "aaf_1kg_amr_float",   "aaf_1kg_eas_float",  ...]   ops=["flag",  "max",  "max",  ...]   ! Specify an annotation, file, which (VCF INFO) fields to pull, and how to report them. ! We’ll include a vetted file like this for gemini for human, but users can modify it and/or create their own for other organisms. ! Possible to create custom database with only columns of interest. https://github.com/brentp/vcfanno/blob/master/example/gem.conf
  31. vcfanno loading performance • 10 million ExAC variants annotated with

    34 annotations from 11 distinct files in ~30 minutes • current gemini takes at least 40X longer to load the same number of variants.
  32. Modularize functionality • pedigree (determine modes of inheritance) • inheritance

    models (rules for autosomal rec/ dom, de novo) • effect parsing/prioritizing • normalize between SnpEff / VEP • prioritize by impact (missense over synonymous) separating functionality improves code reuse, eases testing, and simplifies code maintenance
  33. gemini query performance improvements with bcolz

  34. gene.iobio.io Gabor Marth’s Laboratory University of Utah !

  35. gene.iobio.io