Lock in $30 Savings on PRO—Offer Ends Soon! ⏳

Identifying recessive candidates with GEMINI

Aaron Quinlan
August 19, 2015

Identifying recessive candidates with GEMINI

Aaron Quinlan

August 19, 2015
Tweet

More Decks by Aaron Quinlan

Other Decks in Science

Transcript

  1. Identifying recessive gene candidates with GEMINI Please refer to the

    following Github Gist to find each command for this session. Commands should be copy/pasted from this Gist Aaron Quinlan University of Utah ! ! ! ! ! quinlanlab.org 1 https://gist.github.com/arq5x/9e1928638397ba45da2e#file-autosomal-recessive-sh
  2. “Phasing” a VCF file by transmission with GATK Jessica Chong

    G/G A/A A/G A/G A/A A/G A/A A/G A/G Jessica Chong 7
  3. “Phasing” a VCF file by transmission with GATK Jessica Chong

    G/G A/A A/G A/G A/A A/G A/A A/G A/G A/G A/G A/G ? Jessica Chong 7
  4. “Phasing” a VCF file by transmission with GEMINI G/G A/A

    A/G A/G A/A A/G A/A A/G A/G A/G A/G A/G ? 8
  5. Phasing by transmission C/C * Convention for phased genotype is

    maternal allele first A/G A/A A/G C/T C/T 9
  6. Phasing by transmission C/C G|A C|T Both sites phasable: high

    confidence as deleterious alleles on different chromosomes * Convention for phased genotype is maternal allele first A/G A/A A/G C/T C/T 9
  7. Phasing by transmission C/C G|A C|T Both sites phasable: high

    confidence as deleterious alleles on different chromosomes * Convention for phased genotype is maternal allele first A/G A/A A/G C/T C/T C/T G/A A/A A/G C/C C/T 9
  8. Phasing by transmission C/C G|A C|T Both sites phasable: high

    confidence as deleterious alleles on different chromosomes * Convention for phased genotype is maternal allele first G|A T|C Both sites phasable: yet exclude as deleterious alleles on same chromosomes A/G A/A A/G C/T C/T C/T G/A A/A A/G C/C C/T 9
  9. Phasing by transmission C/C G|A C|T Both sites phasable: high

    confidence as deleterious alleles on different chromosomes * Convention for phased genotype is maternal allele first G|A T|C Both sites phasable: yet exclude as deleterious alleles on same chromosomes A/G A/A A/G C/T C/T C/T G/G A/A A/G C/T C/T C/T G/A A/A A/G C/C C/T 9
  10. Phasing by transmission C/C G|A C|T Both sites phasable: high

    confidence as deleterious alleles on different chromosomes * Convention for phased genotype is maternal allele first G|A T|C Both sites phasable: yet exclude as deleterious alleles on same chromosomes G|A ? Only one site is phasable: lower confidence but cannot necessarily exclude. A/G A/A A/G C/T C/T C/T G/G A/A A/G C/T C/T C/T G/A A/A A/G C/C C/T 9
  11. Phasing by transmission C/C G|A C|T Both sites phasable: high

    confidence as deleterious alleles on different chromosomes * Convention for phased genotype is maternal allele first G|A T|C Both sites phasable: yet exclude as deleterious alleles on same chromosomes G|A ? Only one site is phasable: lower confidence but cannot necessarily exclude. A/G A/A A/G C/T C/T C/T G/G A/A A/G C/T C/T C/T G/A A/A A/G C/C C/T A/G A/G A/G C/T C/T C/T 9
  12. Phasing by transmission C/C G|A C|T Both sites phasable: high

    confidence as deleterious alleles on different chromosomes * Convention for phased genotype is maternal allele first G|A T|C Both sites phasable: yet exclude as deleterious alleles on same chromosomes G|A ? Only one site is phasable: lower confidence but cannot necessarily exclude. A/G A/A A/G C/T C/T C/T G/G A/A A/G C/T C/T C/T G/A A/A A/G C/C C/T A/G A/G A/G C/T C/T C/T ? ? Neither site is phasable: lower confidence but cannot necessarily exclude (recombination?). 9
  13. The comp_hets tool in GEMINI http://gemini.readthedocs.org/en/latest/content/tools.html#comp-hets-identifying-potential-compound-heterozygotes Requires a PED file

    #family_id    sample_id    paternal_id    maternal_id    sex    phenotype     family1          1805              -­‐9                      -­‐9                      2        1                     family1          1847              -­‐9                      -­‐9                      1        1                     family1          4805              1847                  1805                  1        2                   4805 1847 1805 11
  14. Create a GEMINI database from a VCF Notes: 1. The

    VCF has been normalized and decomposed with VT 2. The VCF has been annotated with VEP. $  curl  https://s3.amazonaws.com/gemini-­‐tutorials/trio.trim.vep.vcf.gz  >  trio.trim.vep.vcf.gz   $  curl  https://s3.amazonaws.com/gemini-­‐tutorials/recessive.ped  >  recessive.ped   $  gemini  load  -­‐-­‐cores  4\                              -­‐v  trio.trim.vep.vcf.gz  \                              -­‐t  VEP  \                              -­‐-­‐skip-­‐gene-­‐tables  \                              -­‐p  recessive.ped  \   !                  trio.trim.vep.recessive.db Note: copy and paste the full command from the Github Gist to avoid errors 4805 1847 1805 http://gemini.readthedocs.org/en/latest/content/preprocessing.html#step-1-split-left-align-and-trim-variants 12
  15. gemini comp_hets --columns "chrom, start, end, gene, impact, cadd_raw" trio.trim.vep.recessive.db

    chrom    start            end                gene      impact                  cadd_raw    variant_id    family_id    family_members                                                                family_genotypes          samples    family_count    comp_het_id              priority   chr2      69702114      69702115      AAK1      UTR_3_prime        -­‐3.52          1638                family1        1805;unaffected,1847;unaffected,4805;affected    G/C,G/C,G/C                  4805          1                          1_1638_1646              2   chr2      69870243      69870244      AAK1      UTR_5_prime        1.04            1646                family1        1805;unaffected,1847;unaffected,4805;affected    G/G,G/A,G|A                  4805          1                          1_1638_1646              2   chr2      69702108      69702109      AAK1      UTR_3_prime        -­‐2.51          1637                family1        1805;unaffected,1847;unaffected,4805;affected    A/C,A/C,A/C                  4805          1                          2_1637_1638              2   chr2      69702114      69702115      AAK1      UTR_3_prime        -­‐3.52          1638                family1        1805;unaffected,1847;unaffected,4805;affected    G/C,G/C,G/C                  4805          1                          2_1637_1638              2   chr2      69702101      69702102      AAK1      UTR_3_prime        0.39            1636                family1        1805;unaffected,1847;unaffected,4805;affected    T/T,T/T,T/C                  4805          1                          3_1636_1646              2   chr2      69870243      69870244      AAK1      UTR_5_prime        1.04            1646                family1        1805;unaffected,1847;unaffected,4805;affected    G/G,G/A,G|A                  4805          1                          3_1636_1646              2   chr2      69702114      69702115      AAK1      UTR_3_prime        -­‐3.52          1638                family1        1805;unaffected,1847;unaffected,4805;affected    G/C,G/C,G/C                  4805          1                          4_1638_1645              2   chr2      69870216      69870218      AAK1      UTR_5_prime        None            1645                family1        1805;unaffected,1847;unaffected,4805;affected    AT/A,AT/A,AT/A            4805          1                          4_1638_1645              2   chr2      69702101      69702102      AAK1      UTR_3_prime        0.39            1636                family1        1805;unaffected,1847;unaffected,4805;affected    T/T,T/T,T/C                  4805          1                          5_1636_1638              2 Again, we can limit the attributes returned w/ the --columns option. Note: copy and paste the full command from the Github Gist 14
  16. Start with highest priority compound heterozygote candidates C/C G|A C|T

    Both sites phasable: high confidence as deleterious alleles on different chromosomes A/G A/A A/G C/T C/T 15
  17. Restrict to highest priority (i.e, priority==1) candidates $ gemini comp_hets

    \ --columns "chrom, start, end, gene, impact, cadd_raw" \ trio.trim.vep.recessive.db \ | awk '$14==1' \ | head chrom    start            end                gene      impact                          cadd_raw    variant_id    family_id    family_members                                                                  family_genotypes      samples    family_count    comp_het_id            priority   chr15    89398552      89398553      ACAN      non_syn_coding          2.57            9519                family1        1805;unaffected,1847;unaffected,4805;affected    C/A,C/C,A|C                4805          1                          52_9519_9534          1   chr15    89417237      89417238      ACAN      non_syn_coding          -­‐0.06          9534                family1        1805;unaffected,1847;unaffected,4805;affected    A/A,A/G,A|G                4805          1                          52_9519_9534          1   chr15    89398552      89398553      ACAN      non_syn_coding          2.57            9519                family1        1805;unaffected,1847;unaffected,4805;affected    C/A,C/C,A|C                4805          1                          53_9519_9535          1   chr15    89417628      89417629      ACAN      splice_region            -­‐0.18          9535                family1        1805;unaffected,1847;unaffected,4805;affected    G/G,G/A,G|A                4805          1                          53_9519_9535          1   chr2      111598957    111598958    ACOXL    non_syn_coding          0.95            3247                family1        1805;unaffected,1847;unaffected,4805;affected    C/C,C/T,C|T                4805          1                          86_3247_3251          1   chr2      111850514    111850515    ACOXL    non_syn_coding          3.17            3251                family1        1805;unaffected,1847;unaffected,4805;affected    C/T,C/C,T|C                4805          1                          86_3247_3251          1   chr17    55182877      55182878      AKAP1    non_syn_coding          2.57            13305              family1        1805;unaffected,1847;unaffected,4805;affected    C/T,C/C,T|C                4805          1                          104_13305_13308    1   chr17    55183316      55183317      AKAP1    synonymous_coding    0.43            13308              family1        1805;unaffected,1847;unaffected,4805;affected    G/G,G/A,G|A                4805          1                          104_13305_13308    1   chr2      29448409      29448410      ALK        non_syn_coding          1.49            839                  family1        1805;unaffected,1847;unaffected,4805;affected    T/T,T/G,T|G                4805          1                          186_839_841            1   chr2      29455198      29455199      ALK        non_syn_coding          3.51            841                  family1        1805;unaffected,1847;unaffected,4805;affected    A/T,A/A,T|A                4805          1                          186_839_841            1 $ gemini comp_hets \ --columns "chrom, start, end, gene, impact, cadd_raw" \ trio.trim.vep.recessive.db \ | awk '$14==1' \ | wc -l 612  lines Each compund heterozygote is a set of two lines, so we have 306 (612 / 2) compound heterozygote candidates Note: copy and paste the full command from the Github Gist 16
  18. So many candidates. Time to start --filtering! $ gemini comp_hets

    \ --columns "chrom, start, end, gene, impact, cadd_raw" \ --filter "impact_severity != 'LOW'" \ trio.trim.vep.recessive.db \ | awk '$14==1' \ | wc -l 260  lines  (130  comp_hets) Note: copy and paste the full command from the Github Gist 17
  19. $ gemini comp_hets \ --columns "chrom, start, end, gene, impact,

    cadd_raw" \ --filter "impact_severity != 'LOW' \ and ((aaf_esp_ea <= 0.01 or aaf_esp_ea is NULL) \ and (aaf_exac_all <= 0.01 or aaf_exac_all is NULL))” \ trio.trim.vep.recessive.db \ | awk '$14==1' \ | wc -l Use ESP and ExAC to focus on rare variants 8  lines,  4  comp_hets Note: copy and paste the full command from the Github Gist 18
  20. $ gemini comp_hets \ --columns "chrom, start, end, gene, impact,

    cadd_raw" \ --filter "impact_severity != 'LOW' \ and ((aaf_esp_ea <= 0.01 or aaf_esp_ea is NULL) \ and (aaf_exac_all <= 0.01 or aaf_exac_all is NULL))” \ trio.trim.vep.recessive.db \ | awk '$14==1' Use ESP and ExAC to focus on rare variants chr17    78081692      78081693      GAA                      non_syn_coding    4.33    14401    family1    1805;unaffected,1847;unaffected4805;affected    T/T,T/C,T|C    4805    1    16_14401_14406      1   chr17    78084748      78084749      GAA                      non_syn_coding    5.51    14406    family1    1805;unaffected,1847;unaffected4805;affected    G/A,G/G,A|G    4805    1    16_14401_14406      1   chr2      129025757    129025758    HS6ST1                non_syn_coding    2.67    3657      family1    1805;unaffected,1847;unaffected4805;affected    C/A,C/C,A|C    4805    1    45_3657_3660          1   chr2      129075743    129075744    HS6ST1                non_syn_coding    3.25    3660      family1    1805;unaffected,1847;unaffected4805;affected    G/G,G/C,G|C    4805    1    45_3657_3660          1   chr22    45255643      45255644      PRR5-­‐ARHGAP8    non_syn_coding    1.88    16838    family1    1805;unaffected,1847;unaffected4805;affected    G/G,G/T,G|T    4805    1    169_16838_16839    1   chr22    45255687      45255688      PRR5-­‐ARHGAP8    non_syn_coding    0.37    16839    family1    1805;unaffected,1847;unaffected4805;affected    C/T,C/C,T|C    4805    1    169_16838_16839    1   chr15    71548994      71548995      THSD4                  non_syn_coding    3.68    8777      family1    1805;unaffected,1847;unaffected4805;affected    C/C,C/T,C|T    4805    1    181_8777_8780        1   chr15    71952898      71952899      THSD4                  non_syn_coding    4.52    8780      family1    1805;unaffected,1847;unaffected4805;affected    G/A,G/G,A|G    4805    1    181_8777_8780        1 Note: copy and paste the full command from the Github Gist 19
  21. Load the following files into IGV (Load from URL) and

    inspect your candidates BAM alignment files: ! https://s3.amazonaws.com/gemini-­‐tutorials/1805.workshop.bam   https://s3.amazonaws.com/gemini-­‐tutorials/1847.workshop.bam   https://s3.amazonaws.com/gemini-­‐tutorials/4805.workshop.bam VCF variant file: ! https://s3.amazonaws.com/gemini-­‐tutorials/trio.trim.vep.vcf.gz   ! 20
  22. The autosomal_recessive tool. The -­‐-­‐min-­‐kindreds  option: This specifies the number

    of families required to have a variant in the same gene in order for it to be reported. For example, we may only be interested in candidates where at least 2 families have a variant in that gene. 24
  23. The autosomal_recessive tool: other options The —gt-­‐pl-­‐max  option: In order

    to eliminate less confident genotypes, it is possible to enforce a maximum PL value for each sample. On this scale, lower values indicate more confidence that the called genotype is correct. 10 is a reasonable value: What is the “PL”? https://samtools.github.io/hts-specs/VCFv4.2.pdf What is a “Phred scaled” genotype likelihood? 26
  24. The autosomal_recessive tool: other options What is a “Phred scaled”

    genotype likelihood? Example calculation based on the GATK HaplotypeCaller http://gatkforums.broadinstitute.org/discussion/5913/math-notes-how-pl-is-calculated-in-haplotypecaller 27
  25. bgzip homoz_region.bed tabix -p bed homoz_region.bed.gz Intersect with observed homozygosity

    region(s) (Example commands) 1. Tabix a BED file with the observed homozygosity regions gemini annotate -f homoz_region.bed.gz \ –c homoz_region \ -t boolean \ AR.db 2. Use the annotate tool to flag variants that overlap these regions. gemini autosomal_recessive AR.db --columns "chrom, start, end, ref, alt, filter, qual, gene, impact, aaf_esp_ea, aaf_1kg_eur” -–filter "filter is NULL and aaf_esp_ea < 0.1 and (impact_severity = 'HIGH' or impact_severity = 'MED') and region ==1” 3. Filter variants for those that overlap these regions. 29
  26. Intersect with observed homozygosity region(s) gemini roh AR.db Run the

    roh tool to search for candidate runs of homozygosity. 31
  27. Intersect with observed homozygosity region(s) gemini roh AR.db | sort

    -k7nr Run the roh tool to search for candidate runs of homozygosity. chrom start end sample num_of_snps density_per_kb run_length_in_bp! chr2 92134506 95353861 S101 151 0.0475 3219355! chr2 92188685 95359306 S101 75 0.0243 3170621! chr2 92189814 95353861 S106 68 0.0221 3164047! chr2 92189814 95329940 S138 50 0.0166 3140126! chr2 92262183 95365848 S106 57 0.019 3103665! chr2 92264378 95357391 S138 39 0.0133 3093013! chr2 233336080 234631638 S138 2583 1.9953 1295558! chr2 184922511 185985238 S101 1983 1.8678 1062727! chr2 119980657 121036333 S106 1921 1.8216 1055676! chr2 184994526 186030052 S101 1956 1.8908 1035526! chr2 136161952 137038729 S106 1529 1.7462 876777! chr2 119952406 120805007 S106 1601 1.8801 852601! chr2 135168500 136012610 S106 1509 1.7901 844110! chr2 136266716 137041366 S106 1372 1.7737 774650! chr2 134764013 135515261 S101 1559 2.0779 751248! chr2 134821134 135536838 S101 1477 2.0665 715704! chr2 184390618 185105579 S138 1453 2.0351 714961! chr2 187780557 188481382 S101 1279 1.8278 700825! chr2 135162250 135837013 S106 1268 1.8821 674763 32
  28. Caveats when screening for runs of homozygosity 1. Difficult with

    exome data. Density of markers. 2. Shorter runs of homozygosity happen often by chance. 3. Density of homozygotes is important. http://www.nature.com/nature/journal/v449/n7164/extref/nature06258-s1.pdf 33