Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Identifying recessive candidates with GEMINI

Aaron Quinlan
August 19, 2015

Identifying recessive candidates with GEMINI

Aaron Quinlan

August 19, 2015
Tweet

More Decks by Aaron Quinlan

Other Decks in Science

Transcript

  1. Identifying recessive gene
    candidates with GEMINI
    Please refer to the following Github Gist to find each command for this session.
    Commands should be copy/pasted from this Gist
    Aaron Quinlan
    University of Utah
    !
    !
    !
    !
    !
    quinlanlab.org
    1
    https://gist.github.com/arq5x/9e1928638397ba45da2e#file-autosomal-recessive-sh

    View Slide

  2. Compound heterozygote
    detective work with GEMINI
    2

    View Slide

  3. Compound het refresher
    3

    View Slide

  4. Example compound heterozygote
    Dad Mom
    Kid
    G
    C
    G
    t
    T
    a
    C C
    G
    a
    C
    G
    t
    4

    View Slide

  5. Phasing genotypes
    Jessica Chong
    5

    View Slide

  6. The result of phasing by transmission
    Jessica Chong
    6

    View Slide

  7. “Phasing” a VCF file by transmission with GATK
    Jessica Chong
    Jessica Chong
    7

    View Slide

  8. “Phasing” a VCF file by transmission with GATK
    Jessica Chong
    G/G A/A
    A/G
    Jessica Chong
    7

    View Slide

  9. “Phasing” a VCF file by transmission with GATK
    Jessica Chong
    G/G A/A
    A/G
    A/G A/A
    A/G
    Jessica Chong
    7

    View Slide

  10. “Phasing” a VCF file by transmission with GATK
    Jessica Chong
    G/G A/A
    A/G
    A/G A/A
    A/G
    A/A A/G
    A/G
    Jessica Chong
    7

    View Slide

  11. “Phasing” a VCF file by transmission with GATK
    Jessica Chong
    G/G A/A
    A/G
    A/G A/A
    A/G
    A/A A/G
    A/G
    A/G A/G
    A/G ?
    Jessica Chong
    7

    View Slide

  12. “Phasing” a VCF file by transmission with GEMINI
    8

    View Slide

  13. “Phasing” a VCF file by transmission with GEMINI
    G/G A/A
    A/G
    8

    View Slide

  14. “Phasing” a VCF file by transmission with GEMINI
    G/G A/A
    A/G
    A/G A/A
    A/G
    8

    View Slide

  15. “Phasing” a VCF file by transmission with GEMINI
    G/G A/A
    A/G
    A/G A/A
    A/G
    A/A A/G
    A/G
    8

    View Slide

  16. “Phasing” a VCF file by transmission with GEMINI
    G/G A/A
    A/G
    A/G A/A
    A/G
    A/A A/G
    A/G
    A/G A/G
    A/G ?
    8

    View Slide

  17. Phasing by transmission
    C/C
    * Convention for phased genotype is maternal allele first
    A/G
    A/A
    A/G
    C/T
    C/T
    9

    View Slide

  18. Phasing by transmission
    C/C
    G|A C|T
    Both sites phasable: high confidence as
    deleterious alleles on different chromosomes
    * Convention for phased genotype is maternal allele first
    A/G
    A/A
    A/G
    C/T
    C/T
    9

    View Slide

  19. Phasing by transmission
    C/C
    G|A C|T
    Both sites phasable: high confidence as
    deleterious alleles on different chromosomes
    * Convention for phased genotype is maternal allele first
    A/G
    A/A
    A/G
    C/T
    C/T
    C/T
    G/A
    A/A
    A/G
    C/C
    C/T
    9

    View Slide

  20. Phasing by transmission
    C/C
    G|A C|T
    Both sites phasable: high confidence as
    deleterious alleles on different chromosomes
    * Convention for phased genotype is maternal allele first
    G|A T|C
    Both sites phasable: yet exclude as
    deleterious alleles on same chromosomes
    A/G
    A/A
    A/G
    C/T
    C/T
    C/T
    G/A
    A/A
    A/G
    C/C
    C/T
    9

    View Slide

  21. Phasing by transmission
    C/C
    G|A C|T
    Both sites phasable: high confidence as
    deleterious alleles on different chromosomes
    * Convention for phased genotype is maternal allele first
    G|A T|C
    Both sites phasable: yet exclude as
    deleterious alleles on same chromosomes
    A/G
    A/A
    A/G
    C/T
    C/T
    C/T
    G/G
    A/A
    A/G
    C/T
    C/T
    C/T
    G/A
    A/A
    A/G
    C/C
    C/T
    9

    View Slide

  22. Phasing by transmission
    C/C
    G|A C|T
    Both sites phasable: high confidence as
    deleterious alleles on different chromosomes
    * Convention for phased genotype is maternal allele first
    G|A T|C
    Both sites phasable: yet exclude as
    deleterious alleles on same chromosomes
    G|A ?
    Only one site is phasable: lower confidence
    but cannot necessarily exclude.
    A/G
    A/A
    A/G
    C/T
    C/T
    C/T
    G/G
    A/A
    A/G
    C/T
    C/T
    C/T
    G/A
    A/A
    A/G
    C/C
    C/T
    9

    View Slide

  23. Phasing by transmission
    C/C
    G|A C|T
    Both sites phasable: high confidence as
    deleterious alleles on different chromosomes
    * Convention for phased genotype is maternal allele first
    G|A T|C
    Both sites phasable: yet exclude as
    deleterious alleles on same chromosomes
    G|A ?
    Only one site is phasable: lower confidence
    but cannot necessarily exclude.
    A/G
    A/A
    A/G
    C/T
    C/T
    C/T
    G/G
    A/A
    A/G
    C/T
    C/T
    C/T
    G/A
    A/A
    A/G
    C/C
    C/T
    A/G
    A/G
    A/G
    C/T C/T
    C/T
    9

    View Slide

  24. Phasing by transmission
    C/C
    G|A C|T
    Both sites phasable: high confidence as
    deleterious alleles on different chromosomes
    * Convention for phased genotype is maternal allele first
    G|A T|C
    Both sites phasable: yet exclude as
    deleterious alleles on same chromosomes
    G|A ?
    Only one site is phasable: lower confidence
    but cannot necessarily exclude.
    A/G
    A/A
    A/G
    C/T
    C/T
    C/T
    G/G
    A/A
    A/G
    C/T
    C/T
    C/T
    G/A
    A/A
    A/G
    C/C
    C/T
    A/G
    A/G
    A/G
    C/T C/T
    C/T
    ? ?
    Neither site is phasable: lower confidence but
    cannot necessarily exclude (recombination?).
    9

    View Slide

  25. Compound het test case
    Jessica Chong
    10

    View Slide

  26. The comp_hets tool in GEMINI
    http://gemini.readthedocs.org/en/latest/content/tools.html#comp-hets-identifying-potential-compound-heterozygotes
    Requires a PED file
    #family_id    sample_id    paternal_id    maternal_id    sex    phenotype    
    family1          1805              -­‐9                      -­‐9                      2        1                    
    family1          1847              -­‐9                      -­‐9                      1        1                    
    family1          4805              1847                  1805                  1        2                  
    4805
    1847 1805
    11

    View Slide

  27. Create a GEMINI database from a VCF
    Notes:
    1. The VCF has been normalized and decomposed with VT
    2. The VCF has been annotated with VEP.
    $  curl  https://s3.amazonaws.com/gemini-­‐tutorials/trio.trim.vep.vcf.gz  >  trio.trim.vep.vcf.gz  
    $  curl  https://s3.amazonaws.com/gemini-­‐tutorials/recessive.ped  >  recessive.ped  
    $  gemini  load  -­‐-­‐cores  4\  
                               -­‐v  trio.trim.vep.vcf.gz  \  
                               -­‐t  VEP  \  
                               -­‐-­‐skip-­‐gene-­‐tables  \  
                               -­‐p  recessive.ped  \  
    !
                     trio.trim.vep.recessive.db
    Note: copy and paste the full command from the Github Gist to avoid errors
    4805
    1847 1805
    http://gemini.readthedocs.org/en/latest/content/preprocessing.html#step-1-split-left-align-and-trim-variants
    12

    View Slide

  28. Running the comp_hets tool.
    gemini comp_hets trio.trim.vep.recessive.db
    Note: copy and paste the full command from the Github Gist
    13

    View Slide

  29. gemini comp_hets --columns
    "chrom, start, end, gene,
    impact, cadd_raw"
    trio.trim.vep.recessive.db
    chrom    start            end                gene      impact                  cadd_raw    variant_id    family_id    family_members                                                                family_genotypes          samples    family_count    comp_het_id              priority  
    chr2      69702114      69702115      AAK1      UTR_3_prime        -­‐3.52          1638                family1        1805;unaffected,1847;unaffected,4805;affected    G/C,G/C,G/C                  4805          1                          1_1638_1646              2  
    chr2      69870243      69870244      AAK1      UTR_5_prime        1.04            1646                family1        1805;unaffected,1847;unaffected,4805;affected    G/G,G/A,G|A                  4805          1                          1_1638_1646              2  
    chr2      69702108      69702109      AAK1      UTR_3_prime        -­‐2.51          1637                family1        1805;unaffected,1847;unaffected,4805;affected    A/C,A/C,A/C                  4805          1                          2_1637_1638              2  
    chr2      69702114      69702115      AAK1      UTR_3_prime        -­‐3.52          1638                family1        1805;unaffected,1847;unaffected,4805;affected    G/C,G/C,G/C                  4805          1                          2_1637_1638              2  
    chr2      69702101      69702102      AAK1      UTR_3_prime        0.39            1636                family1        1805;unaffected,1847;unaffected,4805;affected    T/T,T/T,T/C                  4805          1                          3_1636_1646              2  
    chr2      69870243      69870244      AAK1      UTR_5_prime        1.04            1646                family1        1805;unaffected,1847;unaffected,4805;affected    G/G,G/A,G|A                  4805          1                          3_1636_1646              2  
    chr2      69702114      69702115      AAK1      UTR_3_prime        -­‐3.52          1638                family1        1805;unaffected,1847;unaffected,4805;affected    G/C,G/C,G/C                  4805          1                          4_1638_1645              2  
    chr2      69870216      69870218      AAK1      UTR_5_prime        None            1645                family1        1805;unaffected,1847;unaffected,4805;affected    AT/A,AT/A,AT/A            4805          1                          4_1638_1645              2  
    chr2      69702101      69702102      AAK1      UTR_3_prime        0.39            1636                family1        1805;unaffected,1847;unaffected,4805;affected    T/T,T/T,T/C                  4805          1                          5_1636_1638              2
    Again, we can limit the attributes returned w/ the --columns option.
    Note: copy and paste the full command from the Github Gist
    14

    View Slide

  30. Start with highest priority compound heterozygote candidates
    C/C
    A/G
    A/A
    A/G
    C/T
    C/T
    15

    View Slide

  31. Start with highest priority compound heterozygote candidates
    C/C
    G|A C|T
    Both sites phasable: high confidence as
    deleterious alleles on different chromosomes
    A/G
    A/A
    A/G
    C/T
    C/T
    15

    View Slide

  32. Restrict to highest priority (i.e, priority==1) candidates
    $ gemini comp_hets \
    --columns "chrom, start, end, gene, impact, cadd_raw" \
    trio.trim.vep.recessive.db \
    | awk '$14==1' \
    | head
    chrom    start            end                gene      impact                          cadd_raw    variant_id    family_id    family_members                                                                  family_genotypes      samples    family_count    comp_het_id            priority  
    chr15    89398552      89398553      ACAN      non_syn_coding          2.57            9519                family1        1805;unaffected,1847;unaffected,4805;affected    C/A,C/C,A|C                4805          1                          52_9519_9534          1  
    chr15    89417237      89417238      ACAN      non_syn_coding          -­‐0.06          9534                family1        1805;unaffected,1847;unaffected,4805;affected    A/A,A/G,A|G                4805          1                          52_9519_9534          1  
    chr15    89398552      89398553      ACAN      non_syn_coding          2.57            9519                family1        1805;unaffected,1847;unaffected,4805;affected    C/A,C/C,A|C                4805          1                          53_9519_9535          1  
    chr15    89417628      89417629      ACAN      splice_region            -­‐0.18          9535                family1        1805;unaffected,1847;unaffected,4805;affected    G/G,G/A,G|A                4805          1                          53_9519_9535          1  
    chr2      111598957    111598958    ACOXL    non_syn_coding          0.95            3247                family1        1805;unaffected,1847;unaffected,4805;affected    C/C,C/T,C|T                4805          1                          86_3247_3251          1  
    chr2      111850514    111850515    ACOXL    non_syn_coding          3.17            3251                family1        1805;unaffected,1847;unaffected,4805;affected    C/T,C/C,T|C                4805          1                          86_3247_3251          1  
    chr17    55182877      55182878      AKAP1    non_syn_coding          2.57            13305              family1        1805;unaffected,1847;unaffected,4805;affected    C/T,C/C,T|C                4805          1                          104_13305_13308    1  
    chr17    55183316      55183317      AKAP1    synonymous_coding    0.43            13308              family1        1805;unaffected,1847;unaffected,4805;affected    G/G,G/A,G|A                4805          1                          104_13305_13308    1  
    chr2      29448409      29448410      ALK        non_syn_coding          1.49            839                  family1        1805;unaffected,1847;unaffected,4805;affected    T/T,T/G,T|G                4805          1                          186_839_841            1  
    chr2      29455198      29455199      ALK        non_syn_coding          3.51            841                  family1        1805;unaffected,1847;unaffected,4805;affected    A/T,A/A,T|A                4805          1                          186_839_841            1
    $ gemini comp_hets \
    --columns "chrom, start, end, gene, impact, cadd_raw" \
    trio.trim.vep.recessive.db \
    | awk '$14==1' \
    | wc -l
    612  lines
    Each compund heterozygote is a set of two lines, so we have 306 (612 /
    2) compound heterozygote candidates
    Note: copy and paste the full command from the Github Gist
    16

    View Slide

  33. So many candidates. Time to start --filtering!
    $ gemini comp_hets \
    --columns "chrom, start, end, gene, impact, cadd_raw" \
    --filter "impact_severity != 'LOW'" \
    trio.trim.vep.recessive.db \
    | awk '$14==1' \
    | wc -l
    260  lines  (130  comp_hets)
    Note: copy and paste the full command from the Github Gist
    17

    View Slide

  34. $ gemini comp_hets \
    --columns "chrom, start, end, gene, impact, cadd_raw" \
    --filter "impact_severity != 'LOW' \
    and ((aaf_esp_ea <= 0.01 or aaf_esp_ea is NULL) \
    and (aaf_exac_all <= 0.01 or aaf_exac_all is NULL))” \
    trio.trim.vep.recessive.db \
    | awk '$14==1' \
    | wc -l
    Use ESP and ExAC to focus on rare variants
    8  lines,  4  comp_hets
    Note: copy and paste the full command from the Github Gist
    18

    View Slide

  35. $ gemini comp_hets \
    --columns "chrom, start, end, gene, impact, cadd_raw" \
    --filter "impact_severity != 'LOW' \
    and ((aaf_esp_ea <= 0.01 or aaf_esp_ea is NULL) \
    and (aaf_exac_all <= 0.01 or aaf_exac_all is NULL))” \
    trio.trim.vep.recessive.db \
    | awk '$14==1'
    Use ESP and ExAC to focus on rare variants
    chr17    78081692      78081693      GAA                      non_syn_coding    4.33    14401    family1    1805;unaffected,1847;unaffected4805;affected    T/T,T/C,T|C    4805    1    16_14401_14406      1  
    chr17    78084748      78084749      GAA                      non_syn_coding    5.51    14406    family1    1805;unaffected,1847;unaffected4805;affected    G/A,G/G,A|G    4805    1    16_14401_14406      1  
    chr2      129025757    129025758    HS6ST1                non_syn_coding    2.67    3657      family1    1805;unaffected,1847;unaffected4805;affected    C/A,C/C,A|C    4805    1    45_3657_3660          1  
    chr2      129075743    129075744    HS6ST1                non_syn_coding    3.25    3660      family1    1805;unaffected,1847;unaffected4805;affected    G/G,G/C,G|C    4805    1    45_3657_3660          1  
    chr22    45255643      45255644      PRR5-­‐ARHGAP8    non_syn_coding    1.88    16838    family1    1805;unaffected,1847;unaffected4805;affected    G/G,G/T,G|T    4805    1    169_16838_16839    1  
    chr22    45255687      45255688      PRR5-­‐ARHGAP8    non_syn_coding    0.37    16839    family1    1805;unaffected,1847;unaffected4805;affected    C/T,C/C,T|C    4805    1    169_16838_16839    1  
    chr15    71548994      71548995      THSD4                  non_syn_coding    3.68    8777      family1    1805;unaffected,1847;unaffected4805;affected    C/C,C/T,C|T    4805    1    181_8777_8780        1  
    chr15    71952898      71952899      THSD4                  non_syn_coding    4.52    8780      family1    1805;unaffected,1847;unaffected4805;affected    G/A,G/G,A|G    4805    1    181_8777_8780        1
    Note: copy and paste the full command from the Github Gist
    19

    View Slide

  36. Load the following files into IGV (Load from URL) and inspect your candidates
    BAM alignment files:
    !
    https://s3.amazonaws.com/gemini-­‐tutorials/1805.workshop.bam  
    https://s3.amazonaws.com/gemini-­‐tutorials/1847.workshop.bam  
    https://s3.amazonaws.com/gemini-­‐tutorials/4805.workshop.bam
    VCF variant file:
    !
    https://s3.amazonaws.com/gemini-­‐tutorials/trio.trim.vep.vcf.gz  
    !
    20

    View Slide

  37. Finding recessive genes with
    GEMINI assuming
    consanguinuity
    21

    View Slide

  38. The autosomal_recessive tool.
    http://gemini.readthedocs.org/en/latest/content/tools.html#autosomal-recessive-find-variants-meeting-an-autosomal-recessive-model
    22

    View Slide

  39. The autosomal_recessive tool.
    Default behavior:
    23

    View Slide

  40. The autosomal_recessive tool.
    The -­‐-­‐min-­‐kindreds  option:
    This specifies the number of families required to have a variant in
    the same gene in order for it to be reported. For example, we may
    only be interested in candidates where at least 2 families have a
    variant in that gene.
    24

    View Slide

  41. The autosomal_recessive tool.
    -­‐-­‐filter  for variants with potential functional consequence:
    25

    View Slide

  42. The autosomal_recessive tool: other options
    The —gt-­‐pl-­‐max  option:
    In order to eliminate less confident genotypes, it is possible to
    enforce a maximum PL value for each sample. On this scale, lower
    values indicate more confidence that the called genotype is
    correct. 10 is a reasonable value:
    What is the “PL”?
    https://samtools.github.io/hts-specs/VCFv4.2.pdf
    What is a “Phred scaled” genotype likelihood?
    26

    View Slide

  43. The autosomal_recessive tool: other options
    What is a “Phred scaled” genotype likelihood? Example calculation based on the GATK HaplotypeCaller
    http://gatkforums.broadinstitute.org/discussion/5913/math-notes-how-pl-is-calculated-in-haplotypecaller
    27

    View Slide

  44. Runs of homozygosity
    Method 1: intersecting with
    previously known regions
    28

    View Slide

  45. bgzip homoz_region.bed
    tabix -p bed homoz_region.bed.gz
    Intersect with observed homozygosity region(s)
    (Example commands)
    1. Tabix a BED file with the observed homozygosity regions
    gemini annotate -f homoz_region.bed.gz \
    –c homoz_region \
    -t boolean \
    AR.db
    2. Use the annotate tool to flag variants that overlap these regions.
    gemini autosomal_recessive AR.db --columns "chrom, start, end,
    ref, alt, filter, qual, gene, impact, aaf_esp_ea, aaf_1kg_eur”
    -–filter "filter is NULL and aaf_esp_ea < 0.1 and
    (impact_severity = 'HIGH' or impact_severity = 'MED') and
    region ==1”
    3. Filter variants for those that overlap these regions.
    29

    View Slide

  46. Runs of homozygosity
    Method 2: search for runs of
    homozygosity
    30

    View Slide

  47. Intersect with observed homozygosity region(s)
    gemini roh AR.db
    Run the roh tool to search for candidate runs of homozygosity.
    31

    View Slide

  48. Intersect with observed homozygosity region(s)
    gemini roh AR.db | sort -k7nr
    Run the roh tool to search for candidate runs of homozygosity.
    chrom start end sample num_of_snps density_per_kb run_length_in_bp!
    chr2 92134506 95353861 S101 151 0.0475 3219355!
    chr2 92188685 95359306 S101 75 0.0243 3170621!
    chr2 92189814 95353861 S106 68 0.0221 3164047!
    chr2 92189814 95329940 S138 50 0.0166 3140126!
    chr2 92262183 95365848 S106 57 0.019 3103665!
    chr2 92264378 95357391 S138 39 0.0133 3093013!
    chr2 233336080 234631638 S138 2583 1.9953 1295558!
    chr2 184922511 185985238 S101 1983 1.8678 1062727!
    chr2 119980657 121036333 S106 1921 1.8216 1055676!
    chr2 184994526 186030052 S101 1956 1.8908 1035526!
    chr2 136161952 137038729 S106 1529 1.7462 876777!
    chr2 119952406 120805007 S106 1601 1.8801 852601!
    chr2 135168500 136012610 S106 1509 1.7901 844110!
    chr2 136266716 137041366 S106 1372 1.7737 774650!
    chr2 134764013 135515261 S101 1559 2.0779 751248!
    chr2 134821134 135536838 S101 1477 2.0665 715704!
    chr2 184390618 185105579 S138 1453 2.0351 714961!
    chr2 187780557 188481382 S101 1279 1.8278 700825!
    chr2 135162250 135837013 S106 1268 1.8821 674763
    32

    View Slide

  49. Caveats when screening for runs of homozygosity
    1. Difficult with exome data. Density of markers.
    2. Shorter runs of homozygosity happen often by chance.
    3. Density of homozygotes is important.
    http://www.nature.com/nature/journal/v449/n7164/extref/nature06258-s1.pdf
    33

    View Slide