Identifying recessive candidates with GEMINI

Identifying recessive gene candidates with GEMINI Please refer to the
following Github Gist to ﬁnd each command for this session. Commands should be copy/pasted from this Gist Aaron Quinlan University of Utah ! ! ! ! ! quinlanlab.org 1 https://gist.github.com/arq5x/9e1928638397ba45da2e#ﬁle-autosomal-recessive-sh

Compound heterozygote detective work with GEMINI 2

Compound het refresher 3

Example compound heterozygote Dad Mom Kid G C G t
T a C C G a C G t 4

Phasing genotypes Jessica Chong 5

The result of phasing by transmission Jessica Chong 6

“Phasing” a VCF ﬁle by transmission with GATK Jessica Chong
Jessica Chong 7

G/G A/A A/G Jessica Chong 7

G/G A/A A/G A/G A/A A/G Jessica Chong 7

G/G A/A A/G A/G A/A A/G A/A A/G A/G Jessica Chong 7

G/G A/A A/G A/G A/A A/G A/A A/G A/G A/G A/G A/G ? Jessica Chong 7

“Phasing” a VCF ﬁle by transmission with GEMINI 8

“Phasing” a VCF ﬁle by transmission with GEMINI G/G A/A
A/G 8

A/G A/G A/A A/G 8

A/G A/G A/A A/G A/A A/G A/G 8

A/G A/G A/A A/G A/A A/G A/G A/G A/G A/G ? 8

Phasing by transmission C/C * Convention for phased genotype is
maternal allele ﬁrst A/G A/A A/G C/T C/T 9

Phasing by transmission C/C G|A C|T Both sites phasable: high
conﬁdence as deleterious alleles on different chromosomes * Convention for phased genotype is maternal allele ﬁrst A/G A/A A/G C/T C/T 9

conﬁdence as deleterious alleles on different chromosomes * Convention for phased genotype is maternal allele ﬁrst A/G A/A A/G C/T C/T C/T G/A A/A A/G C/C C/T 9

conﬁdence as deleterious alleles on different chromosomes * Convention for phased genotype is maternal allele ﬁrst G|A T|C Both sites phasable: yet exclude as deleterious alleles on same chromosomes A/G A/A A/G C/T C/T C/T G/A A/A A/G C/C C/T 9

conﬁdence as deleterious alleles on different chromosomes * Convention for phased genotype is maternal allele ﬁrst G|A T|C Both sites phasable: yet exclude as deleterious alleles on same chromosomes A/G A/A A/G C/T C/T C/T G/G A/A A/G C/T C/T C/T G/A A/A A/G C/C C/T 9

confidence as deleterious alleles on different chromosomes * Convention for phased genotype is maternal allele first G|A T|C Both sites phasable: yet exclude as deleterious alleles on same chromosomes G|A ? Only one site is phasable: lower confidence but cannot necessarily exclude. A/G A/A A/G C/T C/T C/T G/G A/A A/G C/T C/T C/T G/A A/A A/G C/C C/T 9

confidence as deleterious alleles on different chromosomes * Convention for phased genotype is maternal allele first G|A T|C Both sites phasable: yet exclude as deleterious alleles on same chromosomes G|A ? Only one site is phasable: lower confidence but cannot necessarily exclude. A/G A/A A/G C/T C/T C/T G/G A/A A/G C/T C/T C/T G/A A/A A/G C/C C/T A/G A/G A/G C/T C/T C/T 9

confidence as deleterious alleles on different chromosomes * Convention for phased genotype is maternal allele first G|A T|C Both sites phasable: yet exclude as deleterious alleles on same chromosomes G|A ? Only one site is phasable: lower confidence but cannot necessarily exclude. A/G A/A A/G C/T C/T C/T G/G A/A A/G C/T C/T C/T G/A A/A A/G C/C C/T A/G A/G A/G C/T C/T C/T ? ? Neither site is phasable: lower confidence but cannot necessarily exclude (recombination?). 9

Compound het test case Jessica Chong 10

The comp_hets tool in GEMINI http://gemini.readthedocs.org/en/latest/content/tools.html#comp-hets-identifying-potential-compound-heterozygotes Requires a PED ﬁle
#family_id sample_id paternal_id maternal_id sex phenotype family1 1805 -‐9 -‐9 2 1 family1 1847 -‐9 -‐9 1 1 family1 4805 1847 1805 1 2 4805 1847 1805 11

Create a GEMINI database from a VCF Notes: 1. The
VCF has been normalized and decomposed with VT 2. The VCF has been annotated with VEP. $ curl https://s3.amazonaws.com/gemini-‐tutorials/trio.trim.vep.vcf.gz > trio.trim.vep.vcf.gz $ curl https://s3.amazonaws.com/gemini-‐tutorials/recessive.ped > recessive.ped $ gemini load -‐-‐cores 4\ -‐v trio.trim.vep.vcf.gz \ -‐t VEP \ -‐-‐skip-‐gene-‐tables \ -‐p recessive.ped \ ! trio.trim.vep.recessive.db Note: copy and paste the full command from the Github Gist to avoid errors 4805 1847 1805 http://gemini.readthedocs.org/en/latest/content/preprocessing.html#step-1-split-left-align-and-trim-variants 12

Running the comp_hets tool. gemini comp_hets trio.trim.vep.recessive.db Note: copy and
paste the full command from the Github Gist 13

gemini comp_hets --columns "chrom, start, end, gene, impact, cadd_raw" trio.trim.vep.recessive.db
chrom start end gene impact cadd_raw variant_id family_id family_members family_genotypes samples family_count comp_het_id priority chr2 69702114 69702115 AAK1 UTR_3_prime -‐3.52 1638 family1 1805;unaffected,1847;unaffected,4805;affected G/C,G/C,G/C 4805 1 1_1638_1646 2 chr2 69870243 69870244 AAK1 UTR_5_prime 1.04 1646 family1 1805;unaffected,1847;unaffected,4805;affected G/G,G/A,G|A 4805 1 1_1638_1646 2 chr2 69702108 69702109 AAK1 UTR_3_prime -‐2.51 1637 family1 1805;unaffected,1847;unaffected,4805;affected A/C,A/C,A/C 4805 1 2_1637_1638 2 chr2 69702114 69702115 AAK1 UTR_3_prime -‐3.52 1638 family1 1805;unaffected,1847;unaffected,4805;affected G/C,G/C,G/C 4805 1 2_1637_1638 2 chr2 69702101 69702102 AAK1 UTR_3_prime 0.39 1636 family1 1805;unaffected,1847;unaffected,4805;affected T/T,T/T,T/C 4805 1 3_1636_1646 2 chr2 69870243 69870244 AAK1 UTR_5_prime 1.04 1646 family1 1805;unaffected,1847;unaffected,4805;affected G/G,G/A,G|A 4805 1 3_1636_1646 2 chr2 69702114 69702115 AAK1 UTR_3_prime -‐3.52 1638 family1 1805;unaffected,1847;unaffected,4805;affected G/C,G/C,G/C 4805 1 4_1638_1645 2 chr2 69870216 69870218 AAK1 UTR_5_prime None 1645 family1 1805;unaffected,1847;unaffected,4805;affected AT/A,AT/A,AT/A 4805 1 4_1638_1645 2 chr2 69702101 69702102 AAK1 UTR_3_prime 0.39 1636 family1 1805;unaffected,1847;unaffected,4805;affected T/T,T/T,T/C 4805 1 5_1636_1638 2 Again, we can limit the attributes returned w/ the --columns option. Note: copy and paste the full command from the Github Gist 14

Start with highest priority compound heterozygote candidates C/C A/G A/A
A/G C/T C/T 15

Start with highest priority compound heterozygote candidates C/C G|A C|T
Both sites phasable: high conﬁdence as deleterious alleles on different chromosomes A/G A/A A/G C/T C/T 15

Restrict to highest priority (i.e, priority==1) candidates $ gemini comp_hets
\ --columns "chrom, start, end, gene, impact, cadd_raw" \ trio.trim.vep.recessive.db \ | awk '$14==1' \ | head chrom start end gene impact cadd_raw variant_id family_id family_members family_genotypes samples family_count comp_het_id priority chr15 89398552 89398553 ACAN non_syn_coding 2.57 9519 family1 1805;unaffected,1847;unaffected,4805;affected C/A,C/C,A|C 4805 1 52_9519_9534 1 chr15 89417237 89417238 ACAN non_syn_coding -‐0.06 9534 family1 1805;unaffected,1847;unaffected,4805;affected A/A,A/G,A|G 4805 1 52_9519_9534 1 chr15 89398552 89398553 ACAN non_syn_coding 2.57 9519 family1 1805;unaffected,1847;unaffected,4805;affected C/A,C/C,A|C 4805 1 53_9519_9535 1 chr15 89417628 89417629 ACAN splice_region -‐0.18 9535 family1 1805;unaffected,1847;unaffected,4805;affected G/G,G/A,G|A 4805 1 53_9519_9535 1 chr2 111598957 111598958 ACOXL non_syn_coding 0.95 3247 family1 1805;unaffected,1847;unaffected,4805;affected C/C,C/T,C|T 4805 1 86_3247_3251 1 chr2 111850514 111850515 ACOXL non_syn_coding 3.17 3251 family1 1805;unaffected,1847;unaffected,4805;affected C/T,C/C,T|C 4805 1 86_3247_3251 1 chr17 55182877 55182878 AKAP1 non_syn_coding 2.57 13305 family1 1805;unaffected,1847;unaffected,4805;affected C/T,C/C,T|C 4805 1 104_13305_13308 1 chr17 55183316 55183317 AKAP1 synonymous_coding 0.43 13308 family1 1805;unaffected,1847;unaffected,4805;affected G/G,G/A,G|A 4805 1 104_13305_13308 1 chr2 29448409 29448410 ALK non_syn_coding 1.49 839 family1 1805;unaffected,1847;unaffected,4805;affected T/T,T/G,T|G 4805 1 186_839_841 1 chr2 29455198 29455199 ALK non_syn_coding 3.51 841 family1 1805;unaffected,1847;unaffected,4805;affected A/T,A/A,T|A 4805 1 186_839_841 1 $ gemini comp_hets \ --columns "chrom, start, end, gene, impact, cadd_raw" \ trio.trim.vep.recessive.db \ | awk '$14==1' \ | wc -l 612 lines Each compund heterozygote is a set of two lines, so we have 306 (612 / 2) compound heterozygote candidates Note: copy and paste the full command from the Github Gist 16

So many candidates. Time to start --filtering! $ gemini comp_hets
\ --columns "chrom, start, end, gene, impact, cadd_raw" \ --filter "impact_severity != 'LOW'" \ trio.trim.vep.recessive.db \ | awk '$14==1' \ | wc -l 260 lines (130 comp_hets) Note: copy and paste the full command from the Github Gist 17

$ gemini comp_hets \ --columns "chrom, start, end, gene, impact,
cadd_raw" \ --filter "impact_severity != 'LOW' \ and ((aaf_esp_ea <= 0.01 or aaf_esp_ea is NULL) \ and (aaf_exac_all <= 0.01 or aaf_exac_all is NULL))” \ trio.trim.vep.recessive.db \ | awk '$14==1' \ | wc -l Use ESP and ExAC to focus on rare variants 8 lines, 4 comp_hets Note: copy and paste the full command from the Github Gist 18

$ gemini comp_hets \ --columns "chrom, start, end, gene, impact,
cadd_raw" \ --filter "impact_severity != 'LOW' \ and ((aaf_esp_ea <= 0.01 or aaf_esp_ea is NULL) \ and (aaf_exac_all <= 0.01 or aaf_exac_all is NULL))” \ trio.trim.vep.recessive.db \ | awk '$14==1' Use ESP and ExAC to focus on rare variants chr17 78081692 78081693 GAA non_syn_coding 4.33 14401 family1 1805;unaffected,1847;unaffected4805;affected T/T,T/C,T|C 4805 1 16_14401_14406 1 chr17 78084748 78084749 GAA non_syn_coding 5.51 14406 family1 1805;unaffected,1847;unaffected4805;affected G/A,G/G,A|G 4805 1 16_14401_14406 1 chr2 129025757 129025758 HS6ST1 non_syn_coding 2.67 3657 family1 1805;unaffected,1847;unaffected4805;affected C/A,C/C,A|C 4805 1 45_3657_3660 1 chr2 129075743 129075744 HS6ST1 non_syn_coding 3.25 3660 family1 1805;unaffected,1847;unaffected4805;affected G/G,G/C,G|C 4805 1 45_3657_3660 1 chr22 45255643 45255644 PRR5-‐ARHGAP8 non_syn_coding 1.88 16838 family1 1805;unaffected,1847;unaffected4805;affected G/G,G/T,G|T 4805 1 169_16838_16839 1 chr22 45255687 45255688 PRR5-‐ARHGAP8 non_syn_coding 0.37 16839 family1 1805;unaffected,1847;unaffected4805;affected C/T,C/C,T|C 4805 1 169_16838_16839 1 chr15 71548994 71548995 THSD4 non_syn_coding 3.68 8777 family1 1805;unaffected,1847;unaffected4805;affected C/C,C/T,C|T 4805 1 181_8777_8780 1 chr15 71952898 71952899 THSD4 non_syn_coding 4.52 8780 family1 1805;unaffected,1847;unaffected4805;affected G/A,G/G,A|G 4805 1 181_8777_8780 1 Note: copy and paste the full command from the Github Gist 19

Load the following files into IGV (Load from URL) and
inspect your candidates BAM alignment files: ! https://s3.amazonaws.com/gemini-‐tutorials/1805.workshop.bam https://s3.amazonaws.com/gemini-‐tutorials/1847.workshop.bam https://s3.amazonaws.com/gemini-‐tutorials/4805.workshop.bam VCF variant file: ! https://s3.amazonaws.com/gemini-‐tutorials/trio.trim.vep.vcf.gz ! 20

Finding recessive genes with GEMINI assuming consanguinuity 21

The autosomal_recessive tool. http://gemini.readthedocs.org/en/latest/content/tools.html#autosomal-recessive-ﬁnd-variants-meeting-an-autosomal-recessive-model 22

The autosomal_recessive tool. Default behavior: 23

The autosomal_recessive tool. The -‐-‐min-‐kindreds option: This speciﬁes the number
of families required to have a variant in the same gene in order for it to be reported. For example, we may only be interested in candidates where at least 2 families have a variant in that gene. 24

The autosomal_recessive tool. -‐-‐filter for variants with potential functional consequence:
25

The autosomal_recessive tool: other options The —gt-‐pl-‐max option: In order
to eliminate less conﬁdent genotypes, it is possible to enforce a maximum PL value for each sample. On this scale, lower values indicate more conﬁdence that the called genotype is correct. 10 is a reasonable value: What is the “PL”? https://samtools.github.io/hts-specs/VCFv4.2.pdf What is a “Phred scaled” genotype likelihood? 26

The autosomal_recessive tool: other options What is a “Phred scaled”
genotype likelihood? Example calculation based on the GATK HaplotypeCaller http://gatkforums.broadinstitute.org/discussion/5913/math-notes-how-pl-is-calculated-in-haplotypecaller 27

Runs of homozygosity Method 1: intersecting with previously known regions
28

bgzip homoz_region.bed tabix -p bed homoz_region.bed.gz Intersect with observed homozygosity
region(s) (Example commands) 1. Tabix a BED ﬁle with the observed homozygosity regions gemini annotate -f homoz_region.bed.gz \ –c homoz_region \ -t boolean \ AR.db 2. Use the annotate tool to ﬂag variants that overlap these regions. gemini autosomal_recessive AR.db --columns "chrom, start, end, ref, alt, filter, qual, gene, impact, aaf_esp_ea, aaf_1kg_eur” -–filter "filter is NULL and aaf_esp_ea < 0.1 and (impact_severity = 'HIGH' or impact_severity = 'MED') and region ==1” 3. Filter variants for those that overlap these regions. 29

Runs of homozygosity Method 2: search for runs of homozygosity
30

Intersect with observed homozygosity region(s) gemini roh AR.db Run the
roh tool to search for candidate runs of homozygosity. 31

Intersect with observed homozygosity region(s) gemini roh AR.db | sort
-k7nr Run the roh tool to search for candidate runs of homozygosity. chrom start end sample num_of_snps density_per_kb run_length_in_bp! chr2 92134506 95353861 S101 151 0.0475 3219355! chr2 92188685 95359306 S101 75 0.0243 3170621! chr2 92189814 95353861 S106 68 0.0221 3164047! chr2 92189814 95329940 S138 50 0.0166 3140126! chr2 92262183 95365848 S106 57 0.019 3103665! chr2 92264378 95357391 S138 39 0.0133 3093013! chr2 233336080 234631638 S138 2583 1.9953 1295558! chr2 184922511 185985238 S101 1983 1.8678 1062727! chr2 119980657 121036333 S106 1921 1.8216 1055676! chr2 184994526 186030052 S101 1956 1.8908 1035526! chr2 136161952 137038729 S106 1529 1.7462 876777! chr2 119952406 120805007 S106 1601 1.8801 852601! chr2 135168500 136012610 S106 1509 1.7901 844110! chr2 136266716 137041366 S106 1372 1.7737 774650! chr2 134764013 135515261 S101 1559 2.0779 751248! chr2 134821134 135536838 S101 1477 2.0665 715704! chr2 184390618 185105579 S138 1453 2.0351 714961! chr2 187780557 188481382 S101 1279 1.8278 700825! chr2 135162250 135837013 S106 1268 1.8821 674763 32

Caveats when screening for runs of homozygosity 1. Difﬁcult with
exome data. Density of markers. 2. Shorter runs of homozygosity happen often by chance. 3. Density of homozygotes is important. http://www.nature.com/nature/journal/v449/n7164/extref/nature06258-s1.pdf 33

Identifying recessive candidates with GEMINI

Identifying recessive candidates with GEMINI

More Decks by Aaron Quinlan

Other Decks in Science

Featured

Transcript