Slide 1

Slide 1 text

Identifying de novo mutations with GEMINI
Aaron Quinlan, University of Utah
quinlanlab.org
Please refer to the following GitHub Gist to find each command for this session. Commands should be copied and pasted from this Gist:
https://gist.github.com/arq5x/9e1928638397ba45da2e#file-denovo-sh

Slide 2

Slide 2 text

Automated tools for disease inheritance models

Slide 3

Slide 3 text

Automated tools for disease inheritance models

Slide 4

Slide 4 text

Automated tools for disease inheritance models

Slide 5

Slide 5 text

Common options for disease model tools.

Slide 6

Slide 6 text

Why search for de novo mutations?
Brian O’Roak

Slide 7

Slide 7 text

High impact variants
Brian O’Roak

Slide 8

Slide 8 text

De novo mutations

Slide 9

Slide 9 text

How many de novo mutations should we expect?

Slide 10

Slide 10 text

De novo mutations (rough expectations)
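A rough expectation can be worked out as 2 × (mutation rate per base per generation) × (number of haploid bases considered). The sketch below does that arithmetic. The rate and genome sizes are illustrative assumptions, not values taken from this slide, and `expected_de_novo` is a hypothetical helper name.

```shell
#!/bin/sh
# Back-of-the-envelope expected count of de novo single-nucleotide mutations.
# ASSUMPTIONS (not from the slide): ~1.2e-8 mutations per bp per generation,
# ~3.1e9 haploid bases for a genome, ~30e6 haploid bases of coding sequence.
expected_de_novo() {
    # $1 = mutation rate per bp per generation
    # $2 = number of haploid bases considered (doubled for the two copies)
    awk -v mu="$1" -v bases="$2" 'BEGIN { printf "%.1f\n", 2 * mu * bases }'
}

echo "whole genome: $(expected_de_novo 1.2e-8 3.1e9)"
echo "coding only:  $(expected_de_novo 1.2e-8 30e6)"
```

Under these assumptions the expectation is tens of events genome-wide and on the order of one in coding sequence, which is why hundreds of raw candidates in a single trio should raise suspicion.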

Slide 11

Slide 11 text

In practice, it’s not so simple.
Brian O’Roak

Slide 12

Slide 12 text


Slide 13

Slide 13 text

Why are there so many artifacts?
• Prior probabilities - the more interesting something is, the less likely it is to be real
• If something can go wrong, it will.
• Incorrect genotype assignment
• Low coverage in one or more of the individuals in the family (especially the parents…why?)
• Mismapping
• Misalignment
• Paralogy
• Systematic artifacts
• Somatic events

Slide 14

Slide 14 text

Detective work with GEMINI

Slide 15

Slide 15 text

The de_novo tool in GEMINI
http://gemini.readthedocs.org/en/latest/content/tools.html#de-novo-identifying-potential-de-novo-mutations

Slide 16

Slide 16 text

Create a GEMINI database from a VCF
Notes:
1. The VCF has been normalized and decomposed with VT.
2. The VCF has been annotated with VEP.
$ curl https://s3.amazonaws.com/gemini-tutorials/trio.trim.vep.vcf.gz > trio.trim.vep.vcf.gz
$ curl https://s3.amazonaws.com/gemini-tutorials/denovo.ped > denovo.ped
$ gemini load --cores 4 \
      -v trio.trim.vep.vcf.gz \
      -t VEP \
      --skip-gene-tables --skip-cadd --skip-gerp-bp \
      -p denovo.ped \
      trio.trim.vep.denovo.db
Note: copy and paste the full command from the GitHub Gist to avoid errors.
~8 minutes
http://gemini.readthedocs.org/en/latest/content/preprocessing.html#step-1-split-left-align-and-trim-variants

Slide 17

Slide 17 text

Normalization and decomposition are required preprocessing steps
Variant decomposition: http://genome.sph.umich.edu/wiki/Vt#Decompose
Variant normalization: http://genome.sph.umich.edu/wiki/File:Normalization_mnp.png
Details can be found in the GEMINI documentation: http://gemini.readthedocs.org/en/latest/content/preprocessing.html#preprocessing-and-loading-a-vcf-file-into-gemini
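To make "decomposition" concrete, here is a toy sketch that splits a record with several ALT alleles into one record per allele. It is only an illustration of the idea: real tools such as vt also rewrite genotypes and INFO fields, which this sketch does not attempt. Normalization (left-alignment and trimming) is not shown. The input record and the helper name `decompose_alts` are hypothetical.

```shell
#!/bin/sh
# Toy decomposition: emit one line per ALT allele (column 5 of a VCF-like,
# tab-separated record). Header lines starting with '#' pass through unchanged.
# This is NOT how vt is implemented; it ignores genotype and INFO fields.
decompose_alts() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/  { print; next }
         {
             n = split($5, alts, ",")
             for (i = 1; i <= n; i++) { $5 = alts[i]; print }
         }'
}

# Hypothetical multiallelic record: CHROM POS ID REF ALT
printf 'chr1\t100\t.\tA\tC,G\n' | decompose_alts
```

Each resulting line carries a single REF/ALT pair, which is the one-allele-per-row shape this workflow expects before loading.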

Slide 18

Slide 18 text

Running the de_novo tool
$ gemini de_novo trio.trim.vep.denovo.db
Note: copy and paste the full command from the GitHub Gist

Slide 19

Slide 19 text

Information overload
There are currently 115 columns in the variants table. Perhaps a bit of overkill for a typical analysis.
http://gemini.readthedocs.org/en/latest/content/database_schema.html#the-variants-table

Slide 20

Slide 20 text

Limit the attributes returned with the --columns option.
$ gemini de_novo \
      --columns "chrom, start, end, ref, alt, \
                 filter, qual, gene, impact" \
      trio.trim.vep.denovo.db
Note: copy and paste the full command from the GitHub Gist

Slide 21

Slide 21 text

Limit the attributes returned with the --columns option.
http://gemini.readthedocs.org/en/latest/content/tools.html#common-args-common-arguments
$ gemini de_novo \
      --columns "chrom, start, end, ref, alt, \
                 filter, qual, gene, impact" \
      trio.trim.vep.denovo.db
Note: copy and paste the full command from the GitHub Gist

Slide 22

Slide 22 text

Better, but there are still so many (likely false) candidates.
$ gemini de_novo \
      --columns "chrom, start, end, ref, alt, \
                 filter, qual, gene, impact" \
      trio.trim.vep.denovo.db | wc -l
Note: copy and paste the full command from the GitHub Gist
771 candidates!
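Note that `wc -l` counts every line of output, including any header row. The sketch below counts only data rows, assuming the tool prints exactly one header line (the example output later in this deck starts with one). Whether the counts quoted on the slides already account for the header is not stated. `count_candidates` is a hypothetical helper, and the sample input is a stand-in for real output.

```shell
#!/bin/sh
# Count candidate rows, skipping the first line on the ASSUMPTION that it is
# a single header row. Prints 0 for empty input.
count_candidates() {
    awk 'NR > 1 { n++ } END { print n + 0 }'
}

# Stand-in for real tool output: one header line plus two data rows.
printf 'chrom\tstart\nchr2\t96525735\nchr22\t43027436\n' | count_candidates

# With GEMINI installed, the same helper could replace `wc -l`, e.g.:
#   gemini de_novo trio.trim.vep.denovo.db | count_candidates
```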

Slide 23

Slide 23 text

Causes of erroneous genotype predictions: lack of depth

Slide 24

Slide 24 text

Let’s enforce a minimum sequence depth for each subject: -d
$ gemini de_novo \
      --columns "chrom, start, end, ref, alt, \
                 filter, qual, gene, impact" \
      -d 15 \
      trio.trim.vep.denovo.db | wc -l
Note: copy and paste the full command from the GitHub Gist
676 candidates
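Why depth matters, especially in the parents: if a parent is truly heterozygous but every read happens to sample the reference allele, the parent is called homozygous reference and an inherited variant looks de novo. The sketch below computes that chance as 0.5^depth. This is an idealized model that assumes each read samples either allele independently with equal probability and ignores sequencing error and mapping bias. `p_missed_het` is a hypothetical helper name.

```shell
#!/bin/sh
# Probability that a true heterozygote shows zero alternate-allele reads
# at a given depth, under the idealized ASSUMPTION of independent reads
# that sample each allele with probability 0.5.
p_missed_het() {
    awk -v d="$1" 'BEGIN { printf "%.6g\n", 0.5 ^ d }'
}

for depth in 5 10 15 30; do
    printf 'depth %2d: %s\n' "$depth" "$(p_missed_het "$depth")"
done
```

The probability halves with every additional read, so a modest per-sample threshold such as the `-d 15` used here removes most candidates that arise purely from undersampling a parent.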

Slide 25

Slide 25 text

Causes of erroneous genotype predictions: low quality variants

Slide 26

Slide 26 text

Require that the mutation passes GATK QC with --filter
$ gemini de_novo \
      --columns "chrom, start, end, ref, alt, \
                 filter, qual, gene, impact" \
      -d 15 \
      --filter "filter is NULL" \
      trio.trim.vep.denovo.db | wc -l
Note: copy and paste the full command from the GitHub Gist
55 candidates

Slide 27

Slide 27 text

Require that the mutation is likely to have functional consequence

Slide 28

Slide 28 text

Require that the mutation is likely to have functional consequence
$ gemini de_novo \
      --columns "chrom, start, end, ref, alt, \
                 filter, qual, gene, impact" \
      -d 15 \
      --filter "filter is NULL and impact_severity != 'LOW'" \
      trio.trim.vep.denovo.db | wc -l
Note: copy and paste the full command from the GitHub Gist
13 candidates
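The filter expression is passed to the tool as a quoted string, so typographic quotes or hyphens picked up when copying from slides will break it. As a hedged sketch, the check below flags a command string that contains curly quotes, a soft hyphen, or a Unicode hyphen before it is run. `has_typographic_chars` is a hypothetical helper, and the list of characters it checks is an assumption about what slide software tends to insert.

```shell
#!/bin/sh
# Return success (0) if the argument contains characters that commonly sneak
# in when copying commands from slides. The characters are built from their
# UTF-8 bytes so this script itself stays plain ASCII.
has_typographic_chars() {
    lsq=$(printf '\342\200\230')   # U+2018 left single quotation mark
    rsq=$(printf '\342\200\231')   # U+2019 right single quotation mark
    ldq=$(printf '\342\200\234')   # U+201C left double quotation mark
    rdq=$(printf '\342\200\235')   # U+201D right double quotation mark
    shy=$(printf '\302\255')       # U+00AD soft hyphen
    hyp=$(printf '\342\200\220')   # U+2010 hyphen
    case "$1" in
        *"$lsq"*|*"$rsq"*|*"$ldq"*|*"$rdq"*|*"$shy"*|*"$hyp"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Keeping the filter in a variable makes it easy to check and reuse.
FILTER="filter is NULL and impact_severity != 'LOW'"
if has_typographic_chars "$FILTER"; then
    echo "fix quotes/hyphens before running"
else
    echo "filter looks clean"
fi
```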

Slide 29

Slide 29 text

Require that the mutation is not likely to be a known polymorphism

Slide 30

Slide 30 text

Require that the mutation is not likely to be a known polymorphism
Note: copy and paste the full command from the GitHub Gist
$ gemini de_novo \
      --columns "chrom, start, end, ref, alt, \
                 filter, qual, gene, impact" \
      -d 15 \
      --filter "filter is NULL \
                and is_coding = 1 and impact_severity != 'LOW' \
                and (aaf_1kg_eur <= 0.005 or aaf_1kg_eur is NULL) \
                and (aaf_esp_ea <= 0.005 or aaf_esp_ea is NULL)" \
      trio.trim.vep.denovo.db | wc -l
6 candidates!

Slide 31

Slide 31 text

6 candidates. Which is causal? Requires manual inspection…
chrom  start     end       ref  alt  filter  qual     gene      impact          variant_id  family_id  family_members  family_genotypes  samples  family_count
chr2   96525735  96525736  T    C    None    1929.31  ANKRD36C  non_syn_coding  2537        family1    1805,1847,4805  T/T,T/T,T/C       4805     1
chr2   96525749  96525750  T    A    None    1513.36  ANKRD36C  non_syn_coding  2538        family1    1805,1847,4805  T/T,T/T,T/A       4805     1
chr2   96525754  96525755  A    T    None    1699.28  ANKRD36C  non_syn_coding  2539        family1    1805,1847,4805  A/A,A/A,A/T       4805     1
chr15  41229630  41229631  T    G    None    2116.49  DLL4      non_syn_coding  7892        family1    1805,1847,4805  T/T,T/T,T/G       4805     1
chr17  55183812  55183813  A    G    None    2155.84  AKAP1     non_syn_coding  13311       family1    1805,1847,4805  A/A,A/A,A/G       4805     1
chr22  43027436  43027437  C    T    None    1320.03  CYB5R3    non_syn_coding  16718       family1    1805,1847,4805  C/C,C/C,C/T       4805     1
Phenotype: blue skin disease
$ gemini de_novo \
      --columns "chrom, start, end, ref, alt, \
                 filter, qual, gene, impact" \
      -d 15 \
      --filter "filter is NULL \
                and is_coding = 1 and impact_severity != 'LOW' \
                and (aaf_1kg_eur <= 0.005 or aaf_1kg_eur is NULL) \
                and (aaf_esp_ea <= 0.005 or aaf_esp_ea is NULL)" \
      trio.trim.vep.denovo.db
Which gene can we rule out at a glance?
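One possible heuristic for the "at a glance" question, offered as a sketch rather than as the deck's stated answer: several de novo calls packed into one gene are more likely to reflect mismapping or paralogy (both on the artifact list earlier) than multiple independent mutations. The snippet tallies candidates per gene using the chrom, start, and gene values from the output above. `genes_with_clusters` is a hypothetical helper name.

```shell
#!/bin/sh
# chrom, start, gene for the six candidates shown in the output above.
candidates='chr2 96525735 ANKRD36C
chr2 96525749 ANKRD36C
chr2 96525754 ANKRD36C
chr15 41229630 DLL4
chr17 55183812 AKAP1
chr22 43027436 CYB5R3'

# Print each gene that carries more than one candidate, with its count.
genes_with_clusters() {
    awk '{ n[$3]++ } END { for (g in n) if (n[g] > 1) print g, n[g] }'
}

printf '%s\n' "$candidates" | genes_with_clusters
```

A gene flagged this way still deserves a look at the alignments before it is dismissed, since the tally is only a prompt for inspection.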

Slide 32

Slide 32 text

Load the following files into IGV (Load from URL) and inspect your candidates
BAM alignment files:
https://s3.amazonaws.com/gemini-tutorials/1805.workshop.bam
https://s3.amazonaws.com/gemini-tutorials/1847.workshop.bam
https://s3.amazonaws.com/gemini-tutorials/4805.workshop.bam
VCF variant file:
https://s3.amazonaws.com/gemini-tutorials/trio.trim.vep.vcf.gz
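To jump to a candidate in IGV, it helps to have a locus string with some flanking sequence for the search box. The sketch below builds one from a candidate's chrom and end values. It assumes the `end` column is the 1-based position of a single-base variant (with `start` being 0-based); `igv_locus` and the 50 bp padding are hypothetical choices for illustration.

```shell
#!/bin/sh
# Build a "chrom:from-to" locus string centered on a variant position.
#   $1 = chromosome, $2 = 1-based position (ASSUMED to be the `end` column),
#   $3 = padding in bases on each side
igv_locus() {
    awk -v c="$1" -v pos="$2" -v pad="$3" 'BEGIN {
        from = pos - pad
        if (from < 1) from = 1      # do not run off the start of the chromosome
        printf "%s:%d-%d\n", c, from, pos + pad
    }'
}

# The chr22 candidate from the output table (end = 43027437).
igv_locus chr22 43027437 50
```

The resulting string can be pasted into the IGV locus box once the three BAM files and the VCF are loaded.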