Identifying de novo mutations with GEMINI

Aaron Quinlan
August 18, 2015

Transcript

  1. Identifying de novo mutations with GEMINI
    Please refer to the following GitHub Gist to find each command for this session.
    Commands should be copied and pasted from this Gist:
    https://gist.github.com/arq5x/9e1928638397ba45da2e#file-denovo-sh
    Aaron Quinlan
    University of Utah
    quinlanlab.org

  2. Automated tools for disease inheritance models

  3. Automated tools for disease inheritance models

  4. Automated tools for disease inheritance models

  5. Common options for disease model tools.

  6. Why search for de novo mutations?
    Brian O’Roak

  7. High impact variants
    Brian O’Roak

  8. De novo mutations

  9. How many de novo mutations should we expect?

  10. De novo mutations (rough expectations)

  11. In practice, it’s not so simple.
    Brian O’Roak

  12. 11


  13. Why are there so many artifacts?
    • Prior probabilities: the more interesting something is, the less likely it is to be real.
    • If something can go wrong, it will.
    • Incorrect genotype assignment
    • Low coverage in one or more of the individuals in the family (especially the parents… why?)
    • Mismapping
    • Misalignment
    • Paralogy
    • Systematic artifacts
    • Somatic events

  14. Detective work with GEMINI

  15. The de_novo tool in GEMINI
    http://gemini.readthedocs.org/en/latest/content/tools.html#de-novo-identifying-potential-de-novo-mutations

  16. Create a GEMINI database from a VCF
    Notes:
    1. The VCF has been normalized and decomposed with vt.
    2. The VCF has been annotated with VEP.
    $ curl https://s3.amazonaws.com/gemini-tutorials/trio.trim.vep.vcf.gz > trio.trim.vep.vcf.gz
    $ curl https://s3.amazonaws.com/gemini-tutorials/denovo.ped > denovo.ped
    $ gemini load --cores 4 \
                  -v trio.trim.vep.vcf.gz \
                  -t VEP \
                  --skip-gene-tables --skip-cadd --skip-gerp-bp \
                  -p denovo.ped \
                  trio.trim.vep.denovo.db
    Note: copy and paste the full command from the GitHub Gist to avoid errors.
    ~8 minutes
    http://gemini.readthedocs.org/en/latest/content/preprocessing.html#step-1-split-left-align-and-trim-variants

  17. Normalization and decomposition are required preprocessing steps
    Variant decomposition
    http://genome.sph.umich.edu/wiki/Vt#Decompose
    Variant normalization
    http://genome.sph.umich.edu/wiki/File:Normalization_mnp.png
    Details can be found in the GEMINI documentation:
    http://gemini.readthedocs.org/en/latest/content/preprocessing.html#preprocessing-and-loading-a-vcf-file-into-gemini

  18. Running the de_novo tool
    $ gemini de_novo trio.trim.vep.denovo.db
    Note: copy and paste the full command from the GitHub Gist.

  19. Information overload
    There are currently 115 columns in the variants table. Perhaps a bit of overkill for a typical analysis.
    http://gemini.readthedocs.org/en/latest/content/database_schema.html#the-variants-table

  20. Limit the attributes returned with the --columns option.
    $ gemini de_novo \
        --columns "chrom, start, end, ref, alt, \
                   filter, qual, gene, impact" \
        trio.trim.vep.denovo.db
    Note: copy and paste the full command from the GitHub Gist.

  21. Limit the attributes returned with the --columns option.
    http://gemini.readthedocs.org/en/latest/content/tools.html#common-args-common-arguments
    $ gemini de_novo \
        --columns "chrom, start, end, ref, alt, \
                   filter, qual, gene, impact" \
        trio.trim.vep.denovo.db
    Note: copy and paste the full command from the GitHub Gist.

  22. Better, but there are still so many (likely false) candidates.
    $ gemini de_novo \
        --columns "chrom, start, end, ref, alt, \
                   filter, qual, gene, impact" \
        trio.trim.vep.denovo.db | wc -l
    Note: copy and paste the full command from the GitHub Gist.
    771 candidates!
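
The de_novo output shown later in this deck starts with a header row, so a raw `wc -l` can report one line more than the number of candidate rows. The slides do not say whether their counts include that row. A minimal sketch of counting only the data rows, using a stand-in function (hypothetical, tab-separated output assumed) in place of the real `gemini de_novo` call:

```shell
# Stand-in for `gemini de_novo ... trio.trim.vep.denovo.db` (hypothetical helper):
# a header row plus two candidate rows copied from the final results slide,
# trimmed to a few columns. Tab-separated output is an assumption here.
de_novo_output() {
  printf 'chrom\tstart\tend\tref\talt\tgene\n'
  printf 'chr15\t41229630\t41229631\tT\tG\tDLL4\n'
  printf 'chr22\t43027436\t43027437\tC\tT\tCYB5R3\n'
}

# Raw line count: includes the header row.
total_lines=$(de_novo_output | wc -l | tr -d ' ')

# Skip the first line so that only candidate rows are counted.
candidates=$(de_novo_output | tail -n +2 | wc -l | tr -d ' ')

echo "lines: $total_lines, candidates: $candidates"
```

With the real tool, the same `tail -n +2 | wc -l` pipeline would replace the plain `| wc -l` at the end of the command.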

  23. Causes of erroneous genotype predictions: lack of depth

  24. Let’s enforce a minimum sequence depth for each subject: -d
    $ gemini de_novo \
        --columns "chrom, start, end, ref, alt, \
                   filter, qual, gene, impact" \
        -d 15 \
        trio.trim.vep.denovo.db | wc -l
    Note: copy and paste the full command from the GitHub Gist.
    676 candidates

  25. Causes of erroneous genotype predictions: low quality variants

  26. Require that the mutation passes GATK QC with --filter
    $ gemini de_novo \
        --columns "chrom, start, end, ref, alt, \
                   filter, qual, gene, impact" \
        -d 15 \
        --filter "filter is NULL" \
        trio.trim.vep.denovo.db | wc -l
    Note: copy and paste the full command from the GitHub Gist.
    55 candidates

  27. Require that the mutation is likely to have functional consequence

  28. Require that the mutation is likely to have functional consequence
    $ gemini de_novo \
        --columns "chrom, start, end, ref, alt, \
                   filter, qual, gene, impact" \
        -d 15 \
        --filter "filter is NULL and impact_severity != 'LOW'" \
        trio.trim.vep.denovo.db | wc -l
    Note: copy and paste the full command from the GitHub Gist.
    13 candidates

  29. Require that the mutation is not likely to be a known polymorphism

  30. Require that the mutation is not likely to be a known polymorphism
    Note: copy and paste the full command from the GitHub Gist.
    $ gemini de_novo \
        --columns "chrom, start, end, ref, alt, \
                   filter, qual, gene, impact" \
        -d 15 \
        --filter "filter is NULL \
                  and is_coding = 1 and impact_severity != 'LOW' \
                  and (aaf_1kg_eur <= 0.005 or aaf_1kg_eur is NULL) \
                  and (aaf_esp_ea <= 0.005 or aaf_esp_ea is NULL)" \
        trio.trim.vep.denovo.db | wc -l
    6 candidates!
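
To see which filter does most of the work, the candidate counts reported on the preceding slides (771, 676, 55, 13, 6) can be put side by side. A rough sketch that does only this arithmetic; the step labels are shorthand introduced here, not GEMINI output:

```shell
# Candidate counts reported on the slides after each successive filter.
# Labels are shorthand for the options added at each step.
funnel=$(awk 'BEGIN {
  n = split("no extra filters:771,-d 15:676,filter is NULL:55,impact_severity != LOW:13,coding and rare (AAF <= 0.005):6", steps, ",")
  prev = 0
  for (i = 1; i <= n; i++) {
    split(steps[i], kv, ":")   # kv[1] = label, kv[2] = count
    if (prev > 0)
      printf "%-32s %4d  (%.1f%% of previous step)\n", kv[1], kv[2], 100 * kv[2] / prev
    else
      printf "%-32s %4d\n", kv[1], kv[2]
    prev = kv[2]
  }
}')
printf '%s\n' "$funnel"
```

By these numbers the largest single drop comes from requiring `filter is NULL`, which keeps 55 of the 676 depth-filtered candidates.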

  31. 6 candidates. Which is causal? Requires manual inspection…
    $ gemini de_novo \
        --columns "chrom, start, end, ref, alt, \
                   filter, qual, gene, impact" \
        -d 15 \
        --filter "filter is NULL \
                  and is_coding = 1 and impact_severity != 'LOW' \
                  and (aaf_1kg_eur <= 0.005 or aaf_1kg_eur is NULL) \
                  and (aaf_esp_ea <= 0.005 or aaf_esp_ea is NULL)" \
        trio.trim.vep.denovo.db
    chrom  start     end       ref  alt  filter  qual     gene      impact          variant_id  family_id  family_members  family_genotypes  samples  family_count
    chr2   96525735  96525736  T    C    None    1929.31  ANKRD36C  non_syn_coding  2537        family1    1805,1847,4805  T/T,T/T,T/C       4805     1
    chr2   96525749  96525750  T    A    None    1513.36  ANKRD36C  non_syn_coding  2538        family1    1805,1847,4805  T/T,T/T,T/A       4805     1
    chr2   96525754  96525755  A    T    None    1699.28  ANKRD36C  non_syn_coding  2539        family1    1805,1847,4805  A/A,A/A,A/T       4805     1
    chr15  41229630  41229631  T    G    None    2116.49  DLL4      non_syn_coding  7892        family1    1805,1847,4805  T/T,T/T,T/G       4805     1
    chr17  55183812  55183813  A    G    None    2155.84  AKAP1     non_syn_coding  13311       family1    1805,1847,4805  A/A,A/A,A/G       4805     1
    chr22  43027436  43027437  C    T    None    1320.03  CYB5R3    non_syn_coding  16718       family1    1805,1847,4805  C/C,C/C,C/T       4805     1
    Phenotype: blue skin disease
    Which gene can we rule out at a glance?
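
One quick check before opening IGV is to tally candidates per gene: several calls packed into one gene within a few base pairs are worth a closer look, given the artifact sources listed earlier (mismapping, paralogy). A minimal sketch using the six rows above, reduced to three columns; tab-separated input with a header containing a `gene` column is assumed:

```shell
# The six candidate rows from the slide, reduced to chrom, start and gene.
# printf reuses the three-field format for each group of three arguments.
candidates=$(printf '%s\t%s\t%s\n' \
  chrom start gene \
  chr2 96525735 ANKRD36C \
  chr2 96525749 ANKRD36C \
  chr2 96525754 ANKRD36C \
  chr15 41229630 DLL4 \
  chr17 55183812 AKAP1 \
  chr22 43027436 CYB5R3)

# Locate the "gene" column from the header row, count rows per gene,
# then list the genes with the most candidates first.
per_gene=$(printf '%s\n' "$candidates" | awk -F '\t' '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "gene") col = i; next }
  { count[$col]++ }
  END { for (g in count) print count[g], g }' | sort -k1,1nr -k2,2)

printf '%s\n' "$per_gene"
```

With the real tool, the output of the `gemini de_novo` command could be piped into the same `awk` step instead of the hard-coded rows. Here three of the six candidates fall in ANKRD36C, all within 20 bp of each other.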

  32. Load the following files into IGV (Load from URL) and inspect your candidates
    BAM alignment files:
    https://s3.amazonaws.com/gemini-tutorials/1805.workshop.bam
    https://s3.amazonaws.com/gemini-tutorials/1847.workshop.bam
    https://s3.amazonaws.com/gemini-tutorials/4805.workshop.bam
    VCF variant file:
    https://s3.amazonaws.com/gemini-tutorials/trio.trim.vep.vcf.gz