Mining genetic variation in any species with GEMINI

Aaron Quinlan
January 10, 2016

Plant and Animal Genomics
January 9, 2016

Transcript

  1. Aaron Quinlan
    University of Utah
    quinlanlab.org
    @aaronquinlan

    Plant and Animal Genome Conference
    January 9, 2016


    Mining genetic variation in any species
    with GEMINI

  2. Origins of GEMINI:
    Genetics of hypersensitivity to ionizing radiation
    Impact of standard radiation
    therapy in an undiagnosed
    ataxia-telangiectasia (A-T) patient
    •  140 such patients screened for dysfunction
    in known radiosensitivity genes (e.g., ATM
    and NBN). None found.
    •  Thus, opportunity to discover new genes
    underlying response to DNA damage.
    •  Hypothesis: each patient has a single gene
    disorder, yet the phenotype is only
    observed when they receive radiation.

  3. Interpreting genetic variation: context is crucial
    Genetic variation
    ...CCTCATGCATGGAAA...
    ...CCTCATGTATGGAAA...
    ...CCTCATGCATGGAAA...
    ...CCTCATGCATGGAAA...
    ...CCTCATGTATGGAAA...
    ...CCTCATGCATGGAAA...
    ...CCTCATGTATGGAAA...
    Chromatin marks
    DNA methylation
    RNA expression
    TF binding

  4. GEMINI: a flexible framework for exploring genome variation
    Uma Paila, Brad Chapman, Brent Pedersen
    github.com/arq5x/gemini
    gemini.rtfd.org

  5. How does GEMINI work?

  6. The GEMINI database model.

  7. Ad hoc variant exploration: genotype/phenotype filters
    gemini query -q "SELECT *
    FROM variants
    WHERE impact_severity == 'HIGH'
    AND max_aaf_all <= 0.001" \
    --gt-filter "(gt_types).(LDL > 300).(!=HOM_REF).(count > 100)
    and
    (gt_types).(LDL < 100).(!=HOM_REF).(count < 10)"
    Which rare, deleterious variants are enriched in people with
    high LDL (>300 mg/dL) levels?
    gemini query -q "SELECT *
    FROM variants
    WHERE impact_severity == 'HIGH'
    AND max_aaf_all <= 0.001" \
    --gt-filter "(gt_types).(breed='angus').(!=HOM_REF).(count > 100)
    and
    (gt_types).(breed='belgian').(!=HOM_REF).(count < 10)"
    Which rare, deleterious variants are enriched in Angus cattle but not
    Belgian Blue? (theoretical at the moment)
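
One way to read the genotype filter above: among the samples that satisfy a phenotype condition, count those whose genotype is not homozygous reference, then compare that count with a threshold. The sketch below illustrates that reading on a made-up four-sample cohort with scaled-down thresholds. The sample table, the variant names, and the helper functions are hypothetical and are not GEMINI internals.

```python
from typing import Callable, Dict, List

# Hypothetical sample table: sample name -> LDL level in mg/dL.
samples: Dict[str, float] = {"s1": 320.0, "s2": 350.0, "s3": 90.0, "s4": 85.0}

# Hypothetical genotype types for two variants, keyed by sample name.
variants: Dict[str, Dict[str, str]] = {
    "var_a": {"s1": "HET", "s2": "HOM_ALT", "s3": "HOM_REF", "s4": "HOM_REF"},
    "var_b": {"s1": "HET", "s2": "HOM_REF", "s3": "HET", "s4": "HET"},
}


def count_carriers(gt_types: Dict[str, str],
                   keep_sample: Callable[[float], bool]) -> int:
    """Count the selected samples whose genotype is not HOM_REF."""
    return sum(
        1
        for name, ldl in samples.items()
        if keep_sample(ldl) and gt_types[name] != "HOM_REF"
    )


def enriched_in_high_ldl(gt_types: Dict[str, str],
                         more_than: int = 1,
                         fewer_than: int = 1) -> bool:
    """Mirror the two-part filter: many carriers with high LDL, few with low LDL."""
    high = count_carriers(gt_types, lambda ldl: ldl > 300)
    low = count_carriers(gt_types, lambda ldl: ldl < 100)
    return high > more_than and low < fewer_than


hits: List[str] = [name for name, gts in variants.items()
                   if enriched_in_high_ldl(gts)]
print(hits)
```

The slide's thresholds (more than 100 carriers, fewer than 10) assume a much larger cohort; the toy defaults here only keep the example small.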

  8. Automated tools for disease inheritance models
    Dominant: A/A, A/G, A/G
    Recessive (consang.): A/G, G/G, A/G
    Recessive (compound heterozygous): C/C, A/G, A/A, A/G, C/T, C/T
    De novo: A/A, A/G, A/A
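
As a rough illustration of what such inheritance tests check, the sketch below classifies a single-site trio by counting non-reference alleles. It assumes genotypes are given in (father, mother, child) order with a known reference allele and an affected child. The transcript does not say which genotype belongs to which family member, and the function names are hypothetical rather than GEMINI's own.

```python
from typing import Tuple


def alt_count(genotype: str, ref: str) -> int:
    """Number of non-reference alleles in a genotype such as 'A/G'."""
    return sum(allele != ref for allele in genotype.split("/"))


def classify_trio(father: str, mother: str, child: str, ref: str) -> str:
    """Name the simplest model consistent with one site in one trio."""
    f, m, c = (alt_count(g, ref) for g in (father, mother, child))
    if f == 0 and m == 0 and c >= 1:
        return "de novo"          # child carries an allele neither parent has
    if f == 1 and m == 1 and c == 2:
        return "recessive"        # both parents carriers, child homozygous
    if c == 1 and sorted((f, m)) == [0, 1]:
        return "dominant"         # only plausible if the carrier parent is affected
    return "no simple model"


Site = Tuple[str, str, str, str]  # (father, mother, child, ref)


def is_compound_het(site1: Site, site2: Site) -> bool:
    """True when the child is het at both sites, one allele from each parent."""
    f1, m1, c1 = (alt_count(g, site1[3]) for g in site1[:3])
    f2, m2, c2 = (alt_count(g, site2[3]) for g in site2[:3])
    child_het_both = c1 == 1 and c2 == 1
    from_different_parents = (f1, m1, f2, m2) in {(1, 0, 0, 1), (0, 1, 1, 0)}
    return child_het_both and from_different_parents


print(classify_trio("A/A", "A/A", "A/G", ref="A"))
print(classify_trio("A/G", "A/G", "G/G", ref="A"))
print(is_compound_het(("A/G", "A/A", "A/G", "A"), ("C/C", "C/T", "C/T", "C")))
```

A real tool would also consult affection status, genotype quality, and phasing, none of which this sketch models.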

  9. GEMINI is popular for rare disease research.
    UW Center
    for Mendelian Genomics

  10. Two key drawbacks of GEMINI
    •  Currently best for exome studies. Scales poorly for WGS.
       genome >> exome: data size, complexity (non-coding)
    •  Anthropocentric. Currently human (build 37) only.

  11. Improve speed for WGS datasets: use GQT
    Genotype
    Query Tools
    github.com/ryanlayer/gqt
    Nature Methods, 2015

  12. Improve speed for WGS datasets: RDBMS flexibility
        
    SQLite (current)
    PostgreSQL, MySQL, CloudSQL, BigQuery
    SQLAlchemy: database abstraction layer

  13. Improve variant annotation speed and flexibility
    #CHROM POS ID REF ALT QUAL FILTER
    2 41647 . A G 4495.41 PASS
    2 45895 . A G 463.75 PASS
    2 224970 . C T 4241.64 PASS
    2 229934 . A G 5037.95 PASS
    2 234130 . T G 3958 PASS
    2 242732 . T TAAC 3193.19 PASS
    2 242800 . T C 3929.77 PASS
    2 243504 . C T 6628.06 PASS
    2 243567 . T TA 3398.03 HRunFilter
    2 262553 . T C 3503.49 PASS
    2 264895 . G C 3774.13 PASS
    2 269352 . G A 9802.28 PASS
    2 276942 . A G 5878.58 PASS
    2 277250 . G A 7051.35 PASS
    2 279705 . C T 7139.54 PASS
    2 283231 . A AT 6976.81 HRunFilter
    2 675831 . G T 865.05 PASS
    2 676177 . C G 4961.19 PASS
    2 905368 . C G 101.98 ABFilter;
    2 905369 . C G 28.97 ABFilter
    2 905393 . C G 930.81 QDFilter
    2 905427 . C G 140.17 QDFilter
    2 905442 . A T 131.51 ABFilter
    2 905492 . T G 550.3 QDFilter
    2 905494 . C G 48.5 ABFilter
    2 905533 . C T 320.33 ABFilter
    2 905576 . T G 72.09 QDFilter
    2 905581 . C T 1276.63 QDFilter
    2 905595 . G C 390.15 ABFilter
    2 905634 . A C 393.91 QDFilter
    2 905687 . C G 3233.06 ABFilter
    2 905736 . A T 1324.63 QDFilter
    2 905763 . G C 15.12 ABFilter
    ...
    Tabix’ed

  14. vcfanno: flexible and fast VCF annotation
    Naked VCF -> vcfanno -> VCF w/ annotations in INFO field
    Brent Pedersen
    https://github.com/brentp/vcfanno
    VCF, BED, GFF, BAM (soon BW)
    Manuscript in prep.

  15. [[annotation]]
    file="ExAC.v3.vcf"
    fields=["AF", "AC_Het"]
    names=["exac_aaf", "exac_num_het"]
    ops=["first", "first"]
    [[annotation]]
    file="dbsnp.b141.vcf.gz"
    fields=["ID"]
    names=["rs_ids"]
    ops=["concat"]
    [[annotation]]
    file="gerp.elements.bed.gz"
    columns=[4,4]
    names=["gerp_mean","gerp_var"]
    ops=["mean", "lua:variance(vals)"]
    vcfanno configuration file.
    Allows multiple annotations from each file.
    Can rename the annotations in the resulting VCF.
    Multiple operations to summarize the results of multiple hits in the annotation file:
    mean, max, min, concat, count, uniq, first, flag.
    Matches on POS+REF+ALT for VCF annotations.
    Lua for custom computations; variance() is defined in custom.lua.
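
To make the summarizing operations concrete, here is a small sketch of how each named op might collapse the values from several overlapping annotation records into one value. The op names come from the slide. The exact behaviour (for example, the comma separator and how `uniq` orders values) is an assumption for illustration and may differ from what vcfanno does.

```python
from statistics import mean
from typing import Callable, Dict, List

# Each op takes the list of values collected from all overlapping records.
OPS: Dict[str, Callable[[List], object]] = {
    "mean": lambda vals: mean(float(v) for v in vals),
    "max": lambda vals: max(float(v) for v in vals),
    "min": lambda vals: min(float(v) for v in vals),
    "concat": lambda vals: ",".join(str(v) for v in vals),
    "count": lambda vals: len(vals),
    # dict.fromkeys keeps the first occurrence of each value, in order.
    "uniq": lambda vals: ",".join(dict.fromkeys(str(v) for v in vals)),
    "first": lambda vals: vals[0],
    "flag": lambda vals: len(vals) > 0,
}


def summarize(vals: List, op: str):
    """Collapse the values from multiple hits using the named operation."""
    return OPS[op](vals)


rs_hits = ["rs1", "rs2", "rs1"]
print(summarize(rs_hits, "concat"))
print(summarize(rs_hits, "uniq"))
print(summarize(["1.0", "3.0"], "mean"))
```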

  16. before and after vcfanno
    Naked VCF:
    AC=11;AF=0.017
    Dressed VCF:
    AC=11;AF=0.017;exac_aaf=0.0012;exac_num_het=8;rs_ids=1234;gerp_mean=7.25e-07;gerp_var=1.39e-08
    [[annotation]]
    file="ExAC.v3.vcf"
    fields=["AF", "AC_Het"]
    names=["exac_aaf", "exac_num_het"]
    ops=["first", "first"]
    [[annotation]]
    file="dbsnp.b141.vcf.gz"
    fields=["ID"]
    names=["rs_ids"]
    ops=["concat"]
    [[annotation]]
    file="gerp.elements.bed.gz"
    columns=[4,4]
    names=["gerp_mean","gerp_var"]
    ops=["mean", "lua:variance(vals)"]
    vcfanno configuration file
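
Going from the naked to the dressed record amounts to appending key=value pairs to the semicolon-separated INFO column. A minimal sketch of that step, using the values shown on this slide; the helper name `dress_info` is hypothetical and this is not how vcfanno is implemented:

```python
from typing import Dict


def dress_info(info: str, annotations: Dict[str, str]) -> str:
    """Append key=value annotations to a VCF INFO string."""
    # An INFO value of "." means "no fields", so drop it before appending.
    parts = [p for p in info.split(";") if p and p != "."]
    parts.extend(f"{key}={value}" for key, value in annotations.items())
    return ";".join(parts)


naked = "AC=11;AF=0.017"
dressed = dress_info(
    naked,
    {
        "exac_aaf": "0.0012",
        "exac_num_het": "8",
        "rs_ids": "1234",
        "gerp_mean": "7.25e-07",
        "gerp_var": "1.39e-08",
    },
)
print(dressed)
```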

  17. “chromsweep”: a sweeping algorithm for pre-sorted data
    VCF
    anno1
    anno2
    anno3
    1.1 1.2 1.3
    2.1 2.2
    3.1 3.2
    {}
    []
    cache
    result
    q.1 q.2 q.3
    chromsweep is the fundamental algorithm
    underlying our bedtools software

  18. “chromsweep”: a sweeping algorithm for pre-sorted data
    VCF
    anno1
    anno2
    anno3
    1.1 1.2 1.3
    2.1 2.2
    3.1 3.2
    {}
    [3.1]
    cache
    result
    q.1 q.2 q.3

  19. “chromsweep”: a sweeping algorithm for pre-sorted data
    VCF
    anno1
    anno2
    anno3
    1.1 1.2 1.3
    2.1 2.2
    3.1 3.2
    {3.1}
    [3.1]
    cache
    result
    q.1 q.2 q.3
    q.1 =

  20. “chromsweep”: a sweeping algorithm for pre-sorted data
    VCF
    anno1
    anno2
    anno3
    1.1 1.2 1.3
    2.1 2.2
    3.1 3.2
    {3.1, 1.1}
    [3.1, 1.1]
    cache
    result
    q.1 q.2 q.3
    q.1 =

  21. “chromsweep”: a sweeping algorithm for pre-sorted data
    VCF
    anno1
    anno2
    anno3
    1.1 1.2 1.3
    2.1 2.2
    3.1 3.2
    {}
    [3.1, 1.1]
    cache
    result
    q.1 q.2 q.3
    q.1 = 3.1, 1.1

  22. “chromsweep”: a sweeping algorithm for pre-sorted data
    VCF
    anno1
    anno2
    anno3
    1.1 1.2 1.3
    2.1 2.2
    3.1 3.2
    {}
    [3.1]
    cache
    result
    q.1 q.2 q.3
    q.1 = 3.1, 1.1

  23. “chromsweep”: a sweeping algorithm for pre-sorted data
    VCF
    anno1
    anno2
    anno3
    1.1 1.2 1.3
    2.1 2.2
    3.1 3.2
    {}
    []
    cache
    result
    q.1 q.2 q.3
    q.1 = 3.1, 1.1

  24. “chromsweep”: a sweeping algorithm for pre-sorted data
    VCF
    anno1
    anno2
    anno3
    1.1 1.2 1.3
    2.1 2.2
    3.1 3.2
    {}
    []
    cache
    result
    q.1 q.2 q.3
    q.2 =
    q.1 = 3.1, 1.1

  25. “chromsweep”: a sweeping algorithm for pre-sorted data
    VCF
    anno1
    anno2
    anno3
    1.1 1.2 1.3
    2.1 2.2
    3.1 3.2
    {3.2}
    [3.2]
    cache
    result
    q.1 q.2 q.3
    q.2 =
    q.1 = 3.1, 1.1

  26. “chromsweep”: a sweeping algorithm for pre-sorted data
    VCF
    anno1
    anno2
    anno3
    1.1 1.2 1.3
    2.1 2.2
    3.1 3.2
    {3.2, 2.1}
    [3.2, 2.1]
    cache
    result
    q.1 q.2 q.3
    q.2 =
    q.1 = 3.1, 1.1

  27. “chromsweep”: a sweeping algorithm for pre-sorted data
    VCF
    anno1
    anno2
    anno3
    1.1 1.2 1.3
    2.1 2.2
    3.1 3.2
    {3.2, 2.1, 1.2}
    [3.2, 2.1, 1.2]
    cache
    result
    q.1 q.2 q.3
    q.2 =
    q.1 = 3.1, 1.1

  28. “chromsweep”: a sweeping algorithm for pre-sorted data
    VCF
    anno1
    anno2
    anno3
    1.1 1.2 1.3
    2.1 2.2
    3.1 3.2
    {3.2, 2.1, 1.2}
    [3.2, 2.1, 1.2]
    cache
    result
    q.1 q.2 q.3
    q.2 =
    q.1 = 3.1, 1.1
    *2.1 stays in the cache

  29. “chromsweep”: a sweeping algorithm for pre-sorted data
    VCF
    anno1
    anno2
    anno3
    1.1 1.2 1.3
    2.1 2.2
    3.1 3.2
    {}
    [3.2, 1.2]
    cache
    result
    q.1 q.2 q.3
    q.2 = 3.2, 2.1, 1.2
    q.1 = 3.1, 1.1
    Now 2.1 is removed

  30. “chromsweep”: a sweeping algorithm for pre-sorted data
    VCF
    anno1
    anno2
    anno3
    1.1 1.2 1.3
    2.1 2.2
    3.1 3.2
    {}
    [3.2]
    cache
    result
    q.1 q.2 q.3
    q.2 = 3.2, 2.1, 1.2
    q.1 = 3.1, 1.1

  31. “chromsweep”: a sweeping algorithm for pre-sorted data
    VCF
    anno1
    anno2
    anno3
    1.1 1.2 1.3
    2.1 2.2
    3.1 3.2
    {}
    []
    cache
    result
    q.1 q.2 q.3
    q.2 = 3.2, 2.1, 1.2
    q.1 = 3.1, 1.1

  32. “chromsweep”: a sweeping algorithm for pre-sorted data
    VCF
    anno1
    anno2
    anno3
    1.1 1.2 1.3
    2.1 2.2
    3.1 3.2
    {}
    []
    cache
    result
    q.1 q.2 q.3
    q.3 = 1.3, 2.2
    q.2 = 3.2, 2.1, 1.2
    q.1 = 3.1, 1.1
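
The walk-through in these frames can be approximated in a few lines of Python. This is a simplified sketch of a sweep over start-sorted intervals on a single chromosome, not the vcfanno or bedtools implementation. The coordinates are invented so that the overlaps match the q.1, q.2, and q.3 results on the slides, and the labels follow the slide convention (for example, "3.1" is the first interval of anno3).

```python
import heapq
from typing import Iterator, List, Tuple

Interval = Tuple[int, int, str]  # (start, end, label), half-open coordinates


def chromsweep(queries: List[Interval],
               annotations: List[Interval]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (query label, overlapping annotation labels).

    Both inputs must be sorted by start coordinate.
    """
    cache: List[Interval] = []
    idx = 0
    for q_start, q_end, q_label in queries:
        # Cached intervals that end at or before this query's start cannot
        # overlap it or any later query, so they leave the cache.
        cache = [a for a in cache if a[1] > q_start]
        # Pull in every annotation that starts before this query ends.
        while idx < len(annotations) and annotations[idx][0] < q_end:
            if annotations[idx][1] > q_start:
                cache.append(annotations[idx])
            idx += 1
        yield q_label, [a[2] for a in cache if a[0] < q_end and a[1] > q_start]


# Hypothetical coordinates for three annotation files and three queries.
anno1 = [(12, 15, "1.1"), (35, 38, "1.2"), (52, 55, "1.3")]
anno2 = [(33, 36, "2.1"), (54, 58, "2.2")]
anno3 = [(8, 12, "3.1"), (28, 32, "3.2")]
queries = [(10, 20, "q.1"), (30, 40, "q.2"), (50, 60, "q.3")]

# Merge the pre-sorted annotation streams into one start-sorted stream.
merged = list(heapq.merge(anno1, anno2, anno3))
results = dict(chromsweep(queries, merged))
for label, hits in results.items():
    print(label, "=", ", ".join(hits))
```

Because every input is already sorted, each annotation record is read once and the cache stays small, which is what makes the approach attractive for large files.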

  33. vcfanno implements the first parallel chromsweep
    VCF
    anno1
    anno2
    anno3
    Step 1: partition the query set at “breaks” in the data or when N (e.g. 10) intervals are found
    Step 2: Use Tabix to extract the records germane to a chunk from each annotation file
    Step 3: Chromsweep each chunk independently.
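
Step 1 and the independent processing of chunks can be sketched as follows. The gap threshold used to define a "break", the chunk size, and the stand-in worker function are all assumptions for illustration. A real implementation would fetch records with Tabix and run the sweep inside the worker.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on one chromosome


def partition(queries: List[Interval],
              max_gap: int = 1000,
              max_size: int = 10) -> List[List[Interval]]:
    """Split start-sorted queries at large gaps or after max_size intervals."""
    chunks: List[List[Interval]] = []
    current: List[Interval] = []
    for interval in queries:
        gap_too_big = bool(current) and interval[0] - current[-1][1] > max_gap
        if current and (gap_too_big or len(current) >= max_size):
            chunks.append(current)
            current = []
        current.append(interval)
    if current:
        chunks.append(current)
    return chunks


def annotate_chunk(chunk: List[Interval]) -> Tuple[int, int, int]:
    """Stand-in for steps 2 and 3: report the region that would be fetched."""
    region_start = chunk[0][0]
    region_end = max(end for _, end in chunk)
    return region_start, region_end, len(chunk)


queries = [(100, 101), (150, 151), (200, 201), (250, 251),
           (5000, 5001), (5100, 5101)]
chunks = partition(queries, max_gap=1000, max_size=3)

# Chunks do not depend on each other, so they can be processed concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(annotate_chunk, chunks))
print(results)
```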

  34. vcfanno is speedy.
    18 annotations:
    29K variants / sec @ 12 cores

  35. How do we support other species?
    vcfanno
    VCF (hg38…)
    VCF from any species and any genome build.
    vcfanno configuration file points to appropriate annotations.
    GEMINI database is created based on vcfanno configuration file.
    GEMINI database creation should be ~60X faster.

  36. How? Simply point vcfanno to the relevant annotations
    Human (hg38)
    [[annotation]]
    file="cpg.hg38.bed.gz"
    columns=[4]
    names=["cpg_density"]
    ops=["mean"]
    [[annotation]]
    file="rmsk.hg38.bed.gz"
    columns=[4]
    names=["rmsk"]
    ops=["concat"]
    [[annotation]]
    file="cytoband.hg38.bed.gz"
    columns=[4]
    names=["cytoband"]
    ops=["uniq"]
    Cow (bosTau8)
    [[annotation]]
    file="cpg.bosTau8.bed.gz"
    columns=[4]
    names=["cpg_density"]
    ops=["mean"]
    [[annotation]]
    file="rmsk.bosTau8.bed.gz"
    columns=[4]
    names=["rmsk"]
    ops=["concat"]
    [[annotation]]
    file="cytoband.bosTau8.bed.gz"
    columns=[4]
    names=["cytoband"]
    ops=["uniq"]
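
Since the two configurations differ only in the genome build embedded in each file name, they could be generated from one template. A minimal sketch, assuming the file naming pattern shown on this slide. It uses `columns` for BED files and the `uniq` op, following the configuration slide earlier in the deck, and `make_config` is a hypothetical helper, not part of vcfanno or GEMINI.

```python
from typing import List, Tuple

# (file prefix, annotation name, summarizing op) for each BED annotation.
ANNOTATIONS: List[Tuple[str, str, str]] = [
    ("cpg", "cpg_density", "mean"),
    ("rmsk", "rmsk", "concat"),
    ("cytoband", "cytoband", "uniq"),
]


def make_config(build: str) -> str:
    """Render a vcfanno-style configuration for the given genome build."""
    blocks = []
    for prefix, name, op in ANNOTATIONS:
        blocks.append("\n".join([
            "[[annotation]]",
            f'file="{prefix}.{build}.bed.gz"',
            "columns=[4]",
            f'names=["{name}"]',
            f'ops=["{op}"]',
        ]))
    return "\n\n".join(blocks) + "\n"


human_config = make_config("hg38")
cow_config = make_config("bosTau8")
print(cow_config)
```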

  37. Allows the use of the same query, regardless of species
    gemini query -q "SELECT *
    FROM variants
    WHERE cpg_density >= 0.9"
    Which variants overlap CpG islands whose CpG density is greater
    than or equal to 0.9?
    Human (hg38)

    Cow (bosTau8)
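
The point that one query serves both species can be illustrated with two toy SQLite databases that share a schema. The table layout and rows below are invented for the example and are far simpler than a real GEMINI database; only the `variants` table name and the `cpg_density` column come from the slide.

```python
import sqlite3
from typing import List, Tuple

QUERY = "SELECT chrom, start FROM variants WHERE cpg_density >= 0.9"


def toy_db(rows: List[Tuple[str, int, float]]) -> sqlite3.Connection:
    """Build an in-memory database with a minimal, hypothetical variants table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variants (chrom TEXT, start INTEGER, cpg_density REAL)"
    )
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?)", rows)
    return conn


human = toy_db([("chr1", 100, 0.95), ("chr1", 200, 0.40)])
cow = toy_db([("chr5", 300, 0.91), ("chr5", 400, 0.10)])

# The same SQL text runs unchanged against either database.
human_hits = human.execute(QUERY).fetchall()
cow_hits = cow.execute(QUERY).fetchall()
print("human:", human_hits)
print("cow:", cow_hits)
```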

  38. Summary
    •  GEMINI is a flexible framework for exploring
    genetic variation from WES and WGS studies.
    •  Integrates variants, genotypes, phenotypes and
    annotations into a simple database.

    •  Current focus:
    •  Improving scalability for WGS
    •  Support for any (diploid) species

    •  Expected release: April 2016
    github.com/arq5x/gemini
    gemini.rtfd.org

  39. Challenges
    •  Multi-allelic variants are a bugger.
    •  Even harder with polyploidy. The
    VCF format is ill-suited to this.
    •  Versioning & distributing annos.
    See GGD: https://github.com/arq5x/ggd

  40. Thank you.
    Brent Pedersen, Ryan Layer, Jim Havrilla
    Funding:

  41. First discovery with GEMINI:
    Defects in mitochondrial mRNA maturation cause radiosensitivity
    Sample A21: chr10, MTPAP, exon9, N478D homozygote
