Mining genetic variation in any species with GEMINI

Aaron Quinlan
January 10, 2016

Plant and Animal Genomics
January 9, 2016

Transcript

  1. Mining genetic variation in any species with GEMINI

    Aaron Quinlan, University of Utah. quinlanlab.org, @aaronquinlan. Plant and Animal Genome Conference, January 9, 2016.
  2. Origins of GEMINI: Genetics of hypersensitivity to ionizing radiation

    Impact of standard radiation therapy in an undiagnosed ataxia-telangiectasia (A-T) patient
    •  140 such patients screened for dysfunction in known radiosensitivity genes (e.g., ATM and NBN). None found.
    •  Thus, opportunity to discover new genes underlying response to DNA damage.
    •  Hypothesis: each patient has a single gene disorder, yet the phenotype is only observed when they receive radiation.
  3. Interpreting genetic variation: context is crucial

    Genetic variation: ...CCTCATGCATGGAAA... vs. ...CCTCATGTATGGAAA... (figure: stacked sequences differing at a single base)
    Context: chromatin marks, DNA methylation, RNA expression, TF binding
  4. GEMINI: a flexible framework for exploring genome variation

    Uma Paila, Brad Chapman, Brent Pedersen. github.com/arq5x/gemini gemini.rtfd.org
  5. How does GEMINI work?

  6. The GEMINI database model.

  7. Ad hoc variant exploration: genotype/phenotype filters

    Which rare, deleterious variants are enriched in people with high LDL (>300 mg/dL) levels?

    gemini -q "SELECT * FROM variants WHERE impact_severity == 'HIGH' AND max_aaf_all <= 0.001" --gt-filter "(gt_types).(LDL > 300).(!=HOM_REF).(count > 100) and (gt_types).(LDL < 100).(!=HOM_REF).(count < 10)"

    Which rare, deleterious variants are enriched in Angus cattle but not Belgian Blue? (theoretical at the moment)

    gemini -q "SELECT * FROM variants WHERE impact_severity == 'HIGH' AND max_aaf_all <= 0.001" --gt-filter "(gt_types).(breed='angus').(!=HOM_REF).(count > 100) and (gt_types).(breed='belgian').(!=HOM_REF).(count < 10)"
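The two halves of the command on this slide can be illustrated with a small, self-contained Python sketch. It is not GEMINI's implementation: the toy `variants` table models only the two columns named in the slide's WHERE clause, the rows and sample names are made up, and `count_carriers` is a hypothetical helper that mimics the idea of counting non-HOM_REF samples within a phenotype group.

```python
import sqlite3

# Toy stand-in for a GEMINI database; rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants (variant_id INTEGER, gene TEXT, "
    "impact_severity TEXT, max_aaf_all REAL)"
)
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?)",
    [
        (1, "GENE_A", "HIGH", 0.0004),  # rare and severe: kept
        (2, "GENE_B", "HIGH", 0.0500),  # severe but common: dropped
        (3, "GENE_C", "LOW", 0.0001),   # rare but mild: dropped
    ],
)

# Same WHERE clause as on the slide.
rare_severe = conn.execute(
    "SELECT variant_id, gene FROM variants "
    "WHERE impact_severity == 'HIGH' AND max_aaf_all <= 0.001"
).fetchall()


def count_carriers(genotypes, phenotypes, in_group):
    """Count samples in a phenotype group whose genotype is not HOM_REF.

    genotypes:  sample name -> "HOM_REF", "HET" or "HOM_ALT"
    phenotypes: sample name -> phenotype value (here, LDL in mg/dL)
    in_group:   predicate on the phenotype value (hypothetical helper)
    """
    return sum(
        1
        for sample, gt in genotypes.items()
        if in_group(phenotypes[sample]) and gt != "HOM_REF"
    )


genotypes = {"s1": "HET", "s2": "HOM_REF", "s3": "HOM_ALT", "s4": "HET"}
ldl = {"s1": 320, "s2": 350, "s3": 310, "s4": 90}

high_ldl_carriers = count_carriers(genotypes, ldl, lambda v: v > 300)
low_ldl_carriers = count_carriers(genotypes, ldl, lambda v: v < 100)
print(rare_severe, high_ldl_carriers, low_ldl_carriers)
```

A variant would pass the slide's filter when the first count is large and the second is small; the thresholds (100 and 10) come from the slide, while everything else here is illustrative.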
  8. Automated tools for disease inheritance models

    Pedigree genotypes shown for each model:
    •  Dominant: A/A, A/G, A/G
    •  Recessive (consang.): A/G, G/G, A/G
    •  Recessive (compound heterozygous): C/C, A/G, A/A, A/G, C/T, C/T
    •  De novo: A/A, A/G, A/A
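As a rough illustration of what such inheritance-model tools check, here is a minimal sketch for a single variant in a mother/father/child trio. It assumes the child is affected, encodes genotypes as alternate-allele counts, and deliberately leaves out compound heterozygotes (which require pairs of variants in the same gene). The `Member` type and `candidate_models` function are hypothetical names, not part of GEMINI.

```python
from typing import List, NamedTuple


class Member(NamedTuple):
    alt_alleles: int  # 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate
    affected: bool


def candidate_models(mother: Member, father: Member, child: Member) -> List[str]:
    """Return the simple inheritance models consistent with one variant in a trio.

    Assumes the child is affected. Compound heterozygosity is not handled.
    """
    parents = (mother, father)
    models = []
    # De novo: child carries an allele that neither parent has.
    if child.alt_alleles == 1 and all(p.alt_alleles == 0 for p in parents):
        models.append("de novo")
    # Recessive: child is homozygous, both parents are unaffected carriers.
    if child.alt_alleles == 2 and all(
        p.alt_alleles == 1 and not p.affected for p in parents
    ):
        models.append("autosomal recessive")
    # Dominant: child is heterozygous, and in the parents carrying the
    # allele goes hand in hand with being affected.
    if (
        child.alt_alleles == 1
        and any(p.alt_alleles == 1 and p.affected for p in parents)
        and all(p.affected == (p.alt_alleles >= 1) for p in parents)
    ):
        models.append("autosomal dominant")
    return models


de_novo = candidate_models(Member(0, False), Member(0, False), Member(1, True))
recessive = candidate_models(Member(1, False), Member(1, False), Member(2, True))
dominant = candidate_models(Member(0, False), Member(1, True), Member(1, True))
print(de_novo, recessive, dominant)
```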
  9. GEMINI is popular for rare disease research.

    UW Center for Mendelian Genomics
  10. Two key drawbacks of GEMINI

    •  Currently best for exome studies. Scales poorly for WGS: genome >> exome in data size and complexity (non-coding).
    •  Anthropocentric. Currently human (build 37) only.
  11. Improve speed for WGS datasets: use GQT

    Genotype Query Tools. github.com/ryanlayer/gqt. Nature Methods, 2015.
  12. Improve speed for WGS datasets: RDBMS flexibility

    SQLite (current), PostgreSQL, MySQL, CloudSQL, BigQuery. SQLAlchemy: database abstraction layer.
  13. Improve variant annotation speed and flexibility

    #CHROM  POS     ID  REF  ALT   QUAL     FILTER
    2       41647   .   A    G     4495.41  PASS
    2       45895   .   A    G     463.75   PASS
    2       224970  .   C    T     4241.64  PASS
    2       229934  .   A    G     5037.95  PASS
    2       234130  .   T    G     3958     PASS
    2       242732  .   T    TAAC  3193.19  PASS
    2       242800  .   T    C     3929.77  PASS
    2       243504  .   C    T     6628.06  PASS
    2       243567  .   T    TA    3398.03  HRunFilter
    2       262553  .   T    C     3503.49  PASS
    2       264895  .   G    C     3774.13  PASS
    2       269352  .   G    A     9802.28  PASS
    2       276942  .   A    G     5878.58  PASS
    2       277250  .   G    A     7051.35  PASS
    2       279705  .   C    T     7139.54  PASS
    2       283231  .   A    AT    6976.81  HRunFilter
    2       675831  .   G    T     865.05   PASS
    2       676177  .   C    G     4961.19  PASS
    2       905368  .   C    G     101.98   ABFilter;
    2       905369  .   C    G     28.97    ABFilter
    2       905393  .   C    G     930.81   QDFilter
    2       905427  .   C    G     140.17   QDFilter
    2       905442  .   A    T     131.51   ABFilter
    2       905492  .   T    G     550.3    QDFilter
    2       905494  .   C    G     48.5     ABFilter
    2       905533  .   C    T     320.33   ABFilter
    2       905576  .   T    G     72.09    QDFilter
    2       905581  .   C    T     1276.63  QDFilter
    2       905595  .   G    C     390.15   ABFilter
    2       905634  .   A    C     393.91   QDFilter
    2       905687  .   C    G     3233.06  ABFilter
    2       905736  .   A    T     1324.63  QDFilter
    2       905763  .   G    C     15.12    ABFilter

    Tabix’ed . . .
  14. vcfanno: flexible and fast VCF annotation

    Naked VCF -> vcfanno -> VCF w/ annotations in INFO field. Brent Pedersen. https://github.com/brentp/vcfanno. VCF, BED, GFF, BAM (soon BW). Manuscript in prep.
  15. vcfanno configuration file

    [[annotation]]
    file="ExAC.v3.vcf"
    fields=["AF", "AC_Het"]
    names=["exac_aaf", "exac_num_het"]
    ops=["first", "first"]

    [[annotation]]
    file="dbsnp.b141.vcf.gz"
    fields=["ID"]
    names=["rs_ids"]
    ops=["concat"]

    [[annotation]]
    file="gerp.elements.bed.gz"
    columns=[4,4]
    names=["gerp_mean","gerp_var"]
    ops=["mean", "lua:variance(vals)"]

    •  Allows multiple annotations from each file.
    •  Can rename the annotations in the resulting VCF.
    •  Multiple operations to summarize the results of multiple hits in annot. file: mean, max, min, concat, count, uniq, first, flag.
    •  Match on POS+REF+ALT for VCF annotations.
    •  Lua for custom computations. variance() defined in custom.js
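The summarizing operations listed on this slide (mean, max, min, concat, count, uniq, first, flag) reduce several overlapping annotation hits to a single value. The sketch below shows one plausible reading of each operation in Python; it is not vcfanno's code (vcfanno is a separate tool), the exact output formatting there may differ, and `REDUCERS` and `summarize` are hypothetical names.

```python
from statistics import mean

# One reducer per operation named on the slide. Each takes the list of
# values collected from all annotation records that overlap a variant.
REDUCERS = {
    "mean": lambda vals: mean(float(v) for v in vals),
    "max": lambda vals: max(float(v) for v in vals),
    "min": lambda vals: min(float(v) for v in vals),
    "concat": lambda vals: ",".join(str(v) for v in vals),
    "count": len,
    # dict.fromkeys drops duplicates while keeping first-seen order.
    "uniq": lambda vals: ",".join(dict.fromkeys(str(v) for v in vals)),
    "first": lambda vals: vals[0],
    "flag": lambda vals: len(vals) > 0,
}


def summarize(hits, op):
    """Collapse the values from multiple overlapping hits using one operation."""
    return REDUCERS[op](hits)


rs_hits = ["rs1", "rs2", "rs1"]
print(summarize(rs_hits, "concat"))     # rs1,rs2,rs1
print(summarize(rs_hits, "uniq"))       # rs1,rs2
print(summarize([2.0, 4.0], "mean"))    # 3.0
```

In this sketch mean, max, min and first raise an error on an empty hit list, so a caller would need to skip variants with no overlapping records.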
  16. before and after vcfanno

    Naked VCF: AC=11;AF=0.017
    Dressed VCF: AC=11;AF=0.017;exac_aaf=0.0012;exac_num_het=8;rs_ids=1234;gerp_mean=7.25e-07;gerp_var=1.39e-08

    vcfanno configuration file:

    [[annotation]]
    file="ExAC.v3.vcf"
    fields=["AF", "AC_Het"]
    names=["exac_aaf", "exac_num_het"]
    ops=["first", "first"]

    [[annotation]]
    file="dbsnp.b141.vcf.gz"
    fields=["ID"]
    names=["rs_ids"]
    ops=["concat"]

    [[annotation]]
    file="gerp.elements.bed.gz"
    columns=[4,4]
    names=["gerp_mean","gerp_var"]
    ops=["mean", "js:variance(vals)"]
  17. “chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2

    anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [] cache result q.1 q.2 q.3 chromsweep is the fundamental algorithm underlying our bedtools software
  18. “chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2

    anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [3.1 ] cache result q.1 q.2 q.3
  19. “chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2

    anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {3.1 } [3.1 ] cache result q.1 q.2 q.3 q.1 =
  20. “chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2

    anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {3.1, 1.1 } [3.1, 1.1 ] cache result q.1 q.2 q.3 q.1 =
  21. “chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2

    anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [3.1, 1.1 ] cache result q.1 q.2 q.3 3.1, 1.1 q.1 =
  22. “chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2

    anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [3.1 ] cache result q.1 q.2 q.3 3.1, 1.1 q.1 =
  23. “chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2

    anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [] cache result q.1 q.2 q.3 3.1, 1.1 q.1 =
  24. “chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2

    anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [] cache result q.1 q.2 q.3 q.2 = 3.1, 1.1 q.1 =
  25. “chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2

    anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {3.2 } [3.2 ] cache result q.1 q.2 q.3 q.2 = 3.1, 1.1 q.1 =
  26. “chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2

    anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {3.2,2.1 } [3.2,2.1 ] cache result q.1 q.2 q.3 q.2 = 3.1, 1.1 q.1 =
  27. “chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2

    anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {3.2,2.1,1.2 } [3.2,2.1,1.2 ] cache result q.1 q.2 q.3 q.2 = 3.1, 1.1 q.1 =
  28. “chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2

    anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {3.2,2.1,1.2 } [3.2,2.1,1.2 ] cache result q.1 q.2 q.3 q.2 = *2.1 stays in the cache 3.1, 1.1 q.1 =
  29. “chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2

    anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [3.2,1.2 ] cache result q.1 q.2 q.3 q.2 = Now 2.1 is removed 3.2,2.1,1.2 3.1, 1.1 q.1 =
  30. “chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2

    anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [3.2 ] cache result q.1 q.2 q.3 q.2 = 3.2,2.1,1.2 3.1, 1.1 q.1 =
  31. “chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2

    anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [] cache result q.1 q.2 q.3 q.2 = 3.2,2.1,1.2 3.1, 1.1 q.1 =
  32. “chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2

    anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [] cache result q.1 q.2 q.3 q.3 = 1.3,2.2 q.2 = 3.2,2.1,1.2 3.1, 1.1 q.1 =
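The animation in slides 17 to 32 can be condensed into a short Python sketch of a sweep over pre-sorted intervals. This is an illustration of the general idea under simplifying assumptions (a single chromosome, half-open `(start, end, label)` tuples, all annotation files merged into one sorted stream), not the bedtools or vcfanno implementation. The coordinates are invented so that the hits reproduce the results shown on the final frame.

```python
import heapq


def chromsweep(queries, annotations):
    """Yield (query label, overlapping annotation labels).

    Both inputs must be sorted by start. `annotations` is consumed once,
    so every annotation record is read a single time.
    """
    cache = []
    stream = iter(annotations)
    pending = next(stream, None)
    for q_start, q_end, q_label in queries:
        # Cached intervals that end at or before this query's start cannot
        # overlap it, nor any later query, so they are dropped.
        cache = [a for a in cache if a[1] > q_start]
        # Pull in every annotation that starts before the query ends.
        while pending is not None and pending[0] < q_end:
            if pending[1] > q_start:
                cache.append(pending)
            pending = next(stream, None)
        # Report cached intervals that overlap the current query.
        yield q_label, [a[2] for a in cache if a[0] < q_end and a[1] > q_start]


# Hypothetical coordinates for the intervals drawn on the slides.
queries = [(10, 20, "q.1"), (40, 50, "q.2"), (70, 80, "q.3")]
anno1 = [(15, 18, "1.1"), (45, 48, "1.2"), (68, 72, "1.3")]
anno2 = [(42, 55, "2.1"), (75, 78, "2.2")]
anno3 = [(8, 12, "3.1"), (38, 41, "3.2")]

# heapq.merge interleaves the already-sorted files by start coordinate.
results = list(chromsweep(queries, heapq.merge(anno1, anno2, anno3)))
for label, hits in results:
    print(label, "=", hits)
```

The cache is what keeps the sweep linear: an annotation interval is only revisited while it can still overlap a query that has not yet been processed.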
  33. vcfanno implements the first parallel chromsweep

    Step 1: Partition the query set at “breaks” in the data or when N (e.g. 10) intervals are found.
    Step 2: Use Tabix to extract the records germane to a chunk from each annotation file.
    Step 3: Chromsweep each chunk independently.
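Step 1 can be sketched as follows. The slide does not define what counts as a "break", so this example assumes a break is a gap larger than a hypothetical `max_gap` between consecutive query intervals; `partition` and `region` are illustrative names. The `region` of each chunk is the span one would hand to Tabix in Step 2, which is not shown here.

```python
def partition(queries, chunk_size=10, max_gap=1000):
    """Split start-sorted (start, end) queries into independent chunks.

    A new chunk begins when the current one already holds `chunk_size`
    intervals, or when the next query starts more than `max_gap` bases
    after the previous one ends (an assumed definition of a "break").
    """
    chunks, current = [], []
    for query in queries:
        if current and (
            len(current) >= chunk_size or query[0] - current[-1][1] > max_gap
        ):
            chunks.append(current)
            current = []
        current.append(query)
    if current:
        chunks.append(current)
    return chunks


def region(chunk):
    """Span covered by a chunk: the range to fetch from each annotation file."""
    return min(q[0] for q in chunk), max(q[1] for q in chunk)


queries = [(100, 101), (150, 151), (5000, 5001), (5010, 5011), (5020, 5021)]
chunks = partition(queries, chunk_size=2, max_gap=1000)
regions = [region(c) for c in chunks]
print(regions)
```

Because each chunk only needs the annotation records inside its own region, the chunks can be swept independently, for example by separate worker processes.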
  34. vcfanno is speedy.

    18 annotations: 29K variants / sec @ 12 cores
  35. How do we support other species?

    vcfanno: VCF (hg38…), VCF from any species and any genome build
    •  vcfanno configuration file points to appropriate annotations
    •  GEMINI database is created based on vcfanno configuration file
    •  GEMINI database creation should be ~60X faster
  36. How? Simply point vcfanno to the relevant annotations

    Human (hg38)

    [[annotation]]
    file="cpg.hg38.bed.gz"
    fields=[4]
    names=["cpg_density"]
    ops=["mean"]

    [[annotation]]
    file="rmsk.hg38.bed.gz"
    fields=[4]
    names=["rmsk"]
    ops=["concat"]

    [[annotation]]
    file="cytoband.hg38.bed.gz"
    fields=[4]
    names=["cytoband"]
    ops=["distinct"]

    Cow (bosTau8)

    [[annotation]]
    file="cpg.bosTau8.bed.gz"
    fields=[4]
    names=["cpg_density"]
    ops=["mean"]

    [[annotation]]
    file="rmsk.bosTau8.bed.gz"
    fields=[4]
    names=["rmsk"]
    ops=["concat"]

    [[annotation]]
    file="cytoband.bosTau8.bed.gz"
    fields=[4]
    names=["cytoband"]
    ops=["distinct"]
  37. Allows the use of the same query, regardless of species

    Which variants overlap CpG islands whose CpG density is greater than or equal to 0.9?

    gemini -q "SELECT * FROM variants WHERE cpg_density >= 0.9"

    Human (hg38), Cow (bosTau8)
  38. Summary

    •  GEMINI is a flexible framework for exploring genetic variation from WES and WGS studies.
    •  Integrates variants, genotypes, phenotypes and annotations into a simple database.
    •  Current focus:
        •  Improving scalability for WGS
        •  Support for any (diploid) species
    •  Expected release: April 2016

    github.com/arq5x/gemini gemini.rtfd.org
  39. Challenges

    •  Multi-allelic variants are a bugger.
    •  Even harder with polyploidy. The VCF format is ill-suited to this.
    •  Versioning & distributing annos. See GGD: https://github.com/arq5x/ggd
  40. Thank you. Funding: Brent Pedersen Ryan Layer Jim Havrilla

  41. First discovery with GEMINI: Defects in mitochondrial mRNA maturation cause radiosensitivity

    Sample A21: chr10, MTPAP, exon9, N478D homozygote