Origins of GEMINI: Genetics of hypersensitivity to ionizing radiation
Impact of standard radiation therapy in an undiagnosed ataxia-telangiectasia (A-T) patient
• 140 such patients screened for dysfunction in known radiosensitivity genes (e.g., ATM and NBN). None found.
• Thus, opportunity to discover new genes underlying response to DNA damage.
• Hypothesis: each patient has a single gene disorder, yet the phenotype is only observed when they receive radiation.
Ad hoc variant exploration: genotype/phenotype filters
gemini -q "SELECT * FROM variants WHERE impact_severity == 'HIGH' AND max_aaf_all <= 0.001" --gt-filter "(gt_types).(LDL > 300).(!=HOM_REF).(count > 100) and (gt_types).(LDL < 100).(!=HOM_REF).(count < 10)"
Which rare, deleterious variants are enriched in people with high LDL (>300 mg/dL) levels?
gemini -q "SELECT * FROM variants WHERE impact_severity == 'HIGH' AND max_aaf_all <= 0.001" --gt-filter "(gt_types).(breed='angus').(!=HOM_REF).(count > 100) and (gt_types).(breed='belgian').(!=HOM_REF).(count < 10)"
Which rare, deleterious variants are enriched in Angus cattle but not Belgian Blue? (theoretical at the moment)
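To make the logic of these genotype filters concrete, here is a minimal Python sketch of what a "count" rule evaluates for one variant. It is not GEMINI's implementation: the helper names (count_non_ref, passes) and the sample dictionaries are hypothetical, and the integer genotype codes are an assumption about how gt_types is encoded.

```python
# Assumed integer codes for gt_types (illustrative; check the GEMINI docs).
HOM_REF, HET, UNKNOWN, HOM_ALT = 0, 1, 2, 3


def count_non_ref(gt_types, samples, predicate):
    """Count samples that satisfy `predicate` and are not HOM_REF."""
    return sum(
        1
        for gt, sample in zip(gt_types, samples)
        if predicate(sample) and gt != HOM_REF
    )


def passes(gt_types, samples, min_high=100, max_low=10):
    """Mimic: (LDL > 300 ... count > min_high) and (LDL < 100 ... count < max_low)."""
    high = count_non_ref(gt_types, samples, lambda s: s["LDL"] > 300)
    low = count_non_ref(gt_types, samples, lambda s: s["LDL"] < 100)
    return high > min_high and low < max_low


# Toy cohort (made-up values) with small thresholds for illustration.
samples = [{"LDL": 350}, {"LDL": 320}, {"LDL": 90}, {"LDL": 150}]
enriched = passes([HET, HOM_ALT, HOM_REF, HET], samples, min_high=1, max_low=1)
not_enriched = passes([HET, HOM_REF, HET, HOM_REF], samples, min_high=1, max_low=1)
```

The same shape applies to the cattle example: only the sample predicate changes (breed instead of LDL).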
Two key drawbacks of GEMINI
• Currently best for exome studies. Scales poorly for WGS: genome >> exome in data size and complexity (non-coding).
• Anthropocentric. Currently human (build 37) only.
Improve variant annotation speed and flexibility
#CHROM POS ID REF ALT QUAL FILTER
2 41647 . A G 4495.41 PASS
2 45895 . A G 463.75 PASS
2 224970 . C T 4241.64 PASS
2 229934 . A G 5037.95 PASS
2 234130 . T G 3958 PASS
2 242732 . T TAAC 3193.19 PASS
2 242800 . T C 3929.77 PASS
2 243504 . C T 6628.06 PASS
2 243567 . T TA 3398.03 HRunFilter
2 262553 . T C 3503.49 PASS
2 264895 . G C 3774.13 PASS
2 269352 . G A 9802.28 PASS
2 276942 . A G 5878.58 PASS
2 277250 . G A 7051.35 PASS
2 279705 . C T 7139.54 PASS
2 283231 . A AT 6976.81 HRunFilter
2 675831 . G T 865.05 PASS
2 676177 . C G 4961.19 PASS
2 905368 . C G 101.98 ABFilter;
2 905369 . C G 28.97 ABFilter
2 905393 . C G 930.81 QDFilter
2 905427 . C G 140.17 QDFilter
2 905442 . A T 131.51 ABFilter
2 905492 . T G 550.3 QDFilter
2 905494 . C G 48.5 ABFilter
2 905533 . C T 320.33 ABFilter
2 905576 . T G 72.09 QDFilter
2 905581 . C T 1276.63 QDFilter
2 905595 . G C 390.15 ABFilter
2 905634 . A C 393.91 QDFilter
2 905687 . C G 3233.06 ABFilter
2 905736 . A T 1324.63 QDFilter
2 905763 . G C 15.12 ABFilter
Tabix’ed
.
.
.
[[annotation]]
file="ExAC.v3.vcf"
fields=["AF", "AC_Het"]
names=["exac_aaf", "exac_num_het"]
ops=["first", "first"]

[[annotation]]
file="dbsnp.b141.vcf.gz"
fields=["ID"]
names=["rs_ids"]
ops=["concat"]

[[annotation]]
file="gerp.elements.bed.gz"
columns=[4,4]
names=["gerp_mean", "gerp_var"]
ops=["mean", "lua:variance(vals)"]

vcfanno configuration file.
• Allows multiple annotations from each file.
• Can rename the annotations in the resulting VCF.
• Multiple operations to summarize the results of multiple hits in an annotation file: mean, max, min, concat, count, uniq, first, flag.
• Match on POS+REF+ALT for VCF annotations.
• Lua for custom computations. variance() defined in custom.js.
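The summarizing operations can be illustrated with a small Python sketch: given every value an annotation file returns for one variant, each op reduces the hits to a single value. This is an illustration of the idea rather than vcfanno's code. The OPS table and summarize() are hypothetical names, the comma-joined output for concat and uniq is an assumption, and variance here is a population variance standing in for the custom function.

```python
from statistics import mean, pvariance

# Hypothetical reducers, one per op name listed on the slide.
OPS = {
    "mean": lambda vals: mean(vals),
    "max": max,
    "min": min,
    "concat": lambda vals: ",".join(map(str, vals)),
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    "uniq": lambda vals: ",".join(dict.fromkeys(map(str, vals))),
    "count": len,
    "first": lambda vals: vals[0],
    "flag": lambda vals: len(vals) > 0,
    # Stand-in for the custom variance(vals) function (assumed population variance).
    "variance": pvariance,
}


def summarize(vals, op):
    """Reduce all overlapping annotation values for one variant with `op`."""
    return OPS[op](vals)


# Made-up scores from three overlapping annotation records.
hits = [2.0, 4.0, 4.0]
summary = {name: summarize(hits, name) for name in OPS}
```

Keeping the reducers in a lookup table mirrors the configuration file, where each output name is paired with exactly one op.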
“chromsweep”: a sweeping algorithm for pre-sorted data
[Figure: a query VCF (q.1, q.2, q.3) swept against sorted annotation files anno1 (1.1, 1.2, 1.3), anno2 (2.1, 2.2) and anno3 (3.1, 3.2), with a cache and a result set.]
chromsweep is the fundamental algorithm underlying our bedtools software.
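The sweeping idea can be sketched in a few lines of Python. This is a simplified, single-chromosome illustration with one annotation stream and half-open (start, end) tuples. It is not the bedtools or vcfanno implementation, and the function name is only a label.

```python
def chromsweep(queries, annotations):
    """Yield (query, hits) pairs.

    Both inputs must be sorted by start and lie on the same chromosome.
    Intervals are half-open (start, end) tuples.
    """
    cache = []  # annotations that may still overlap the current or later queries
    ann_iter = iter(annotations)
    pending = next(ann_iter, None)
    for q_start, q_end in queries:
        # Queries are sorted by start, so anything ending at or before this
        # query's start cannot overlap this or any later query.
        cache = [a for a in cache if a[1] > q_start]
        # Pull in every annotation that starts before this query ends.
        while pending is not None and pending[0] < q_end:
            if pending[1] > q_start:
                cache.append(pending)
            pending = next(ann_iter, None)
        # A cached record may start after a short (nested) query ends,
        # so overlap is still checked explicitly.
        hits = [a for a in cache if a[0] < q_end and a[1] > q_start]
        yield (q_start, q_end), hits


# Made-up intervals for illustration.
queries = [(5, 10), (8, 20), (30, 40)]
annotations = [(1, 6), (7, 9), (9, 15), (35, 36), (50, 60)]
swept = list(chromsweep(queries, annotations))
```

Each annotation record is read once and dropped from the cache as soon as the sweep has passed it, which is why the inputs must be pre-sorted.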
vcfanno implements the first parallel chromsweep
[Figure: query VCF with annotation files anno1, anno2, anno3.]
Step 1: partition the query set at “breaks” in the data or when N (e.g. 10) intervals are found
Step 2: Use Tabix to extract the records germane to a chunk from each annotation file
Step 3: Chromsweep each chunk independently.
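The three steps can be sketched in Python under some loud assumptions: annotations live in an in-memory list, fetch() is a stand-in for a Tabix region query, the per-chunk step uses a naive overlap scan instead of a full sweep to keep the example short, and max_gap is a hypothetical way to define a "break". None of the names come from vcfanno.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(queries, max_n=10, max_gap=1000):
    """Step 1: cut sorted queries into chunks at large gaps or every max_n."""
    chunks, current = [], []
    for q in queries:
        if current:
            furthest_end = max(end for _, end in current)
            if len(current) >= max_n or q[0] - furthest_end > max_gap:
                chunks.append(current)
                current = []
        current.append(q)
    if current:
        chunks.append(current)
    return chunks


def fetch(annotations, start, end):
    """Step 2 stand-in: return records overlapping [start, end), like a region query."""
    return [a for a in annotations if a[0] < end and a[1] > start]


def annotate_chunk(chunk, annotations):
    """Step 3: process one chunk using only the records fetched for it."""
    lo = min(start for start, _ in chunk)
    hi = max(end for _, end in chunk)
    local = fetch(annotations, lo, hi)
    return [(q, [a for a in local if a[0] < q[1] and a[1] > q[0]]) for q in chunk]


def parallel_annotate(queries, annotations, workers=4):
    """Run the chunks independently; map() keeps the original chunk order."""
    chunks = partition(queries)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: annotate_chunk(c, annotations), chunks))
    return [pair for chunk_result in results for pair in chunk_result]


# Made-up intervals for illustration.
queries = [(5, 10), (8, 20), (5000, 5010)]
annotations = [(1, 6), (9, 15), (5005, 5006)]
annotated = parallel_annotate(queries, annotations)
```

Because each chunk only touches the annotation records fetched for its own region, chunks share no state and can be processed in any order.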
vcfanno VCF
hg38…
• VCF from any species and any genome build.
• The vcfanno configuration file points to the appropriate annotations.
• The GEMINI database is created based on the vcfanno configuration file.
• GEMINI database creation should be ~60X faster.
How do we support other species?
Allows the use of the same query, regardless of species
gemini -q "SELECT * FROM variants WHERE cpg_density >= 0.9"
Which variants overlap CpG islands whose CpG density is greater than or equal to 0.9?
Human (hg38)
Cow (bosTau8)
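A toy sketch of why the same query can serve both species: if each database exposes the same column name, the SQL does not change. The example uses Python's sqlite3 module with two in-memory databases. The table layout, the build_db() helper and every row are invented for illustration and are not GEMINI's actual schema.

```python
import sqlite3


def build_db(rows):
    """Create a hypothetical, minimal variants table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variants (chrom TEXT, pos INTEGER, cpg_density REAL)"
    )
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?)", rows)
    return conn


# Made-up rows: one database per species, same column names.
human = build_db([("chr1", 1000, 0.95), ("chr1", 2000, 0.40)])
cow = build_db([("chr5", 300, 0.92), ("chr5", 900, 0.10)])

# The query text is identical for both databases.
QUERY = "SELECT chrom, pos FROM variants WHERE cpg_density >= 0.9"
human_hits = human.execute(QUERY).fetchall()
cow_hits = cow.execute(QUERY).fetchall()
```

In this picture, species-specific details stay in the annotation files named by the configuration, while the column names that queries rely on stay fixed.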
Summary
• GEMINI is a flexible framework for exploring genetic variation from WES and WGS studies.
• Integrates variants, genotypes, phenotypes and annotations into a simple database.
• Current focus:
• Improving scalability for WGS
• Support for any (diploid) species
• Expected release: April 2016
github.com/arq5x/gemini gemini.rtfd.org
Challenges
• Multi-allelic variants are a bugger.
• Even harder with polyploidy. The VCF format is ill-suited to this.
• Versioning & distributing annos.
See GGD: https://github.com/arq5x/ggd