Slide 1

Slide 1 text

Mining genetic variation in any species with GEMINI
Aaron Quinlan, University of Utah
quinlanlab.org, @aaronquinlan
Plant and Animal Genome Conference, January 9, 2016

Slide 2

Slide 2 text

Origins of GEMINI: Genetics of hypersensitivity to ionizing radiation
Impact of standard radiation therapy in an undiagnosed ataxia-telangiectasia (A-T) patient
•  140 such patients screened for dysfunction in known radiosensitivity genes (e.g., ATM and NBN). None found.
•  Thus, opportunity to discover new genes underlying response to DNA damage.
•  Hypothesis: each patient has a single gene disorder, yet the phenotype is only observed when they receive radiation.

Slide 3

Slide 3 text

Interpreting genetic variation: context is crucial
Genetic variation: aligned sequences differ at a single site, ...CCTCATGCATGGAAA... versus ...CCTCATGTATGGAAA...
Context: chromatin marks, DNA methylation, RNA expression, TF binding

Slide 4

Slide 4 text

GEMINI: a flexible framework for exploring genome variation
Uma Paila, Brad Chapman, Brent Pedersen
github.com/arq5x/gemini
gemini.rtfd.org

Slide 5

Slide 5 text

How does GEMINI work?

Slide 6

Slide 6 text

The GEMINI database model.

Slide 7

Slide 7 text

Ad hoc variant exploration: genotype/phenotype filters
Which rare, deleterious variants are enriched in people with high LDL (>300 mg/dL) levels?
gemini -q "SELECT * FROM variants WHERE impact_severity == 'HIGH' AND max_aaf_all <= 0.001" --gt-filter "(gt_types).(LDL > 300).(!=HOM_REF).(count > 100) and (gt_types).(LDL < 100).(!=HOM_REF).(count < 10)"
Which rare, deleterious variants are enriched in Angus cattle but not Belgian Blue? (theoretical at the moment)
gemini -q "SELECT * FROM variants WHERE impact_severity == 'HIGH' AND max_aaf_all <= 0.001" --gt-filter "(gt_types).(breed == 'angus').(!=HOM_REF).(count > 100) and (gt_types).(breed == 'belgian').(!=HOM_REF).(count < 10)"
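To make the logic of the genotype filter concrete, here is a minimal Python sketch of what a pair of such clauses computes for one variant: count the non-HOM_REF samples in each phenotype group, then compare the counts to thresholds. This is not GEMINI's implementation. The function names, the string genotype labels, and the sample dictionaries are all hypothetical, and the thresholds are scaled down so that a four-sample example fits on a slide.

```python
def count_non_ref(gt_types, samples, in_group):
    """Count samples selected by in_group whose genotype is not HOM_REF."""
    return sum(
        1
        for gt, sample in zip(gt_types, samples)
        if in_group(sample) and gt != "HOM_REF"
    )


def enriched(gt_types, samples, min_cases, max_controls):
    """Mimic '(LDL > 300)...(count > min_cases) and (LDL < 100)...(count < max_controls)'."""
    high = count_non_ref(gt_types, samples, lambda s: s["LDL"] > 300)
    low = count_non_ref(gt_types, samples, lambda s: s["LDL"] < 100)
    return high > min_cases and low < max_controls


# Hypothetical cohort: one dict of phenotypes per sample, in genotype order.
samples = [{"LDL": 350}, {"LDL": 320}, {"LDL": 90}, {"LDL": 150}]

# One list of genotype labels per variant, aligned with `samples`.
variant_a = ["HET", "HOM_ALT", "HOM_REF", "HET"]
variant_b = ["HOM_REF", "HET", "HET", "HET"]

print(enriched(variant_a, samples, min_cases=1, max_controls=1))
print(enriched(variant_b, samples, min_cases=1, max_controls=1))
```

Swapping the LDL lambdas for a test on a breed column would give the cattle version of the same question.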

Slide 8

Slide 8 text

Automated tools for disease inheritance models
Dominant: A/A, A/G, A/G
Recessive (consanguineous): A/G, G/G, A/G
Recessive (compound heterozygous): C/C, A/G, A/A, A/G, C/T, C/T
De novo: A/A, A/G, A/A
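As a rough illustration of what an inheritance-model test checks, the sketch below applies the de novo and recessive patterns to a single trio. It assumes simple "A/G"-style genotype strings and a known alternate allele. The function names are hypothetical, and a real tool would also need to consider pedigree structure, affected status, and genotype quality, none of which is modelled here.

```python
def alleles(genotype):
    """Split a genotype string such as 'A/G' into a tuple of alleles."""
    return tuple(genotype.split("/"))


def is_de_novo(father, mother, child, alt):
    """The child carries the alt allele and neither parent does."""
    f, m, c = alleles(father), alleles(mother), alleles(child)
    return f.count(alt) == 0 and m.count(alt) == 0 and c.count(alt) >= 1


def is_recessive(father, mother, child, alt):
    """Both parents are heterozygous carriers; the child is homozygous alt."""
    f, m, c = alleles(father), alleles(mother), alleles(child)
    parents_are_carriers = f.count(alt) == 1 and m.count(alt) == 1
    return parents_are_carriers and c == (alt, alt)


# Trios shaped like the ones drawn on the slide (alt allele G).
print(is_de_novo("A/A", "A/A", "A/G", alt="G"))
print(is_recessive("A/G", "A/G", "G/G", alt="G"))
```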

Slide 9

Slide 9 text

GEMINI is popular for rare disease research. UW Center for Mendelian Genomics

Slide 10

Slide 10 text

Two key drawbacks of GEMINI
•  Currently best for exome studies. Scales poorly for WGS: genome >> exome in both data size and complexity (non-coding).
•  Anthropocentric. Currently human (build 37) only.

Slide 11

Slide 11 text

Improve speed for WGS datasets: use GQT Genotype Query Tools github.com/ryanlayer/gqt Nature Methods, 2015

Slide 12

Slide 12 text

Improve speed for WGS datasets: RDBMS flexibility
SQLite (current), PostgreSQL, MySQL, CloudSQL, BigQuery
SQLAlchemy: database abstraction layer

Slide 13

Slide 13 text

Improve variant annotation speed and flexibility
#CHROM  POS     ID  REF  ALT   QUAL     FILTER
2       41647   .   A    G     4495.41  PASS
2       45895   .   A    G     463.75   PASS
2       224970  .   C    T     4241.64  PASS
2       229934  .   A    G     5037.95  PASS
2       234130  .   T    G     3958     PASS
2       242732  .   T    TAAC  3193.19  PASS
2       242800  .   T    C     3929.77  PASS
2       243504  .   C    T     6628.06  PASS
2       243567  .   T    TA    3398.03  HRunFilter
2       262553  .   T    C     3503.49  PASS
2       264895  .   G    C     3774.13  PASS
2       269352  .   G    A     9802.28  PASS
2       276942  .   A    G     5878.58  PASS
2       277250  .   G    A     7051.35  PASS
2       279705  .   C    T     7139.54  PASS
2       283231  .   A    AT    6976.81  HRunFilter
2       675831  .   G    T     865.05   PASS
2       676177  .   C    G     4961.19  PASS
2       905368  .   C    G     101.98   ABFilter;
2       905369  .   C    G     28.97    ABFilter
2       905393  .   C    G     930.81   QDFilter
2       905427  .   C    G     140.17   QDFilter
2       905442  .   A    T     131.51   ABFilter
2       905492  .   T    G     550.3    QDFilter
2       905494  .   C    G     48.5     ABFilter
2       905533  .   C    T     320.33   ABFilter
2       905576  .   T    G     72.09    QDFilter
2       905581  .   C    T     1276.63  QDFilter
2       905595  .   G    C     390.15   ABFilter
2       905634  .   A    C     393.91   QDFilter
2       905687  .   C    G     3233.06  ABFilter
2       905736  .   A    T     1324.63  QDFilter
2       905763  .   G    C     15.12    ABFilter
Tabix'ed ...

Slide 14

Slide 14 text

vcfanno: flexible and fast VCF annotation
Naked VCF -> vcfanno -> VCF w/ annotations in INFO field
Annotation file formats: VCF, BED, GFF, BAM, (soon BW)
Brent Pedersen
https://github.com/brentp/vcfanno
Manuscript in prep.

Slide 15

Slide 15 text

vcfanno configuration file.
[[annotation]]
file="ExAC.v3.vcf"
fields=["AF", "AC_Het"]
names=["exac_aaf", "exac_num_het"]
ops=["first", "first"]
[[annotation]]
file="dbsnp.b141.vcf.gz"
fields=["ID"]
names=["rs_ids"]
ops=["concat"]
[[annotation]]
file="gerp.elements.bed.gz"
columns=[4,4]
names=["gerp_mean","gerp_var"]
ops=["mean", "lua:variance(vals)"]
•  Allows multiple annotations from each file.
•  Can rename the annotations in the resulting VCF.
•  Multiple operations to summarize the results of multiple hits in the annotation file: mean, max, min, concat, count, uniq, first, flag.
•  Match on POS+REF+ALT for VCF annotations.
•  Lua for custom computations. variance() defined in custom.js
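The summarizing operations are easiest to see on a small list of values. The sketch below is a plain-Python illustration of what each named op does when a variant hits several records in an annotation file. It is not vcfanno's code (vcfanno is a separate tool), and the behaviour chosen for an empty hit list is an assumption made for this example.

```python
from statistics import mean

# One reducer per op name listed on the slide.
OPS = {
    "mean": mean,
    "max": max,
    "min": min,
    "first": lambda vals: vals[0],
    "concat": lambda vals: ",".join(map(str, vals)),
    # dict.fromkeys keeps first-seen order while dropping repeats.
    "uniq": lambda vals: ",".join(dict.fromkeys(map(str, vals))),
    "count": len,
    "flag": lambda vals: True,
}


def summarize(values, op):
    """Collapse the values from all overlapping annotation records into one."""
    if op not in OPS:
        raise ValueError(f"unknown op: {op}")
    if not values:
        # Assumption: no hits means count 0, flag False, nothing otherwise.
        return {"count": 0, "flag": False}.get(op)
    return OPS[op](values)


print(summarize([1.0, 3.0], "mean"))
print(summarize(["rs1", "rs2", "rs1"], "concat"))
print(summarize(["rs1", "rs2", "rs1"], "uniq"))
```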

Slide 16

Slide 16 text

Before and after vcfanno
Naked VCF: AC=11;AF=0.017
Dressed VCF: AC=11;AF=0.017;exac_aaf=0.0012;exac_num_het=8;rs_ids=1234;gerp_mean=7.25e-07;gerp_var=1.39e-08
vcfanno configuration file:
[[annotation]]
file="ExAC.v3.vcf"
fields=["AF", "AC_Het"]
names=["exac_aaf", "exac_num_het"]
ops=["first", "first"]
[[annotation]]
file="dbsnp.b141.vcf.gz"
fields=["ID"]
names=["rs_ids"]
ops=["concat"]
[[annotation]]
file="gerp.elements.bed.gz"
columns=[4,4]
names=["gerp_mean","gerp_var"]
ops=["mean", "js:variance(vals)"]

Slide 17

Slide 17 text

“chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2 anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [] cache result q.1 q.2 q.3 chromsweep is the fundamental algorithm underlying our bedtools software

Slide 18

Slide 18 text

“chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2 anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [3.1 ] cache result q.1 q.2 q.3

Slide 19

Slide 19 text

“chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2 anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {3.1 } [3.1 ] cache result q.1 q.2 q.3 q.1 =

Slide 20

Slide 20 text

“chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2 anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {3.1, 1.1 } [3.1, 1.1 ] cache result q.1 q.2 q.3 q.1 =

Slide 21

Slide 21 text

“chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2 anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [3.1, 1.1 ] cache result q.1 q.2 q.3 3.1, 1.1 q.1 =

Slide 22

Slide 22 text

“chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2 anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [3.1 ] cache result q.1 q.2 q.3 3.1, 1.1 q.1 =

Slide 23

Slide 23 text

“chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2 anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [] cache result q.1 q.2 q.3 3.1, 1.1 q.1 =

Slide 24

Slide 24 text

“chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2 anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [] cache result q.1 q.2 q.3 q.2 = 3.1, 1.1 q.1 =

Slide 25

Slide 25 text

“chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2 anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {3.2 } [3.2 ] cache result q.1 q.2 q.3 q.2 = 3.1, 1.1 q.1 =

Slide 26

Slide 26 text

“chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2 anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {3.2,2.1 } [3.2,2.1 ] cache result q.1 q.2 q.3 q.2 = 3.1, 1.1 q.1 =

Slide 27

Slide 27 text

“chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2 anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {3.2,2.1,1.2 } [3.2,2.1,1.2 ] cache result q.1 q.2 q.3 q.2 = 3.1, 1.1 q.1 =

Slide 28

Slide 28 text

“chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2 anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {3.2,2.1,1.2 } [3.2,2.1,1.2 ] cache result q.1 q.2 q.3 q.2 = *2.1 stays in the cache 3.1, 1.1 q.1 =

Slide 29

Slide 29 text

“chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2 anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [3.2,1.2 ] cache result q.1 q.2 q.3 q.2 = Now 2.1 is removed 3.2,2.1,1.2 3.1, 1.1 q.1 =

Slide 30

Slide 30 text

“chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2 anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [3.2 ] cache result q.1 q.2 q.3 q.2 = 3.2,2.1,1.2 3.1, 1.1 q.1 =

Slide 31

Slide 31 text

“chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2 anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [] cache result q.1 q.2 q.3 q.2 = 3.2,2.1,1.2 3.1, 1.1 q.1 =

Slide 32

Slide 32 text

“chromsweep”: a sweeping algorithm for pre-sorted data VCF anno1 anno2 anno3 1.1 1.2 1.3 2.1 2.2 3.1 3.2 {} [] cache result q.1 q.2 q.3 q.3 = 1.3,2.2 q.2 = 3.2,2.1,1.2 3.1, 1.1 q.1 =
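The walkthrough above can be condensed into a short Python sketch. It assumes half-open (start, end, name) intervals on a single chromosome, with the queries and every annotation set sorted by start. The coordinates are invented so that the hits match the animation, and this is an illustration of the sweep idea rather than the bedtools or vcfanno implementation.

```python
def chromsweep(queries, *annotation_sets):
    """Yield (query_name, [hit_names]) using one forward pass over sorted inputs."""
    streams = [iter(annos) for annos in annotation_sets]
    pending = [next(stream, None) for stream in streams]
    cache = []
    for q_start, q_end, q_name in queries:
        # Evict cached intervals that end before this query starts: since
        # queries are sorted by start, they cannot overlap any later query.
        cache = [iv for iv in cache if iv[1] > q_start]
        # Advance each annotation stream up to the end of the query.
        for i, stream in enumerate(streams):
            while pending[i] is not None and pending[i][0] < q_end:
                if pending[i][1] > q_start:
                    cache.append(pending[i])
                pending[i] = next(stream, None)
        hits = [iv[2] for iv in cache if iv[0] < q_end and iv[1] > q_start]
        yield q_name, hits


# Hypothetical coordinates arranged to reproduce the hits in the animation.
queries = [(10, 20, "q.1"), (40, 50, "q.2"), (70, 80, "q.3")]
anno1 = [(12, 18, "1.1"), (42, 46, "1.2"), (72, 75, "1.3")]
anno2 = [(44, 48, "2.1"), (74, 78, "2.2")]
anno3 = [(8, 14, "3.1"), (38, 44, "3.2")]

results = list(chromsweep(queries, anno1, anno2, anno3))
for name, hits in results:
    print(name, "=", ", ".join(hits))
```

Each annotation record is read once and each query only scans the small cache, which is why pre-sorted input matters. The hits come out grouped by annotation file, so their order differs from the order drawn on the slides.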

Slide 33

Slide 33 text

vcfanno implements the first parallel chromsweep
VCF anno1 anno2 anno3
Step 1: Partition the query set at "breaks" in the data or when N (e.g. 10) intervals are found.
Step 2: Use Tabix to extract the records germane to a chunk from each annotation file.
Step 3: Chromsweep each chunk independently.
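Step 1 might look something like the sketch below, which cuts a sorted list of (chrom, start, end) query intervals into chunks. It treats a "break" as either a chromosome change or a gap wider than max_gap. That reading of "breaks", the max_gap parameter, and both function names are assumptions for illustration, not vcfanno's actual rules. The region helper returns the span one would then fetch from each annotation file in Step 2.

```python
def partition(queries, max_gap, chunk_size):
    """Split sorted (chrom, start, end) queries into independent chunks."""
    chunks, current = [], []
    for query in queries:
        if current:
            prev = current[-1]
            new_chrom = query[0] != prev[0]
            big_gap = not new_chrom and query[1] - prev[2] > max_gap
            if new_chrom or big_gap or len(current) >= chunk_size:
                chunks.append(current)
                current = []
        current.append(query)
    if current:
        chunks.append(current)
    return chunks


def region(chunk):
    """Span covered by a chunk: the region to extract from each annotation file."""
    return chunk[0][0], chunk[0][1], max(q[2] for q in chunk)


# Hypothetical variants: two close together, one far away, one on another chromosome.
queries = [("1", 100, 101), ("1", 150, 151), ("1", 5000, 5001), ("2", 10, 11)]
chunks = partition(queries, max_gap=1000, chunk_size=10)
for chunk in chunks:
    print(region(chunk), len(chunk))
```

Because chunks share no state, each one can be swept on its own core and the results written out in chunk order.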

Slide 34

Slide 34 text

vcfanno is speedy. 18 annotations: 29K variants / sec @ 12 cores

Slide 35

Slide 35 text

How do we support other species?
Input to vcfanno: VCF (hg38…), i.e. a VCF from any species and any genome build
•  vcfanno configuration file points to appropriate annotations.
•  GEMINI database is created based on vcfanno configuration file.
•  GEMINI database creation should be ~60X faster.

Slide 36

Slide 36 text

How? Simply point vcfanno to the relevant annotations
Human (hg38)
[[annotation]]
file="cpg.hg38.bed.gz"
columns=[4]
names=["cpg_density"]
ops=["mean"]
[[annotation]]
file="rmsk.hg38.bed.gz"
columns=[4]
names=["rmsk"]
ops=["concat"]
[[annotation]]
file="cytoband.hg38.bed.gz"
columns=[4]
names=["cytoband"]
ops=["uniq"]
Cow (bosTau8)
[[annotation]]
file="cpg.bosTau8.bed.gz"
columns=[4]
names=["cpg_density"]
ops=["mean"]
[[annotation]]
file="rmsk.bosTau8.bed.gz"
columns=[4]
names=["rmsk"]
ops=["concat"]
[[annotation]]
file="cytoband.bosTau8.bed.gz"
columns=[4]
names=["cytoband"]
ops=["uniq"]
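Since the two configurations differ only in the genome build embedded in each file name, they can be generated from one template. The sketch below does that in Python. The TRACKS table and the vcfanno_config function are hypothetical helpers, the file-name pattern is taken from the slide, and the example uses the `columns` key and the `uniq` op that appear on the earlier vcfanno configuration slide for BED annotations.

```python
import json

# (file prefix, annotation name, summarizing op) for each BED track.
TRACKS = [
    ("cpg", "cpg_density", "mean"),
    ("rmsk", "rmsk", "concat"),
    ("cytoband", "cytoband", "uniq"),
]


def vcfanno_config(build):
    """Render one [[annotation]] block per track for the given genome build."""
    blocks = []
    for prefix, name, op in TRACKS:
        path = f"{prefix}.{build}.bed.gz"
        blocks.append(
            "\n".join(
                [
                    "[[annotation]]",
                    # json.dumps gives double-quoted strings, which TOML accepts.
                    f"file={json.dumps(path)}",
                    "columns=[4]",
                    f"names={json.dumps([name])}",
                    f"ops={json.dumps([op])}",
                ]
            )
        )
    return "\n\n".join(blocks)


print(vcfanno_config("bosTau8"))
```

The annotation names stay the same across builds, which is what lets the same GEMINI query run unchanged on either species.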

Slide 37

Slide 37 text

Allows the use of the same query, regardless of species
Which variants overlap CpG islands whose CpG density is greater than or equal to 0.9?
gemini -q "SELECT * FROM variants WHERE cpg_density >= 0.9"
Human (hg38) and Cow (bosTau8)

Slide 38

Slide 38 text

Summary
•  GEMINI is a flexible framework for exploring genetic variation from WES and WGS studies.
•  Integrates variants, genotypes, phenotypes and annotations into a simple database.
•  Current focus:
    •  Improving scalability for WGS
    •  Support for any (diploid) species
•  Expected release: April 2016
github.com/arq5x/gemini
gemini.rtfd.org

Slide 39

Slide 39 text

Challenges
•  Multi-allelic variants are a bugger.
•  Even harder with polyploidy. The VCF format is ill-suited to this.
•  Versioning & distributing annotations. See GGD: https://github.com/arq5x/ggd

Slide 40

Slide 40 text

Thank you. Funding: Brent Pedersen Ryan Layer Jim Havrilla

Slide 41

Slide 41 text

First discovery with GEMINI: Defects in mitochondrial mRNA maturation cause radiosensitivity
Sample A21: chr10, MTPAP, exon9, N478D homozygote