
Making queries of the genome less difficult.

Aaron Quinlan
October 31, 2015

Presented on 30-Oct-2015 in the "Sequencing pipelines and assembly" session at the CSHL Genome Informatics Meeting.

Transcript

  1. Aaron Quinlan, University of Utah. quinlanlab.org, @aaronquinlan

    Making queries of the genome less difficult.
  2. • inconsistent chromosome labels.
    • different sorting criteria.
    • mixed UNIX/Windows newlines.
    • file violates spec with vigor.
    • program expects exact extension.
    • file is gzipped, not bgzipped.
    • annotations use different genome builds.
    • tool only works for one format.
    • tool is hard-coded for a specific build.
    • tool requires an act of the gods to compile.
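One of the pain points above, "gzipped, not bgzipped", can be checked before a tool fails on it. The following is a minimal sketch, assuming only the BGZF block layout from the SAM specification (a gzip member whose extra field carries a `BC` subfield of length 2). The function name `is_bgzf` is a hypothetical helper, not part of any tool mentioned in the talk.

```python
import gzip
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if the bytes start with a BGZF (bgzip) block header."""
    # Need the gzip magic bytes, the deflate method, and the FEXTRA flag (0x04).
    if len(header) < 12 or header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
        return False
    (xlen,) = struct.unpack("<H", header[10:12])
    extra = header[12:12 + xlen]
    # Walk the extra subfields looking for the 'B','C' subfield of length 2.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False


# Plain gzip output has no extra field, so it is not BGZF.
plain = gzip.compress(b"chr1\t100\n")

# The 28-byte empty BGZF block that bgzip appends as an end-of-file marker.
bgzf_eof = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)

print(is_bgzf(plain), is_bgzf(bgzf_eof))
```

In practice the first 18 or so bytes of a file are enough to pass to such a check.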
  3. vcfanno will annotate your VCF with panache.

    Naked VCF + configuration file → vcfanno → VCF w/ annotations in INFO field.

    VCF, BED, GFF, BAM (soon BW).

    Brent Pedersen, https://github.com/brentp/vcfanno
  9. vcfanno configuration file.

        [[annotation]]
        file="ExAC.v3.vcf"
        fields=["AF", "AC_Het"]
        names=["exac_aaf", "exac_num_het"]
        ops=["first", "first"]

        [[annotation]]
        file="dbsnp.b141.vcf.gz"
        fields=["ID"]
        names=["rs_ids"]
        ops=["concat"]

        [[annotation]]
        file="gerp.elements.bed.gz"
        columns=[4,4]
        names=["gerp_mean","gerp_var"]
        ops=["mean", "js:variance(vals)"]

    • Match on POS+REF+ALT for VCF annotations.
    • Allows multiple annotations from each file.
    • Can rename the annotations in the resulting VCF.
    • Multiple operations to summarize the results of multiple hits in the annotation file: mean, max, min, concat, count, uniq, first, flag.
    • Javascript for custom computations; variance() is defined in custom.js.
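To make the summary operations concrete, here is a rough sketch of how several overlapping annotation records might be reduced to one value per op. This is not vcfanno's implementation. The `OPS` table and `summarize` function are hypothetical, and the behaviour of `uniq` and `flag` is my reading of the op names on the slide.

```python
from statistics import mean

# Hypothetical reducers keyed by the op names listed on the slide.
OPS = {
    "mean": lambda vals: mean(float(v) for v in vals),
    "max": lambda vals: max(float(v) for v in vals),
    "min": lambda vals: min(float(v) for v in vals),
    "concat": lambda vals: ",".join(str(v) for v in vals),
    "count": len,
    # Assumed meaning: keep each distinct value once, in first-seen order.
    "uniq": lambda vals: ",".join(dict.fromkeys(str(v) for v in vals)),
    "first": lambda vals: vals[0],
    # Assumed meaning: record only that at least one hit exists.
    "flag": lambda vals: True,
}


def summarize(hits, column, op):
    """Reduce the 1-based `column` of all overlapping `hits` with `op`."""
    vals = [hit[column - 1] for hit in hits]
    return OPS[op](vals) if vals else None


# Two made-up BED records overlapping one variant; column 4 holds a score.
hits = [
    ["1", "100", "200", "0.5"],
    ["1", "150", "250", "1.5"],
]

print(summarize(hits, 4, "mean"))
print(summarize(hits, 4, "concat"))
print(summarize(hits, 4, "count"))
```

Using the same column twice with different ops, as `columns=[4,4]` does in the configuration, simply means calling the reducer twice on the same values.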
  10. Before and after vcfanno (same configuration file as slide 9).

    Naked VCF: AC=11;AF=0.017

    Dressed VCF: AC=11;AF=0.017;exac_aaf=0.0012;exac_num_het=8;rs_ids=1234;gerp_mean=7.25e-07;gerp_var=1.39e-08
  11. Individual-centric queries with Genotype Query Tools (GQT).

    Ryan Layer, github.com/ryanlayer/gqt. In press; preprint: http://biorxiv.org/content/early/2015/04/20/018259
  12. Existing tools handle variant-centric queries well:

        bcftools view \
          -r 17:43044295-43125483 \
          1000g.vcf

    OR

        tabix 1000g.vcf 17:43044295-43125483
  13. Idea: transpose the genotype matrix (G → G^T).

    Note: other tricks are included for speed/compression; please see the manuscript.
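The transposition idea can be illustrated with a toy matrix. This is only a sketch of the layout change, not GQT's on-disk format. The values and the helper `carried_variants` are made up for illustration, with each entry counting alternate alleles (0, 1 or 2).

```python
# Variant-major layout, as in a VCF: one row per variant, one column per individual.
variant_major = [
    [0, 1, 2, 0],  # variant 0
    [1, 1, 0, 0],  # variant 1
    [0, 0, 0, 2],  # variant 2
]

# Transpose: one row per individual, one column per variant.
individual_major = [list(column) for column in zip(*variant_major)]


def carried_variants(individual):
    """Indices of variants where this individual has at least one alternate allele."""
    return [v for v, g in enumerate(individual_major[individual]) if g > 0]


print(individual_major)
print(carried_variants(1))
```

After the transpose, everything about one individual sits in a single contiguous row, so a question about a subset of individuals touches only those rows rather than every line of the file.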
  14. Bitmap indices of variant metadata (VEP consequence).

    VEP consequence bitmap:

        synon.    1 0 0 0 0 0 0 … 0
        missense  0 0 0 0 0 0 1 … 0
        stopgain  0 0 0 0 0 0 0 … 1
        splice    0 0 0 0 1 0 0 … 0
        . . .
  15. Bitmap indices of genotype metadata (depth).

    Genotype depth bitmap:

        0      1 0 0 0 0 0 0 … 0
        1      0 1 0 0 0 0 0 … 0
        2      0 0 0 0 0 0 1 … 0
        3      0 0 0 1 0 0 0 … 0
        10     0 0 0 0 0 0 0 … 1
        20     0 0 0 0 0 1 0 … 0
        25-30  0 0 0 0 1 0 0 … 0
        >30    0 0 1 0 0 0 0 … 0

    Ongoing: how to optimize lossiness of quantization?
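The two bitmap slides describe the same technique: one bit vector per category, with a bit set wherever a record falls in that category. A small sketch follows, using Python integers as bit vectors. The depth bin edges, the sample data, and the function names are all assumptions for illustration, and here the depths are taken to be one individual's genotype depth at each of five variants.

```python
from collections import defaultdict


def build_bitmaps(labels):
    """Map each distinct label to an int whose bit i is set when labels[i] equals it."""
    bitmaps = defaultdict(int)
    for i, label in enumerate(labels):
        bitmaps[label] |= 1 << i
    return dict(bitmaps)


def positions(bitmap):
    """Indices of the set bits, lowest first."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]


def depth_bin(depth):
    """Quantize a depth into coarse bins (hypothetical edges, chosen for the example)."""
    if depth < 10:
        return "<10"
    if depth < 20:
        return "10-19"
    if depth <= 30:
        return "20-30"
    return ">30"


# Made-up metadata for five variants.
consequences = ["synonymous", "missense", "stopgain", "missense", "splice"]
depths = [3, 12, 45, 27, 8]

csq = build_bitmaps(consequences)
dp = build_bitmaps([depth_bin(d) for d in depths])

# OR combines categories; AND intersects independent conditions.
damaging = csq["missense"] | csq["stopgain"]
well_covered = dp["20-30"] | dp[">30"]

print(positions(damaging))
print(positions(damaging & well_covered))
```

The binning step is where information is lost: a query for depth above 25 cannot be answered exactly from a "20-30" bin, which is the quantization trade-off the slide flags as ongoing work.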
  17. Future: A Genome Query Language?

    Variant-centric (bcftools, BGT) + individual-centric (GQT, BGT) = General Genome Query Language (based on discussions w/ Heng Li).

    [Figure: GQT workflow. Labels: VCF (variants × individuals), PED, gqt convert vcf, gqt convert ped, GQT index (individuals × variants), SQL database.]

    Find variants that are common in cases and rare in controls:

        gqt query study.gqt study.db -p "phenotype == 2" -g "maf() > 0.05" -p "phenotype == 1" -g "maf() < 0.05"

    Example query:

        SELECT *
          VARIANT gene="TP53" AND impact="HIGH"
          SAMPLE affected IS (ancestry="EA"
                              AND phenotype=2
                              AND BMI>35)
          GENOTYPE affected.MAF()>0.05
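The "common in cases and rare in controls" query pairs a sample filter with a genotype filter, twice, and keeps variants that pass both pairs. A plain-Python sketch of that logic is below. It is not how gqt evaluates the query, and the data, the `maf` arithmetic (minor allele frequency from diploid alternate-allele counts), and the function names are assumptions made for the example.

```python
def maf(genotypes):
    """Minor allele frequency from diploid alt-allele counts (0, 1, 2); None is a missing call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def common_in_cases_rare_in_controls(genotypes, phenotype, threshold=0.05):
    """Variants with MAF > threshold among phenotype 2 and MAF < threshold among phenotype 1."""
    cases = [s for s, p in phenotype.items() if p == 2]
    controls = [s for s, p in phenotype.items() if p == 1]
    selected = []
    for variant, calls in genotypes.items():
        case_maf = maf([calls[s] for s in cases])
        control_maf = maf([calls[s] for s in controls])
        if case_maf > threshold and control_maf < threshold:
            selected.append(variant)
    return selected


# Made-up study: phenotype 2 marks cases, phenotype 1 marks controls.
phenotype = {"s1": 2, "s2": 2, "s3": 1, "s4": 1}
genotypes = {
    "var1": {"s1": 1, "s2": 2, "s3": 0, "s4": 0},
    "var2": {"s1": 0, "s2": 0, "s3": 1, "s4": 0},
    "var3": {"s1": 1, "s2": 1, "s3": 1, "s4": 1},
}

print(common_in_cases_rare_in_controls(genotypes, phenotype))
```

This version scans every variant for every sample. The point of the transposed, bitmap-indexed layout described earlier in the talk is to avoid that scan.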