Making queries of the genome less difficult.

Aaron Quinlan
October 31, 2015

Presented on 30-Oct-2015 in the "Sequencing pipelines and assembly" session at the CSHL Genome Informatics Meeting.

Transcript

  1. 1.

    Making queries of the genome less difficult.

    Aaron Quinlan, University of Utah, quinlanlab.org, @aaronquinlan
  2. 4.

    • inconsistent chromosome labels.
    • different sorting criteria.
    • mixed UNIX/Windows newlines.
    • file violates spec with vigor.
    • program expects exact extension.
    • file is gzipp'ed, not bgzipp'ed.
    • annotations use diff. genome builds.
    • tool only works for one format.
    • tool is hard-coded for specific build.
    • tool requires act of gods to compile.
  3. 5.

    vcfanno will annotate your VCF with panache.

    Naked VCF + configuration file → vcfanno → VCF w/ annotations in INFO field

    VCF, BED, GFF, BAM (soon BW)

    Brent Pedersen, https://github.com/brentp/vcfanno
  4. 6.

    vcfanno configuration file:

    [[annotation]]
    file="ExAC.v3.vcf"
    fields=["AF", "AC_Het"]
    names=["exac_aaf", "exac_num_het"]
    ops=["first", "first"]

    [[annotation]]
    file="dbsnp.b141.vcf.gz"
    fields=["ID"]
    names=["rs_ids"]
    ops=["concat"]

    [[annotation]]
    file="gerp.elements.bed.gz"
    columns=[4,4]
    names=["gerp_mean","gerp_var"]
    ops=["mean", "js:variance(vals)"]
  5. 7.

    vcfanno configuration file (same as slide 6). Match on POS+REF+ALT for VCF annotations.
  6. 8.

    vcfanno configuration file (same as slide 6). Allows multiple annotations from each file.
  7. 9.

    vcfanno configuration file (same as slide 6). Can rename the annotations in the resulting VCF.
  8. 10.

    vcfanno configuration file (same as slide 6). Multiple operations to summarize the results of multiple hits in annot. file: mean, max, min, concat, count, uniq, first, flag.
  9. 11.

    vcfanno configuration file (same as slide 6). Javascript for custom computations. variance() defined in custom.js
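
The summarizing operations named on slides 10 and 11 can be illustrated with a small Python sketch. This is not vcfanno's code: the `OPS` table, the `summarize` helper and the choice of population variance as a stand-in for the slide's `js:variance(vals)` are all assumptions made for illustration.

```python
from statistics import mean, pvariance

# Hypothetical reducers named after the ops listed on the slide.
# Each one collapses the values from all overlapping annotation records.
OPS = {
    "mean": lambda vals: mean([float(v) for v in vals]),
    "max": lambda vals: max(float(v) for v in vals),
    "min": lambda vals: min(float(v) for v in vals),
    "concat": lambda vals: ",".join(str(v) for v in vals),
    "count": lambda vals: len(vals),
    # dict.fromkeys keeps the first occurrence of each value, in order.
    "uniq": lambda vals: ",".join(dict.fromkeys(str(v) for v in vals)),
    "first": lambda vals: vals[0],
    "flag": lambda vals: True,
    # Assumed stand-in for the custom JavaScript variance() on slide 11.
    "variance": lambda vals: pvariance([float(v) for v in vals]),
}


def summarize(vals, op):
    """Return one summary value for a list of hits, or None if there were no hits."""
    if not vals:
        return None
    return OPS[op](vals)


hits = [2.0, 4.0, 4.0, 6.0]  # made-up scores from four overlapping records
for op in ("mean", "variance", "count", "uniq"):
    print(op, summarize(hits, op))
```

Keeping the reducers in a table mirrors the configuration file, where each annotation names its op as a string.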
  10. 12.

    before and after vcfanno

    Naked VCF: AC=11;AF=0.017

    Dressed VCF (after vcfanno): AC=11;AF=0.017;exac_aaf=0.0012;exac_num_het=8;rs_ids=1234;gerp_mean=7.25e-07;gerp_var=1.39e-08

    (vcfanno configuration file as on slide 6)
  11. 14.

    Individual-centric queries with Genotype Query Tools (GQT)

    Ryan Layer, github.com/ryanlayer/gqt

    In press. http://biorxiv.org/content/early/2015/04/20/018259
  12. 16.
  13. 17.

    Existing tools handle variant-centric queries well:

    bcftools view -r 17:43044295-43125483 1000g.vcf

    OR

    tabix 1000g.vcf 17:43044295-43125483
  14. 19.
  15. 22.

    Idea: transpose the genotype matrix (G → G^T).

    Note: other tricks included for speed/compression; please see manuscript.
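
To make the transposition idea concrete, here is a minimal Python sketch on a toy matrix. It assumes G stores one row per variant (as a VCF does) with entries counting non-reference alleles; the data and function names are made up, and none of GQT's compression tricks are modeled.

```python
# Toy genotype matrix G: one row per variant, one column per individual.
# Entries count non-reference alleles (0, 1 or 2); the values are invented.
G = [
    [0, 1, 2, 0],  # variant 0
    [1, 0, 0, 0],  # variant 1
    [0, 2, 1, 1],  # variant 2
]


def transpose(matrix):
    """Return the transpose, so that each row holds one individual."""
    return [list(row) for row in zip(*matrix)]


GT = transpose(G)


def carriers_variant_major(matrix, individual):
    """Individual-centric query on G: reads one entry from every variant row."""
    return [v for v, row in enumerate(matrix) if row[individual] > 0]


def carriers_individual_major(matrix_t, individual):
    """Same query on the transpose: scans a single row."""
    return [v for v, g in enumerate(matrix_t[individual]) if g > 0]


print(carriers_variant_major(G, 2))
print(carriers_individual_major(GT, 2))
```

Both functions return the same variants; the difference is that the transposed layout keeps all of one individual's genotypes together, which is what an individual-centric query wants to read.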
  16. 25.
  17. 26.
  18. 27.
  19. 28.
  20. 29.
  21. 30.
  22. 33.
  23. 35.

    Bitmap indices of variant metadata (VEP consequence)

    VEP consequence bitmap:

    synon.    1 0 0 0 0 0 0 … 0
    missense  0 0 0 0 0 0 1 … 0
    stopgain  0 0 0 0 0 0 0 … 1
    splice    0 0 0 0 1 0 0 … 0
    . . .
  24. 37.

    Bitmap indices of genotype metadata (depth)

    Genotype depth bitmap:

    0      1 0 0 0 0 0 0 … 0
    1      0 1 0 0 0 0 0 … 0
    2      0 0 0 0 0 0 1 … 0
    3      0 0 0 1 0 0 0 … 0
    10     0 0 0 0 0 0 0 … 1
    20     0 0 0 0 0 1 0 … 0
    25-30  0 0 0 0 1 0 0 … 0
    >30    0 0 1 0 0 0 0 … 0

    Ongoing: how to optimize lossiness of quantization?
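
A rough Python sketch of a binned depth bitmap index follows. The bin boundaries in `BINS` are assumptions (the slide elides some of its bins), Python integers stand in for bitmaps, and nothing here reflects how GQT stores or compresses its indices.

```python
# Hypothetical depth bins as (low, high); None marks an open-ended top bin.
BINS = [(0, 0), (1, 9), (10, 19), (20, 30), (31, None)]


def bin_index(depth):
    """Return the position in BINS of the bin that contains this depth."""
    for i, (low, high) in enumerate(BINS):
        if depth >= low and (high is None or depth <= high):
            return i
    raise ValueError(f"depth {depth} not covered by any bin")


def build_bitmaps(depths):
    """One bitmap per bin; bit j is set when genotype j falls in that bin."""
    bitmaps = [0] * len(BINS)
    for j, depth in enumerate(depths):
        bitmaps[bin_index(depth)] |= 1 << j
    return bitmaps


def query_min_bin(bitmaps, first_bin):
    """OR together the bitmaps of every bin from first_bin upward."""
    result = 0
    for bitmap in bitmaps[first_bin:]:
        result |= bitmap
    return result


def set_bits(bitmap):
    """List the positions of the bits that are set."""
    return [j for j in range(bitmap.bit_length()) if (bitmap >> j) & 1]


depths = [0, 12, 35, 3, 27, 20, 8]  # made-up depths for seven genotypes
bitmaps = build_bitmaps(depths)
deep = set_bits(query_min_bin(bitmaps, 3))  # genotypes with depth >= 20
print(deep)
```

Each genotype sets a bit in exactly one bin, so a threshold query is a few OR operations. The cost is the lossiness the slide asks about: depths 20 and 27 share a bin here and can no longer be told apart.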
  25. 38.

    Future: A Genome Query Language?

    Variant-centric (bcftools, BGT) + Individual-centric (GQT, BGT) = General Genome Query Language (based on discussions w/ Heng Li)

    [Figure: a VCF and a PED file are converted with gqt convert vcf and gqt convert ped into a GQT index and an SQL database. Matrix axes: Individuals, Variants.]

    Find variants that are common in cases and rare in controls:

    gqt query study.gqt study.db -p "phenotype == 2" -g "maf() > 0.05" -p "phenotype == 1" -g "maf() < 0.05"
  26. 39.

    Future: A Genome Query Language? (same diagram and gqt example as slide 38)

    SELECT *
      VARIANT gene="TP53" AND impact="HIGH"
      SAMPLE affected IS (ancestry="EA" AND phenotype=2 AND BMI>35)
      GENOTYPE affected.MAF()>0.05
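
The "common in cases and rare in controls" query from slide 38 can be sketched in plain Python to show what is being computed. This is an illustration under stated assumptions, not GQT's implementation: genotypes are coded as alt-allele counts for diploid samples, `maf` is a simple minor allele frequency, and the variant names and data are invented. The phenotype codes follow the PED convention (2 = affected, 1 = unaffected).

```python
def maf(genotypes):
    """Minor allele frequency for diploid genotypes coded as alt-allele counts."""
    if not genotypes:
        return 0.0
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def common_in_cases_rare_in_controls(variants, phenotype, threshold=0.05):
    """Names of variants with MAF above threshold in cases and below it in controls."""
    cases = [i for i, p in enumerate(phenotype) if p == 2]
    controls = [i for i, p in enumerate(phenotype) if p == 1]
    hits = []
    for name, gts in variants.items():
        case_maf = maf([gts[i] for i in cases])
        control_maf = maf([gts[i] for i in controls])
        if case_maf > threshold and control_maf < threshold:
            hits.append(name)
    return hits


phenotype = [2, 2, 2, 1, 1, 1]  # three cases, then three controls
variants = {
    "var_a": [1, 1, 0, 0, 0, 0],  # seen only in cases
    "var_b": [0, 0, 0, 1, 0, 0],  # seen only in controls
    "var_c": [1, 0, 0, 1, 1, 0],  # seen in both groups
}
result = common_in_cases_rare_in_controls(variants, phenotype)
print(result)
```

The two list comprehensions play the role of the `-p` sample filters and the two MAF comparisons play the role of the `-g` genotype filters.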
  27. 40.

  28. 41.

  29. 43.