Making queries of the genome less difficult.

Aaron Quinlan
October 31, 2015

Presented on 30-Oct-2015 in the "Sequencing pipelines and assembly" session at the CSHL Genome Informatics Meeting.

Transcript

  1. Aaron Quinlan, University of Utah, quinlanlab.org, @aaronquinlan

    Making queries of the genome less difficult.
  2. Genetic variation: aligned sequences that differ at a single base (...CCTCATGCATGGAAA... vs. ...CCTCATGTATGGAAA...).

    Variant prioritization requires context.
  3. Genetic variation: aligned sequences that differ at a single base (...CCTCATGCATGGAAA... vs. ...CCTCATGTATGGAAA...).

    Chromatin marks, DNA methylation, RNA expression, TF binding. Variant prioritization requires context.
  4. • inconsistent chromosome labels.
     • different sorting criteria.
     • mixed UNIX/Windows newlines.
     • file violates spec with vigor.
     • program expects exact extension.
     • file is gzipped, not bgzipped.
     • annotations use different genome builds.
     • tool only works for one format.
     • tool is hard-coded for specific build.
     • tool requires act of gods to compile.
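Several of the problems on slide 4 are mechanical and can be normalized before any query runs. The following is a minimal sketch, assuming one possible convention (strip a "chr" prefix, name the mitochondrion "MT"). The function names are hypothetical and are not part of any tool mentioned in the talk.

```python
def normalize_chrom(label: str) -> str:
    """Strip an optional 'chr' prefix and map 'M' to 'MT' (an assumed convention)."""
    name = label.strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    return "MT" if name.upper() == "M" else name


def normalize_line(line: str) -> str:
    """Drop a trailing UNIX (\\n) or Windows (\\r\\n) newline."""
    return line.rstrip("\r\n")


def chrom_sort_key(label: str):
    """Order numeric chromosomes numerically, then all others alphabetically."""
    name = normalize_chrom(label)
    if name.isdigit():
        return (0, int(name), "")
    return (1, 0, name)


labels = ["chr10", "2", "chrX", "chr1", "chrM"]
print(sorted(labels, key=chrom_sort_key))
```

Sorting with an explicit key avoids the usual surprise where "chr10" sorts before "chr2" under plain string comparison.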
  5. vcfanno will annotate your VCF with panache.

    Naked VCF + configuration file → vcfanno → VCF w/ annotations in INFO field.
    Annotation sources: VCF, BED, GFF, BAM (soon BW).
    Brent Pedersen, https://github.com/brentp/vcfanno
  6. vcfanno configuration file.

        [[annotation]]
        file="ExAC.v3.vcf"
        fields=["AF", "AC_Het"]
        names=["exac_aaf", "exac_num_het"]
        ops=["first", "first"]

        [[annotation]]
        file="dbsnp.b141.vcf.gz"
        fields=["ID"]
        names=["rs_ids"]
        ops=["concat"]

        [[annotation]]
        file="gerp.elements.bed.gz"
        columns=[4,4]
        names=["gerp_mean", "gerp_var"]
        ops=["mean", "js:variance(vals)"]
  7. vcfanno configuration file (as on slide 6).

     • Match on POS+REF+ALT for VCF annotations.
  8. vcfanno configuration file (as on slide 6).

     • Allows multiple annotations from each file.
     • Match on POS+REF+ALT for VCF annotations.
  9. vcfanno configuration file (as on slide 6).

     • Allows multiple annotations from each file.
     • Can rename the annotations in the resulting VCF.
     • Match on POS+REF+ALT for VCF annotations.
  10. vcfanno configuration file (as on slide 6).

     • Allows multiple annotations from each file.
     • Can rename the annotations in the resulting VCF.
     • Multiple operations to summarize the results of multiple hits in the annotation file: mean, max, min, concat, count, uniq, first, flag.
     • Match on POS+REF+ALT for VCF annotations.
  11. vcfanno configuration file (as on slide 6).

     • Allows multiple annotations from each file.
     • Can rename the annotations in the resulting VCF.
     • Multiple operations to summarize the results of multiple hits in the annotation file: mean, max, min, concat, count, uniq, first, flag.
     • Match on POS+REF+ALT for VCF annotations.
     • Javascript for custom computations. variance() defined in custom.js.
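The ops named on the slide each reduce the values from all overlapping annotation records to a single value for the INFO field. The sketch below illustrates that idea only. It is not vcfanno's implementation, and the exact semantics chosen here for "uniq" (order-preserving, comma-joined) and "flag" (presence) are assumptions.

```python
from statistics import mean

# Hypothetical reducers keyed by the op names listed on the slide.
OPS = {
    "mean": lambda vals: mean(float(v) for v in vals),
    "max": lambda vals: max(float(v) for v in vals),
    "min": lambda vals: min(float(v) for v in vals),
    "concat": lambda vals: ",".join(str(v) for v in vals),
    "count": lambda vals: len(vals),
    # dict.fromkeys keeps first-seen order while dropping repeats.
    "uniq": lambda vals: ",".join(dict.fromkeys(str(v) for v in vals)),
    "first": lambda vals: vals[0],
    "flag": lambda vals: True,
}


def summarize(vals, op):
    """Reduce the values from all overlapping records to one annotation value."""
    if not vals:
        return None  # no overlapping record, so nothing to annotate
    return OPS[op](vals)


ids = ["rs1", "rs2", "rs1"]
print(summarize(ids, "concat"), summarize(ids, "uniq"), summarize(ids, "count"))
print(summarize(["1.0", "2.0", "3.0"], "mean"))
```

Keeping the reducers in a lookup table mirrors how the configuration file selects an op by name for each field.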
  12. Before and after vcfanno.

    Naked VCF: AC=11;AF=0.017
    Dressed VCF: AC=11;AF=0.017;exac_aaf=0.0012;exac_num_het=8;rs_ids=1234;gerp_mean=7.25e-07;gerp_var=1.39e-08
    Configuration file as on slide 6.
  13. New parallel "chromsweep". vcfanno is speedy.

    18 annotations: 29K variants / sec @ 12 cores. See poster 160 for details.
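The talk does not describe the parallel chromsweep itself, so the following is only a serial sketch of the general sweep idea it builds on: walk two start-sorted interval lists once, keeping a small cache of annotations that can still overlap. Half-open (start, end) intervals on a single chromosome are assumed, and the function name is hypothetical.

```python
def sweep(queries, annotations):
    """Yield (query, overlapping annotations) for two start-sorted interval lists.

    Intervals are half-open (start, end) tuples on one chromosome.
    """
    cache = []  # annotations that may still overlap upcoming queries
    i = 0       # index of the next annotation not yet examined
    for q_start, q_end in queries:
        # Queries are sorted by start, so an annotation ending at or before
        # this query's start cannot overlap this or any later query.
        cache = [a for a in cache if a[1] > q_start]
        # Pull in every annotation that starts before this query ends.
        while i < len(annotations) and annotations[i][0] < q_end:
            if annotations[i][1] > q_start:
                cache.append(annotations[i])
            i += 1
        # A cached annotation loaded for an earlier, longer query may start
        # at or after this query's end, so filter on start as well.
        yield (q_start, q_end), [a for a in cache if a[0] < q_end]


queries = [(0, 100), (10, 20), (150, 160)]
annotations = [(5, 15), (50, 60), (155, 170), (200, 210)]
for query, hits in sweep(queries, annotations):
    print(query, hits)
```

Each annotation is examined once by the while loop, which is why sorted inputs matter so much for this style of algorithm.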
  14. Individual-centric queries with Genotype Query Tools (GQT).

    Ryan Layer. github.com/ryanlayer/gqt. In press; preprint: http://biorxiv.org/content/early/2015/04/20/018259
  15. A variant-centric query: Which variants affect BRCA1?

  16. None
  17. Existing tools handle variant-centric queries well.

        bcftools view \
          -r 17:43044295-43125483 \
          1000g.vcf

    OR

        tabix 1000g.vcf 17:43044295-43125483
  18. An individual-centric query: In which variants are all affected males

    heterozygous?
  19. None
  20. In which variants are all affected males heterozygous?

  21. In which variants are all affected males heterozygous?

  22. Idea: transpose the genotype matrix (G → G^T).

    Note: other tricks included for speed/compression; please see manuscript.
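To see why transposing helps, consider the question from slide 18. With one row per individual, each individual's heterozygous calls can be packed into a bitmap, and the query becomes a bitwise AND across the selected individuals. This is a toy sketch under assumed genotype codes (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate); it leaves out the compression and speed tricks the slide refers to, and the function names are hypothetical.

```python
# Toy genotype matrix G: rows are variants, columns are individuals.
G = [
    [1, 1, 0, 1],  # variant 0
    [0, 1, 1, 2],  # variant 1
    [1, 2, 1, 1],  # variant 2
]


def transpose(matrix):
    """Return the transpose: rows become individuals, columns become variants."""
    return [list(row) for row in zip(*matrix)]


def het_bitmap(genotypes):
    """Pack one individual's genotypes into an int; bit v is set if variant v is het."""
    bits = 0
    for v, gt in enumerate(genotypes):
        if gt == 1:
            bits |= 1 << v
    return bits


def variants_where_all_het(matrix, individuals):
    """Indices of variants at which every listed individual is heterozygous."""
    by_individual = transpose(matrix)
    n_variants = len(matrix)
    result = (1 << n_variants) - 1  # start with every variant selected
    for ind in individuals:
        result &= het_bitmap(by_individual[ind])
    return [v for v in range(n_variants) if (result >> v) & 1]


affected_males = [0, 3]  # hypothetical column indices
print(variants_where_all_het(G, affected_males))
```

In the variant-major layout the same question touches every row of the file; in the individual-major layout it touches only the rows for the selected individuals.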
  23. In which variants are all affected males heterozygous?

  24. Affected males. In which variants are all affected males heterozygous?

  25. None
  26. None
  27. None
  28. None
  29. None
  30. None
  31. In which variants are all affected males heterozygous?

  32. In which variants are all affected males heterozygous?

  33. None
  34. Great, but what about indexing variant and genotype metadata?

  35. Bitmap indices of variant metadata (VEP consequence)

    VEP consequence bitmap:
    synon.     1 0 0 0 0 0 0 … 0
    missense   0 0 0 0 0 0 1 … 0
    stopgain   0 0 0 0 0 0 0 … 1
    splice     0 0 0 0 1 0 0 … 0
    . . .
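A bitmap index over a categorical field keeps one bitmap per distinct value, with bit i set when record i carries that value. The sketch below uses Python integers as bitmaps. The consequence labels and function names are illustrative, and this is not GQT's on-disk format.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] == value."""
    index = defaultdict(int)
    for i, value in enumerate(values):
        index[value] |= 1 << i
    return dict(index)


def matching_positions(bitmap):
    """List the positions of the set bits in a bitmap, lowest first."""
    positions, i = [], 0
    while bitmap:
        if bitmap & 1:
            positions.append(i)
        bitmap >>= 1
        i += 1
    return positions


consequences = ["synonymous", "missense", "splice", "missense", "stopgain"]
index = build_bitmap_index(consequences)

# "missense OR stopgain" is a single bitwise OR of two bitmaps.
damaging = index["missense"] | index["stopgain"]
print(matching_positions(damaging))
```

Because every variant has exactly one value in this toy field, each position is set in exactly one bitmap, as in the slide's table.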
  36. Bitmap indices of genotype metadata (depth) Ongoing: how to optimize

    lossiness of quantization?
  37. Bitmap indices of genotype metadata (depth)

    Genotype depth bitmap:
    0       1 0 0 0 0 0 0 … 0
    1       0 1 0 0 0 0 0 … 0
    2       0 0 0 0 0 0 1 … 0
    3       0 0 0 1 0 0 0 … 0
    10      0 0 0 0 0 0 0 … 1
    20      0 0 0 0 0 1 0 … 0
    25-30   0 0 0 0 1 0 0 … 0
    >30     0 0 1 0 0 0 0 … 0

    Ongoing: how to optimize lossiness of quantization?
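Quantizing a numeric field such as depth means one bitmap per bin rather than one per value, which is where the lossiness comes from. The sketch below assumes a set of bin lower bounds chosen for illustration; the slide does not give the actual bin boundaries, and the function names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical lower bounds of the depth bins (the real bins are not given).
BIN_STARTS = [0, 1, 2, 3, 10, 20, 25, 31]


def depth_bin(depth):
    """Index of the bin whose lower bound is the largest one <= depth."""
    return bisect_right(BIN_STARTS, depth) - 1


def build_depth_bitmaps(depths):
    """One int bitmap per bin; bit i is set in the bin that holds depths[i]."""
    bitmaps = [0] * len(BIN_STARTS)
    for i, depth in enumerate(depths):
        bitmaps[depth_bin(depth)] |= 1 << i
    return bitmaps


def at_least(bitmaps, min_depth):
    """OR together every bin from min_depth's bin upward.

    Lossy: a threshold that falls inside a bin also returns the lower
    depths that share that bin.
    """
    result = 0
    for bitmap in bitmaps[depth_bin(min_depth):]:
        result |= bitmap
    return result


depths = [0, 1, 35, 3, 27, 20, 2, 10]
bitmaps = build_depth_bitmaps(depths)
print(bin(at_least(bitmaps, 10)))  # exact: 10 is a bin boundary
print(bin(at_least(bitmaps, 5)))   # lossy: also selects the sample with depth 3
```

A threshold of 5 falls inside the assumed 3 to 9 bin, so the sample with depth 3 is returned as well; choosing bin boundaries is a trade-off between index size and this kind of false positive.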
  38. Future: A Genome Query Language?

    Variant-centric (bcftools, BGT) + individual-centric (GQT, BGT) = General Genome Query Language (based on discussions w/ Heng Li).
    [Figure: GQT workflow; labels: VCF, PED, gqt convert vcf, gqt convert ped, GQT index, SQL database, gqt query with -p and -g.]
    Find variants that are common in cases and rare in controls:
    gqt query study.gqt study.db -p "phenotype == 2" -g "maf() > 0.05" -p "phenotype == 1" -g "maf() < 0.05"
  39. Future: A Genome Query Language?

    Variant-centric (bcftools, BGT) + individual-centric (GQT, BGT) = General Genome Query Language (based on discussions w/ Heng Li).
    [Figure: GQT workflow; labels: VCF, PED, gqt convert vcf, gqt convert ped, GQT index, SQL database, gqt query with -p and -g.]
    Find variants that are common in cases and rare in controls:
    gqt query study.gqt study.db -p "phenotype == 2" -g "maf() > 0.05" -p "phenotype == 1" -g "maf() < 0.05"

    Proposed query syntax:
        SELECT *
          VARIANT gene="TP53" AND impact="HIGH"
          SAMPLE affected IS (ancestry="EA" AND phenotype=2 AND BMI>35)
          GENOTYPE affected.MAF()>0.05
  40. Future: A Genome Query Language? (Repeats slide 39.)

  41. Future: A Genome Query Language? (Repeats slide 39.)

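The example query on these slides, "common in cases and rare in controls", combines a sample filter (phenotype) with a genotype summary (minor allele frequency) per group. A plain-Python sketch of that logic follows, assuming genotypes are stored as alternate-allele counts and phenotypes follow the PED convention (2 = affected, 1 = unaffected). It does not reproduce how gqt evaluates maf(), and the function names are hypothetical.

```python
def maf(genotypes):
    """Minor allele frequency from diploid alt-allele counts (0, 1, 2); None = missing."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def common_in_cases_rare_in_controls(matrix, phenotypes, threshold=0.05):
    """Variant indices with MAF > threshold in cases and MAF < threshold in controls.

    matrix[v][s] is the alt-allele count of sample s at variant v;
    phenotypes[s] is 2 for cases and 1 for controls.
    """
    cases = [s for s, p in enumerate(phenotypes) if p == 2]
    controls = [s for s, p in enumerate(phenotypes) if p == 1]
    hits = []
    for v, row in enumerate(matrix):
        case_maf = maf([row[s] for s in cases])
        control_maf = maf([row[s] for s in controls])
        if case_maf > threshold and control_maf < threshold:
            hits.append(v)
    return hits


phenotypes = [2, 2, 1, 1]
matrix = [
    [1, 1, 0, 0],  # het in both cases, absent in controls
    [0, 0, 0, 0],  # absent everywhere
    [1, 0, 1, 0],  # present in both groups
]
print(common_in_cases_rare_in_controls(matrix, phenotypes))
```

Each -p/-g pair in the gqt command corresponds to one sample subset and one condition on it; the loop above makes that pairing explicit.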
  42. Thank you! Funding: (not transcribed). Brent Pedersen, Ryan Layer, Jim Havrilla.

  43. Students and Postdocs wanted. This could be you. Note: this

    is not me. aaronquinlan@gmail.com