Making queries of the genome less difficult.

Aaron Quinlan
October 31, 2015

Presented on 30-Oct-2015 in the "Sequencing pipelines and assembly" session at the CSHL Genome Informatics Meeting.

Transcript

  1. Aaron Quinlan
    University of Utah
    quinlanlab.org
    @aaronquinlan
    Making queries of the genome less difficult.

  2. Genetic variation
    ...CCTCATGCATGGAAA...
    ...CCTCATGTATGGAAA...
    ...CCTCATGCATGGAAA...
    ...CCTCATGCATGGAAA...
    ...CCTCATGTATGGAAA...
    ...CCTCATGCATGGAAA...
    ...CCTCATGTATGGAAA...
    Variant prioritization requires context.

  3. Genetic variation
    ...CCTCATGCATGGAAA...
    ...CCTCATGTATGGAAA...
    ...CCTCATGCATGGAAA...
    ...CCTCATGCATGGAAA...
    ...CCTCATGTATGGAAA...
    ...CCTCATGCATGGAAA...
    ...CCTCATGTATGGAAA...
    Chromatin marks
    DNA methylation
    RNA expression
    TF binding
    Variant prioritization requires context.

  4. • inconsistent chromosome labels.
    • different sorting criteria.
    • mixed UNIX/Windows newlines.
    • file violates spec with vigor.
    • program expects exact extension.
    • file is gzipp’ed, not bgzipp’ed.
    • annotations use diff. genome builds.
    • tool only works for one format.
    • tool is hard-coded for specific build.
    • tool requires act of gods to compile.
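
Two of the pitfalls listed above (gzip vs. bgzip, inconsistent chromosome labels) can be checked programmatically. The following is a minimal sketch, not part of the talk: `is_bgzf` and `normalize_chrom` are hypothetical helper names, and the "chr17" to "17" / "chrM" to "MT" mapping is just one assumed convention.

```python
import gzip
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if the bytes start with a BGZF (bgzip) block header."""
    if len(header) < 18 or header[:3] != b"\x1f\x8b\x08":
        return False  # not a gzip/deflate stream at all
    if not header[3] & 0x04:
        return False  # FLG.FEXTRA is not set, so this is plain gzip
    (xlen,) = struct.unpack("<H", header[10:12])
    extra = header[12:12 + xlen]
    pos = 0
    # Walk the gzip extra subfields looking for 'B','C' with a 2-byte payload.
    while pos + 4 <= len(extra):
        si1, si2 = extra[pos], extra[pos + 1]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if si1 == ord("B") and si2 == ord("C") and slen == 2:
            return True
        pos += 4 + slen
    return False


def normalize_chrom(label: str) -> str:
    """Strip a leading 'chr' and map 'M' to 'MT' (assumed convention)."""
    name = label[3:] if label.lower().startswith("chr") else label
    return "MT" if name == "M" else name


# Plain gzip output, as produced by `gzip` rather than `bgzip`.
plain_gzip = gzip.compress(b"##fileformat=VCFv4.2\n")

# The 28-byte empty BGZF block used as an end-of-file marker.
bgzf_eof = bytes([
    0x1F, 0x8B, 0x08, 0x04, 0, 0, 0, 0, 0, 0xFF, 6, 0,
    0x42, 0x43, 2, 0, 0x1B, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0,
])

print(is_bgzf(plain_gzip), is_bgzf(bgzf_eof))
print(normalize_chrom("chr17"), normalize_chrom("chrM"))
```

A check like this only inspects the first block; it does not validate the rest of the file or the presence of an index.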

  5. vcfanno will annotate your VCF with panache.
    Naked VCF -> vcfanno + configuration file -> VCF w/ annotations in INFO field
    VCF, BED, GFF, BAM, (soon BW)
    Brent Pedersen
    https://github.com/brentp/vcfanno

  6. vcfanno configuration file.
    [[annotation]]
    file="ExAC.v3.vcf"
    fields=["AF", "AC_Het"]
    names=["exac_aaf", "exac_num_het"]
    ops=["first", "first"]
    [[annotation]]
    file="dbsnp.b141.vcf.gz"
    fields=["ID"]
    names=["rs_ids"]
    ops=["concat"]
    [[annotation]]
    file="gerp.elements.bed.gz"
    columns=[4,4]
    names=["gerp_mean","gerp_var"]
    ops=["mean", "js:variance(vals)"]

  7. vcfanno configuration file (as on slide 6).
    Match on POS+REF+ALT for VCF annotations.
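
Matching on POS+REF+ALT can be pictured as an exact-key lookup. This is a minimal sketch under assumptions, not vcfanno's implementation: the record layout, the example values, and the helper names `index_annotations` and `lookup` are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

Key = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def index_annotations(records: Iterable[dict]) -> Dict[Key, dict]:
    """Index annotation records by chromosome, position, REF and ALT."""
    return {(r["chrom"], r["pos"], r["ref"], r["alt"]): r["info"] for r in records}


def lookup(index: Dict[Key, dict], chrom: str, pos: int, ref: str, alt: str) -> Optional[dict]:
    """Return the annotation only when position AND alleles agree."""
    return index.get((chrom, pos, ref, alt))


# Made-up annotation records for illustration only.
annotation_records = [
    {"chrom": "17", "pos": 43045000, "ref": "C", "alt": "T", "info": {"AF": 0.0012}},
    {"chrom": "17", "pos": 43045000, "ref": "C", "alt": "G", "info": {"AF": 0.2}},
]
index = index_annotations(annotation_records)

same_allele = lookup(index, "17", 43045000, "C", "T")
other_allele = lookup(index, "17", 43045000, "C", "A")
print(same_allele, other_allele)
```

The second lookup returns nothing even though the position matches, which is the point of including REF and ALT in the key.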

  8. vcfanno configuration file (as on slide 6).
    Allows multiple annotations from each file
    Match on POS+REF+ALT for VCF annotations.

  9. vcfanno configuration file (as on slide 6).
    Allows multiple annotations from each file
    Can rename the annotations in the resulting VCF
    Match on POS+REF+ALT for VCF annotations.

  10. vcfanno configuration file (as on slide 6).
    Allows multiple annotations from each file
    Can rename the annotations in the resulting VCF
    Multiple operations to summarize the results of multiple hits in annot. file:
    mean, max, min, concat, count, uniq, first, flag
    Match on POS+REF+ALT for VCF annotations.

  11. vcfanno configuration file (as on slide 6).
    Allows multiple annotations from each file
    Can rename the annotations in the resulting VCF
    Multiple operations to summarize the results of multiple hits in annot. file:
    mean, max, min, concat, count, uniq, first, flag
    Match on POS+REF+ALT for VCF annotations.
    JavaScript for custom computations.
    variance() defined in custom.js
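
The summarizing operations named on the slide can be sketched as small reducers over the list of values that overlap one variant. This is an illustration only, not vcfanno's code: the `REDUCERS` table and `summarize` are hypothetical, "variance" stands in for the custom JavaScript function, and using the population variance is an assumption since the contents of custom.js are not shown.

```python
from statistics import mean, pvariance

# One reducer per operation name; each takes the list of overlapping values.
REDUCERS = {
    "mean": mean,
    "max": max,
    "min": min,
    "concat": lambda vals: ",".join(str(v) for v in vals),
    "count": len,
    "uniq": lambda vals: ",".join(sorted({str(v) for v in vals})),
    "first": lambda vals: vals[0],
    "flag": lambda vals: len(vals) > 0,
    "variance": pvariance,  # stand-in for js:variance(vals); assumed population variance
}


def summarize(names, ops, hits):
    """Apply each op to the same hits and label the result with its name."""
    return {name: REDUCERS[op](hits) for name, op in zip(names, ops)}


# Two hypothetical conservation scores overlapping one variant.
hits = [1.0, 3.0]
summary = summarize(["gerp_mean", "gerp_var", "n_hits"], ["mean", "variance", "count"], hits)
print(summary)
```

Pairing `names` with `ops` mirrors how the configuration file lists one name and one operation per annotation.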

  12. before and after vcfanno
    Naked VCF:
    AC=11;AF=0.017
    Dressed VCF (after vcfanno, using the configuration file from slide 6):
    AC=11;AF=0.017;exac_aaf=0.0012;exac_num_het=8;rs_ids=1234;gerp_mean=7.25e-07;gerp_var=1.39e-08

  13. New parallel “chromsweep”. vcfanno is speedy.
    18 annotations:
    29K variants / sec @ 12 cores
    See poster 160 for details
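
For readers unfamiliar with the sweep idea, here is a minimal serial sketch of a sweep over two start-sorted interval lists on one chromosome. It is not vcfanno's parallel implementation; `sweep_overlaps` is a hypothetical name and half-open [start, end) coordinates are an assumption.

```python
def sweep_overlaps(queries, database):
    """Yield (query, overlapping database intervals).

    Both inputs must be sorted by start and lie on the same chromosome.
    Intervals are (start, end) tuples, half-open.
    """
    cache = []  # database intervals that may still overlap upcoming queries
    db_iter = iter(database)
    pending = next(db_iter, None)
    for q_start, q_end in queries:
        # Intervals ending at or before this query's start can never match again.
        cache = [iv for iv in cache if iv[1] > q_start]
        # Pull in database intervals that start before this query ends.
        while pending is not None and pending[0] < q_end:
            if pending[1] > q_start:
                cache.append(pending)
            pending = next(db_iter, None)
        hits = [iv for iv in cache if iv[0] < q_end and iv[1] > q_start]
        yield (q_start, q_end), hits


queries = [(5, 10), (8, 9), (20, 25)]
database = [(1, 6), (7, 12), (9, 30), (40, 50)]
results = list(sweep_overlaps(queries, database))
for query, hits in results:
    print(query, hits)
```

Each database interval is read once, which is why sorted inputs matter so much for this style of annotation.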

  14. Individual-centric queries with
    Genotype Query Tools (GQT)
    github.com/ryanlayer/gqt
    In press.
    Ryan Layer
    http://biorxiv.org/content/early/2015/04/20/018259

  15. A variant-centric query:
    Which variants affect BRCA1?

  16. (no transcribed text)

  17. bcftools view \
    -r 17:43044295-43125483 \
    1000g.vcf
    OR
    tabix 1000g.vcf 17:43044295-43125483
    Existing tools handle variant-centric queries well
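
What a variant-centric region query selects can be shown with a plain scan over VCF text lines. This sketch is for illustration only: it is a linear scan with no index (unlike tabix), it compares only the POS column, and `variants_in_region` and the example records are hypothetical.

```python
def variants_in_region(vcf_lines, chrom, start, end):
    """Yield the fields of records whose POS lies in [start, end] (1-based, inclusive)."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            yield fields


# Made-up records around the region used on the slide.
vcf_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "17\t43044294\t.\tA\tG\n",
    "17\t43044295\t.\tC\tT\n",
    "17\t43125483\t.\tG\tA\n",
    "13\t43050000\t.\tT\tC\n",
]
hits = [int(f[1]) for f in variants_in_region(vcf_lines, "17", 43044295, 43125483)]
print(hits)
```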

  18. An individual-centric query:
    In which variants are all affected
    males heterozygous?

  19. (no transcribed text)

  20. In which variants are all affected males heterozygous?

  21. In which variants are all affected males heterozygous?

  22. Idea: transpose the genotype matrix
    G -> G^T
    Note: other tricks included for speed/compression,
    please see manuscript

  23. In which variants are all affected males heterozygous?

  24. Affected males
    In which variants are all affected males heterozygous?

  25. (no transcribed text)

  26. (no transcribed text)

  27. (no transcribed text)

  28. (no transcribed text)

  29. (no transcribed text)

  30. (no transcribed text)

  31. In which variants are all affected males heterozygous?

  32. In which variants are all affected males heterozygous?

  33. (no transcribed text)

  34. Great, but what about
    indexing variant and
    genotype metadata?

  35. Bitmap indices of variant metadata (VEP consequence)
    VEP consequence bitmap (one row per consequence):
    synon.    1 0 0 0 0 0 0 … 0
    missense  0 0 0 0 0 0 1 … 0
    stopgain  0 0 0 0 0 0 0 … 1
    splice    0 0 0 0 1 0 0 … 0
    . . .
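
A bitmap index over a categorical field can be sketched as one integer per category, with bit i set when variant i carries that category. This is an illustration under assumptions, not GQT's on-disk format; `build_bitmap_index`, `matching_variants`, and the example labels per variant are hypothetical.

```python
from collections import defaultdict


def build_bitmap_index(labels):
    """Return {label: int bitmap}; bit i is set if variant i has that label."""
    index = defaultdict(int)
    for variant_idx, label in enumerate(labels):
        index[label] |= 1 << variant_idx
    return index


def matching_variants(index, wanted):
    """OR the bitmaps of the wanted labels and list the variant indices that are set."""
    bits = 0
    for label in wanted:
        bits |= index.get(label, 0)
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# One made-up consequence per variant.
consequences = ["synon.", "missense", "splice", "synon.", "stopgain", "missense"]
index = build_bitmap_index(consequences)

damaging = matching_variants(index, ["stopgain", "splice"])
print(damaging)
```

Combining categories is a single OR, and intersecting with another index (for example a genotype bitmap) would be a single AND.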

  36. Bitmap indices of genotype metadata (depth)
    Ongoing: how to optimize lossiness of quantization?

  37. Bitmap indices of genotype metadata (depth)
    Genotype depth bitmap (one row per depth bin):
    0      1 0 0 0 0 0 0 … 0
    1      0 1 0 0 0 0 0 … 0
    2      0 0 0 0 0 0 1 … 0
    3      0 0 0 1 0 0 0 … 0
    10     0 0 0 0 0 0 0 … 1
    20     0 0 0 0 0 1 0 … 0
    25-30  0 0 0 0 1 0 0 … 0
    >30    0 0 1 0 0 0 0 … 0
    Ongoing: how to optimize lossiness of quantization?
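
The lossiness question can be made concrete with a small sketch: depths are quantized into bins, one bitmap is kept per bin, and a threshold query ORs the bins at or above the threshold. The bin edges below are hypothetical (they are not the bins used by GQT), as are the function names and example depths.

```python
from bisect import bisect_right

# Hypothetical lower edges: bins are 0, 1, 2, 3, 4-9, 10-19, 20-30, >30.
BIN_EDGES = [0, 1, 2, 3, 4, 10, 20, 31]


def bin_of(depth):
    """Index of the bin whose range contains this depth."""
    return bisect_right(BIN_EDGES, depth) - 1


def build_depth_bitmaps(depths):
    """One int bitmap per bin; bit i is set if genotype i falls in that bin."""
    bitmaps = [0] * len(BIN_EDGES)
    for idx, depth in enumerate(depths):
        bitmaps[bin_of(depth)] |= 1 << idx
    return bitmaps


def candidates_at_least(bitmaps, min_depth):
    """OR all bins from the one containing min_depth upward (may over-report)."""
    bits = 0
    for b in range(bin_of(min_depth), len(bitmaps)):
        bits |= bitmaps[b]
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


depths = [0, 3, 12, 18, 25, 40]
bitmaps = build_depth_bitmaps(depths)

on_edge = candidates_at_least(bitmaps, 10)   # threshold sits on a bin edge
off_edge = candidates_at_least(bitmaps, 15)  # threshold falls inside the 10-19 bin
exact = [i for i, d in enumerate(depths) if d >= 15]
print(on_edge, off_edge, exact)
```

When the threshold falls inside a bin, the index returns a superset of the exact answer (here the genotype with depth 12), so finer bins cost space while coarser bins cost precision.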

  38. Future: A Genome Query Language?
    Variant-centric (bcftools, BGT) + Individual-centric (GQT, BGT) = General Genome Query Language
    (based on discussions w/ Heng Li)
    [Figure: GQT workflow panels (VCF, PED, "gqt convert vcf", "gqt convert ped", GQT index, SQL database; variants x individuals matrices)]
    Find variants that are common in cases and rare in controls:
    gqt query study.gqt study.db
    -p "phenotype == 2"
    -g "maf() > 0.05"
    -p "phenotype == 1"
    -g "maf() < 0.05"
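
The logic of the example query (common in cases, rare in controls) can be sketched in a few lines. This is an illustration of the filter only, not of how gqt evaluates it: the `maf` definition (minor allele frequency from alternate-allele counts), the toy genotypes, and the helper names are assumptions; phenotype 2 = affected and 1 = unaffected follows the usual PED convention.

```python
def maf(alt_counts):
    """Minor allele frequency from per-individual alternate allele counts (0, 1 or 2)."""
    if not alt_counts:
        return 0.0
    af = sum(alt_counts) / (2 * len(alt_counts))
    return min(af, 1 - af)


def common_in_cases_rare_in_controls(genotypes, phenotypes, threshold=0.05):
    """Return indices of variants with MAF > threshold in cases and < threshold in controls."""
    cases = [i for i, p in enumerate(phenotypes) if p == 2]
    controls = [i for i, p in enumerate(phenotypes) if p == 1]
    selected = []
    for v_idx, row in enumerate(genotypes):
        case_maf = maf([row[i] for i in cases])
        control_maf = maf([row[i] for i in controls])
        if case_maf > threshold and control_maf < threshold:
            selected.append(v_idx)
    return selected


phenotypes = [2, 2, 2, 1, 1, 1]  # three cases, three controls
genotypes = [                    # variants x individuals
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 0, 1, 1, 0],
]
selected = common_in_cases_rare_in_controls(genotypes, phenotypes)
print(selected)
```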

  39. Future: A Genome Query Language?
    Variant-centric (bcftools, BGT) + Individual-centric (GQT, BGT) = General Genome Query Language
    (based on discussions w/ Heng Li)
    [Figure: GQT workflow panels, as on slide 38]
    SELECT *
       VARIANT gene="TP53" AND impact="HIGH"
       SAMPLE affected IS (ancestry="EA"
                           AND phenotype=2
                           AND BMI>35)
       GENOTYPE affected.MAF()>0.05
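
The proposed query has three parts: a variant filter, a named sample set, and a genotype condition over that set. A rough sketch of those semantics follows. The query language is only a proposal on the slide, so everything here is hypothetical: the data layout, the example values, and `run_query` are illustrations, not an existing API.

```python
variants = [
    {"id": "v0", "gene": "TP53", "impact": "HIGH"},
    {"id": "v1", "gene": "TP53", "impact": "LOW"},
    {"id": "v2", "gene": "BRCA1", "impact": "HIGH"},
]
samples = [
    {"ancestry": "EA", "phenotype": 2, "BMI": 38},
    {"ancestry": "EA", "phenotype": 2, "BMI": 36},
    {"ancestry": "EA", "phenotype": 1, "BMI": 40},
    {"ancestry": "AFR", "phenotype": 2, "BMI": 37},
]
genotypes = [  # variants x samples, alternate allele counts
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
]


def maf(alt_counts):
    """Minor allele frequency from alternate allele counts (0, 1 or 2)."""
    if not alt_counts:
        return 0.0
    af = sum(alt_counts) / (2 * len(alt_counts))
    return min(af, 1 - af)


def run_query(variants, samples, genotypes):
    # SAMPLE clause: define the set called "affected".
    affected = [
        i for i, s in enumerate(samples)
        if s["ancestry"] == "EA" and s["phenotype"] == 2 and s["BMI"] > 35
    ]
    hits = []
    for v_idx, variant in enumerate(variants):
        # VARIANT clause: filter on variant-level metadata.
        if variant["gene"] != "TP53" or variant["impact"] != "HIGH":
            continue
        # GENOTYPE clause: condition on genotypes within the sample set.
        if maf([genotypes[v_idx][i] for i in affected]) > 0.05:
            hits.append(variant["id"])
    return hits


result = run_query(variants, samples, genotypes)
print(result)
```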

  40. Future: A Genome Query Language? (same content as slide 39)

  41. Future: A Genome Query Language? (same content as slide 39)

  42. Thank you!
    Funding:
    Brent Pedersen
    Ryan Layer
    Jim Havrilla

  43. Students and Postdocs wanted.
    This could be you.
    Note: this is not me.
    [email protected]
