Lecture 25: The VCF Format

Istvan Albert
November 01, 2017

The variant call format

Transcript

  1. Reminder: Holistic Data Analysis

    As processes get complicated, put together EVERY STEP of the analysis BEFORE fine-tuning each. Try to imagine what the result needs to look like and work towards that goal. Think of an artist drawing a portrait. It is a successive refinement of the full image.
  2. The more complicated the problem, the more important it is to simplify the "surrounding" decisions

    Make your data smaller. Don't get hung up on details. Keep forging ahead. You have to see the "END" to make a good decision at the "BEGINNING."
  3. VCF: The Variant Call Format

    Another day, another data format. Invented at the Broad Institute for the 1000 Genomes Project. An attempt to "properly" describe differences:

        ATGC
        AAGC

    We could report this as:

        POS REF ALT
        2   T   A

    "At position 2 base T changed to A"
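
The POS/REF/ALT report above can be sketched in a few lines of Python. This is only an illustration for two sequences of equal length; the function name `simple_substitutions` is made up for this example and is not part of any VCF tool.

```python
def simple_substitutions(ref, alt):
    """Report single-base differences between two equal-length sequences
    as (POS, REF, ALT) tuples, with POS counted from 1 as in VCF."""
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i + 1, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


print(simple_substitutions("ATGC", "AAGC"))  # [(2, 'T', 'A')]
```

Insertions and deletions change the sequence length, so they need an alignment first; that is where the representation starts to get complicated.
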
  4. The primary information fields in a VCF

    Position, Reference, Allele(s).

    Works well for simple variants. Gets unexpectedly complicated when the variants are not so simple. Multiple ways (variants) could produce the same outcome. The variants could "look" different whereas the results could be the same.
  5. The VCF Variant Representation 1

    SNPs:

        ACGT    POS REF ALT
        ATGT    2   C   T

    Deletion (includes the first unchanged base):

        ACGT    POS REF ALT
        A--T    1   ACG A

    Insertion (includes the first unchanged base):

        AC-GT   POS REF ALT
        ACTGT   2   C   CT
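
One way to check a POS/REF/ALT triple is to apply it to the reference and look at the result. The sketch below does that for the three examples on this slide; `apply_variant` is a hypothetical helper written for this illustration, and it assumes a single variant on a short in-memory string.

```python
def apply_variant(reference, pos, ref, alt):
    """Apply one VCF-style variant to a reference string (POS is 1-based)."""
    start = pos - 1
    # The REF allele must match the reference at POS, otherwise the record is wrong.
    if reference[start:start + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference at POS")
    return reference[:start] + alt + reference[start + len(ref):]


reference = "ACGT"
print(apply_variant(reference, 2, "C", "T"))    # SNP: ATGT
print(apply_variant(reference, 1, "ACG", "A"))  # deletion: AT
print(apply_variant(reference, 2, "C", "CT"))   # insertion: ACTGT
```
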
  6. The VCF Variant Representation 2

    Complex events:

        ACGT    POS REF ALT
        A-TT    1   ACG AT

    Large structural variants (more fields are needed):

        POS REF ALT   INFO
        100 T   <DEL> SVTYPE=DEL;END=300

    See link to the VCF Poster in the Handbook.
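
The INFO column is a semicolon-separated list of `key=value` pairs, so the structural variant above can be unpacked with plain string handling. A minimal sketch, assuming the INFO string from the slide; `parse_info` is an illustrative name, not a library function.

```python
def parse_info(info):
    """Turn an INFO string such as 'SVTYPE=DEL;END=300' into a dict.
    Keys without '=' are flags and are stored as True."""
    fields = {}
    if info in ("", "."):
        return fields
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


info = parse_info("SVTYPE=DEL;END=300")
pos = 100
# Distance between POS and END for the <DEL> record on the slide.
print(info["SVTYPE"], int(info["END"]) - pos)  # DEL 200
```

All values stay strings here; converting them to numbers depends on the types declared in the header.
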
  7. VCF carries more information

    Because it is challenging to call variants correctly, the VCF stuffs tons of other information into the file. Meant to describe the variant calling decisions. Meant to help you decide which variant to "trust." You could select only variants with a coverage above a threshold. Or you could choose just variants that are homozygous. ... etc ... you could have an entire wishlist ...
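
The two example filters (coverage above a threshold, homozygous only) could be sketched as below. The records are invented for this illustration, the function name `passes` and the default threshold of 10 are arbitrary choices, and the sketch assumes a single-sample, tab-separated body line whose FORMAT includes both GT and DP with no missing values.

```python
def passes(line, min_depth=10, homozygous_only=False):
    """Decide whether a single-sample VCF body line passes two example filters."""
    columns = line.rstrip("\n").split("\t")
    keys = columns[8].split(":")                      # FORMAT column, e.g. GT:DP
    values = dict(zip(keys, columns[9].split(":")))   # first sample column
    depth = int(values["DP"])
    alleles = values["GT"].replace("|", "/").split("/")
    is_homozygous = len(set(alleles)) == 1
    return depth >= min_depth and (is_homozygous or not homozygous_only)


# Hypothetical records, made up for this example.
lines = [
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT:DP\t1/1:25",
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT:DP\t0/1:7",
    "1\t300\t.\tG\tA\t.\tPASS\t.\tGT:DP\t0/1:30",
]
print([passes(l) for l in lines])                        # [True, False, True]
print([passes(l, homozygous_only=True) for l in lines])  # [True, False, False]
```

In practice you would reach for an existing tool for this kind of filtering; the point here is only to show what such a filter looks at.
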
  8. The VCF definition

    Header and body.

    Header: Describes the format of data.
    Body: Contains nine columns that describe each variant.

        CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

    Then additional columns are added for each sample.
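
A rough sketch of how the header/body split looks to a program: lines starting with `##` are header metadata, the single `#CHROM` line names the columns, and everything else is a record. The tiny VCF text below is assembled for this example (the `##fileformat` line and `Number=1` are assumptions added to make it look like a typical file), and `split_vcf` is a hypothetical helper.

```python
COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def split_vcf(text):
    """Separate '##' header lines from body records of a tab-separated VCF."""
    header, records, sample_names = [], [], []
    for line in text.splitlines():
        if line.startswith("##"):
            header.append(line)
        elif line.startswith("#"):
            # Column header line: names past the 9th column are samples.
            sample_names = line.lstrip("#").split("\t")[9:]
        elif line:
            values = line.split("\t")
            record = dict(zip(COLUMNS, values[:9]))
            record["SAMPLES"] = dict(zip(sample_names, values[9:]))
            records.append(record)
    return header, records


text = "\n".join([
    "##fileformat=VCFv4.2",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\tSAMPLE2",
    "1\t1\t.\tACG\tA,AT\t.\tPASS\t.\tGT:DP\t1/2:13\t0/0:29",
])
header, records = split_vcf(text)
print(len(header), records[0]["REF"], records[0]["SAMPLES"])
```
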
  9. The VCF "FORMAT"

    Header:

        ##FORMAT=<ID=GT,Type=String,Description="Genotype">
        ##FORMAT=<ID=DP,Type=Integer,Description="Read Depth">

    GT is for RR, RA, AA genotypes (R=ref, A=alt). Then in the body, you could have:

        CHROM POS ID REF ALT  QUAL FILTER INFO FORMAT
        1     1   .  ACG A,AT .    PASS   .    GT:DP
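
A structured header line of this shape can be pulled apart with a small regular expression. This is a sketch rather than a full parser: `parse_meta_line` is a name invented here, and it assumes descriptions are double-quoted and contain no escaped quotes.

```python
import re


def parse_meta_line(line):
    """Split a line like ##FORMAT=<ID=DP,Type=Integer,Description="Read Depth">
    into its kind ('FORMAT') and a dict of its key/value pairs."""
    prefix, _, body = line.partition("=<")
    if not body.endswith(">"):
        raise ValueError("not a structured meta line")
    # Quoted values may contain commas, so try the quoted form first.
    pairs = re.findall(r'(\w+)=("[^"]*"|[^,]*)', body[:-1])
    return prefix.lstrip("#"), {key: value.strip('"') for key, value in pairs}


kind, fields = parse_meta_line('##FORMAT=<ID=DP,Type=Integer,Description="Read Depth">')
print(kind, fields)
```
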
  10. Info for the samples in columns past the 9th

    Each line is a single variant that is summarized across all samples:

        CHROM POS ID REF ALT  ... FORMAT SAMPLE1 SAMPLE2 ...
        1     1   .  ACG A,AT ... GT:DP  1/2:13  0/0:29  ...

    The SAMPLE matches the FORMAT. The GT matches the REF/ALT. Practice reading this out. It can be unexpectedly challenging even if you understand the format.
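
"The SAMPLE matches the FORMAT" means the colon-separated values in each sample column line up, position by position, with the keys in the FORMAT column. A minimal sketch using the values from the slide (`parse_samples` is an illustrative name):

```python
def parse_samples(format_field, sample_fields):
    """Pair each sample's colon-separated values with the FORMAT keys."""
    keys = format_field.split(":")
    return [dict(zip(keys, sample.split(":"))) for sample in sample_fields]


print(parse_samples("GT:DP", ["1/2:13", "0/0:29"]))
# [{'GT': '1/2', 'DP': '13'}, {'GT': '0/0', 'DP': '29'}]
```
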
  11. Is the VCF challenging to understand?

    True story: the creators of VCF misinterpreted the format in the first tutorial that they wrote. The (subtle) mistake went unnoticed for years!
  12. The story embedded in a single VCF line

        CHROM POS ID REF ALT  ... FORMAT SAMPLE1 SAMPLE2 ...
        1     1   .  ACG A,AT ... GT:DP  1/2:13  0/0:29  ...

    On chromosome 1 at position 1 we may observe two alleles: ACG -> A,AT. Either a deletion of CG, or a deletion of C followed by a mismatch of G to T.

    SAMPLE 1 is heterozygous 1/2, where one allele is the CG deletion and the other allele is the second variant explained above. 13 reads cover this position.

    SAMPLE 2 is homozygous 0/0 and matches the reference. 29 reads cover the position.
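
The reading above follows one rule: in GT, index 0 is REF and indices 1, 2, ... are the comma-separated ALT alleles in order. A sketch of that lookup, with `describe_genotype` as a made-up helper name:

```python
import re


def describe_genotype(ref, alt, genotype):
    """Translate a GT value such as '1/2' into the alleles it refers to."""
    alleles = [ref] + alt.split(",")       # index 0 = REF, 1.. = ALT alleles
    indices = re.split(r"[/|]", genotype)  # '/' unphased, '|' phased
    if "." in indices:
        return None, "missing"
    called = [alleles[int(i)] for i in indices]
    if len(set(called)) > 1:
        kind = "heterozygous"
    elif called[0] == ref:
        kind = "homozygous reference"
    else:
        kind = "homozygous alternate"
    return called, kind


print(describe_genotype("ACG", "A,AT", "1/2"))  # (['A', 'AT'], 'heterozygous')
print(describe_genotype("ACG", "A,AT", "0/0"))  # (['ACG', 'ACG'], 'homozygous reference')
```
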
  13. Do I have to understand the internals of VCF?

    Many times no, sometimes yes. If you are processing variants by the thousands or more, then you'd be using tools and would never read the VCF by eye. But when it is essential to know if one specific variant is indeed accurate and valid, you probably would need to understand the details of the VCF.
  14. A Personal Opinion

    I think the entire field is going about this the wrong way. We should be comparing not variants but final "products." Today we look up one variant against another. The variant comparison works in simple cases. What about the rest? Two sequences are identical not when they "vary" the same way, but when they are the same. I admit it is easy to say the approach is flawed. It is less clear how to solve it.
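
To make the "compare products, not variants" idea concrete, here is a toy sketch: apply each set of variants to the reference and compare the resulting sequences. It is not a proposal for how real tools should work. The helper names are invented, the reference is a four-base string, and variants are assumed not to overlap.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (POS, REF, ALT) variants; POS is 1-based."""
    sequence = reference
    # Work from right to left so earlier coordinates stay valid.
    for pos, ref, alt in sorted(variants, reverse=True):
        start = pos - 1
        if sequence[start:start + len(ref)] != ref:
            raise ValueError(f"REF {ref!r} does not match at position {pos}")
        sequence = sequence[:start] + alt + sequence[start + len(ref):]
    return sequence


def same_product(reference, variants_a, variants_b):
    """Two variant sets are equivalent if they yield the same final sequence."""
    return apply_variants(reference, variants_a) == apply_variants(reference, variants_b)


reference = "ACGT"
one_complex_event = [(1, "ACG", "AT")]
deletion_plus_snp = [(1, "AC", "A"), (3, "G", "T")]
print(apply_variants(reference, one_complex_event))                    # ATT
print(same_product(reference, one_complex_event, deletion_plus_snp))   # True
```

The two descriptions look different as variants but produce the same sequence, which is the situation the slide is pointing at.
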
  15. Concluding thoughts

    VCF, like BAM, is also a product of premature optimization. Not surprisingly, invented at the same place... It was created too soon, and for specific use cases. Adopted too widely. You may need a software tool to figure out if two variants are the same or not! You need to understand VCF because the vast majority of information related to human variation is locked away in the VCF format.