EVERY STEP of the analysis BEFORE fine-tuning each. Try to imagine what the result needs to look like and work towards that goal. Think of an artist drawing a portrait. It is a successive refinement of the full image.
the "surrounding" decisions. Make your data smaller. Don't get hung up on details. Keep forging ahead. You have to see the "END" to make a good decision at the "BEGINNING."
Invented at the Broad Institute for the 1000 Genomes Project. An attempt to "properly" describe differences:

    ATGC
    AAGC

We could report this as:

    POS  REF  ALT
    2    T    A

At position 2, base T changed to A.
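A minimal sketch of how the POS/REF/ALT report above could be produced for two sequences of equal length. The function name `point_differences` is made up for this illustration, and the approach only covers simple base substitutions, not insertions or deletions.

```python
def point_differences(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List (POS, REF, ALT) for every mismatching base, using 1-based positions."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles sequences of equal length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# The two sequences from the slide.
differences = point_differences("ATGC", "AAGC")
for pos, ref, alt in differences:
    print(pos, ref, alt)
```

As soon as the sequences differ in length, a base-by-base comparison like this stops working, which is where the representation starts to get complicated.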
Allele(s) Works well for simple variants. Gets unexpectedly complicated when the variants are not so simple. Multiple ways (variants) could produce the same outcome. The variants could "look" different whereas the results could be the same.
    ATGT    2    C    T

Deletion (includes the first unchanged base):

    ACGT    POS  REF  ALT
    A--T    1    ACG  A

Insertion (includes the first unchanged base):

    AC-GT   POS  REF  ALT
    ACTGT   2    C    CT
ALT
    A-TT    1    ACG  AT

Large structural variants (more fields are needed):

    POS  REF  ALT    INFO
    100  T    <DEL>  SVTYPE=DEL;END=300

See link to the VCF Poster in the Handbook.
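One way to check your reading of a POS/REF/ALT triple is to apply it to the reference and look at the resulting sequence. The sketch below assumes the reference is ACGT in every example, and `apply_variant` is a hypothetical helper, not part of any VCF library. It does not handle symbolic alleles such as <DEL>.

```python
def apply_variant(reference: str, pos: int, ref: str, alt: str) -> str:
    """Return the sequence obtained by replacing REF with ALT at 1-based POS."""
    start = pos - 1  # VCF positions are 1-based, Python indexes are 0-based
    if reference[start:start + len(ref)] != ref:
        raise ValueError(f"REF {ref!r} does not match the reference at position {pos}")
    return reference[:start] + alt + reference[start + len(ref):]


reference = "ACGT"  # assumed reference for all four examples
snp = apply_variant(reference, 2, "C", "T")
deletion = apply_variant(reference, 1, "ACG", "A")
insertion = apply_variant(reference, 2, "C", "CT")
mixed = apply_variant(reference, 1, "ACG", "AT")

print(snp, deletion, insertion, mixed)
```

The REF check matters: a record whose REF does not match the reference at POS is describing a different genome build or is simply wrong.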
variants correctly, the VCF stuffs tons of other information into the file. Meant to describe the variant calling decisions. Meant to help you decide which variant to "trust." You could select only variants with a coverage above a threshold. Or you could choose just variants that are homozygous. ... etc ... you could have an entire wishlist ...
of data. Body: Contains nine columns that describe each variant. CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Then additional columns are added for each sample.
##FORMAT=<ID=DP,Type=Integer,Description="Read Depth">

Then in the body, you could have:

    CHROM  POS  ID  REF  ALT   QUAL  FILTER  INFO  FORMAT
    1      1    .   ACG  A,AT  .     PASS    .     GT:DP
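A rough sketch of splitting one body line into the nine fixed columns. It assumes tab-separated fields, as in real VCF files (the slide shows spaces for readability). `parse_record` is a name invented for this example. For real work you would use an existing VCF tool rather than hand-rolled parsing.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_record(line: str) -> dict:
    """Split one VCF body line into the fixed columns plus per-sample columns."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, fields[:9]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several ALT alleles are comma-separated
    record["SAMPLES"] = fields[9:]            # anything past column nine is a sample
    return record


# The body line from the slide, written with tabs between the columns.
record = parse_record("1\t1\t.\tACG\tA,AT\t.\tPASS\t.\tGT:DP")
print(record["POS"], record["REF"], record["ALT"], record["FORMAT"])
```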
line is a single variant that is summarized across all samples:

The SAMPLE matches the FORMAT.
The GT matches the REF/ALT.

Practice reading this out. It can be unexpectedly challenging even if you understand the format.

    CHROM  POS  ID  REF  ALT   ...  FORMAT  SAMPLE1  SAMPLE2  ...
    1      1    .   ACG  A,AT  ...  GT:DP   1/2:13   0/0:29   ...
1 at position 1 we may observe two alleles: ACG -> A,AT. Either a deletion of CG, or a deletion of C followed by a mismatch of G to T.

SAMPLE 1 is heterozygous 1/2: one allele is the CG deletion, the other allele is the second variant explained above. 13 reads cover this position.

SAMPLE 2 is homozygous 0/0 and matches the reference. 29 reads cover the position.

    CHROM  POS  ID  REF  ALT   ...  FORMAT  SAMPLE1  SAMPLE2  ...
    1      1    .   ACG  A,AT  ...  GT:DP   1/2:13   0/0:29   ...
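The reading above can be sketched in code: pair the FORMAT keys with the sample values, then use the GT indexes to look up alleles, where 0 is REF and 1, 2, ... are the ALT alleles in order. `decode_sample` is a hypothetical helper written for this example, and it assumes the FORMAT contains both GT and DP.

```python
def decode_sample(ref: str, alts: list[str], fmt: str, sample: str) -> dict:
    """Translate one sample column into called alleles, depth, and zygosity."""
    alleles = [ref] + alts  # index 0 is REF, indexes 1.. are the ALT alleles
    values = dict(zip(fmt.split(":"), sample.split(":")))
    # "/" separates unphased alleles, "|" phased ones; treat both the same here.
    indexes = values["GT"].replace("|", "/").split("/")
    called = [None if i == "." else alleles[int(i)] for i in indexes]
    return {
        "alleles": called,
        "depth": int(values["DP"]),
        "homozygous": len(set(indexes)) == 1,
    }


sample1 = decode_sample("ACG", ["A", "AT"], "GT:DP", "1/2:13")
sample2 = decode_sample("ACG", ["A", "AT"], "GT:DP", "0/0:29")
print(sample1)
print(sample2)
```

With the fields decoded like this, filters such as "coverage above a threshold" or "homozygous only" become simple conditions on `depth` and `homozygous`.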
times no, sometimes yes. If you are processing variants by the thousands or more, then you'd be using tools and would never read the VCF by eye. But when it is essential to know whether one specific variant is indeed accurate and valid, you probably would need to understand the details of the VCF.
about this the wrong way. We should be comparing not variants but final "products." Today we look up one variant against another. The variant comparison works in simple cases. What about the rest? Two sequences are identical not when they "vary" the same way, but when they are the same. I admit it is easy to say the approach is flawed. It is less clear how to solve it.
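A small illustration of comparing "products" rather than variants. The reference GCACAT and both records are made up for this example: deleting one CA unit from a repeat can be written at more than one position, so the two records look different yet yield the same sequence. `apply_variant` and `same_product` are hypothetical helpers, and this is only a sketch of the idea, not a general solution.

```python
def apply_variant(reference: str, pos: int, ref: str, alt: str) -> str:
    """Return the sequence obtained by replacing REF with ALT at 1-based POS."""
    start = pos - 1
    if reference[start:start + len(ref)] != ref:
        raise ValueError(f"REF {ref!r} does not match the reference at position {pos}")
    return reference[:start] + alt + reference[start + len(ref):]


def same_product(reference: str, variant_a: tuple, variant_b: tuple) -> bool:
    """Two variants are equivalent when they produce the same final sequence."""
    return apply_variant(reference, *variant_a) == apply_variant(reference, *variant_b)


reference = "GCACAT"          # hypothetical reference containing a CA repeat
left = (1, "GCA", "G")        # deletion written at the left end of the repeat
right = (3, "ACA", "A")       # the same deletion written further to the right

print(left == right)                          # the records differ
print(same_product(reference, left, right))   # the resulting sequences do not
```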
premature optimization. Not surprisingly, invented at the same place... It was created too soon, and for specific use cases. Adopted too widely. You may need a software tool to figure out if two variants are the same or not! You need to understand VCF because the vast majority of human-variation-related information is locked away in a VCF format.