lab meeting 2014

Why this paper?
- Few efforts have been made to benchmark variant calling algorithms
- For once, the authors explain how the synthetic data is created
- The benchmarking is valid for both alignment and variant calling
- Smart 'niche' targeting from the authors
- Performance (tool robustness) is also important
- The authors are concerned about reproducibility

What is benchmarked and how is it done?
- Datasets
- Algo/tools
- Results

Datasets

Synthetic datasets
* Craig Venter's genome (HuRef)
* HuRef + variants -> diploid sample genome
* Simulated Illumina reads (simNGS)
* Contaminated genome (HuRef reads + James Watson's genome reads)
* Noiseless ground truth data (VCF files)

Mouse datasets
* Canonical mouse reference + Mouse Genome Project
* Paired-end reads from the B6 strain, VCF derived from 2 references
* Short reads contaminated with human genome (same sequencing tech)

Sampled human dataset
* Well-studied human genomes: European female, Nigerian male, Nigerian female

Evaluation Metrics
* Accuracy
  - recall = probability of calling a validated variant
  - precision = probability that a called variant is correct
* Performance
  - Hours per genome
  - Dollars per genome

Tools Evaluated
* SNV: GATK, mpileup
* Structural variants: Pindel, BreakDancer
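As a rough illustration of the two accuracy metrics above, here is a minimal Python sketch. It assumes variants can be compared by exact match on (chromosome, position, ref allele, alt allele). The `score_calls` helper, the `Variant` key and the toy variants are made up for illustration and are not taken from the paper; a real benchmark also has to handle equivalent representations of the same variant, which this sketch ignores.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical variant key: (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


class Accuracy(NamedTuple):
    precision: float
    recall: float


def score_calls(truth: Set[Variant], called: Set[Variant]) -> Accuracy:
    """Score a call set against a validated (ground truth) set by exact matching."""
    true_positives = len(truth & called)
    # precision: fraction of called variants that are correct
    precision = true_positives / len(called) if called else 0.0
    # recall: fraction of validated variants that were called
    recall = true_positives / len(truth) if truth else 0.0
    return Accuracy(precision, recall)


# Toy data, invented for this example only.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr2", 90, "T", "C"),
}
called = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr3", 7, "A", "C"),
}

result = score_calls(truth, called)
print(f"precision={result.precision:.2f} recall={result.recall:.2f}")
```

Here two of the three calls are correct and two of the four validated variants are found, so a caller can look precise while still missing half of the truth set.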

Results: SNP calling
* GATK is computationally expensive
* The effect of contamination is more visible on synthetic data
* GATK is more robust to contamination

Results: Indel calling
* Accuracy: GATK + Pindel far better than mpileup
* Contamination: gain in precision, loss in recall
* Both algorithms predicted fewer indels on the contaminated sets

Results: Structural variant calling
* SV accuracy is much lower than for SNPs and indels
* Long insertions = low accuracy
* BreakDancer does better on the sampled human datasets because the Venter and mouse datasets contain many short structural deletions

Results: Computational performance
* Amazon AWS platform
* GATK memory usage fluctuates
* GATK requires a large amount of disk space
* BreakDancer's output is more compact than VCF and requires little space
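The "dollars per genome" metric can be read as run time multiplied by the price of the machines used. A minimal sketch, assuming flat hourly billing per instance; the function name, the price and the run time below are hypothetical placeholders, not figures from the paper or from AWS.

```python
def dollars_per_genome(hours: float, hourly_price_usd: float, instances: int = 1) -> float:
    """Cost of one run, assuming flat hourly billing per instance (an assumption)."""
    if hours < 0 or hourly_price_usd < 0 or instances < 1:
        raise ValueError("hours and price must be non-negative, instances at least 1")
    return hours * hourly_price_usd * instances


# Placeholder numbers, for illustration only.
cost = dollars_per_genome(hours=10.0, hourly_price_usd=2.0, instances=1)
print(f"{cost:.2f} USD per genome")
```

Under this reading the dollar figure depends as much on the chosen instance type as on the tool, which is one reason to treat it with caution as a performance measure.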

What is missing in the paper / What I didn't like
- The paper would be more valuable if more tools were included
- Performance could be demonstrated differently
- AWS is not intended for measuring performance: were the authors trying to be fancy?
- You can tell the authors are not visual!

Overall
- More effort is needed to benchmark callers
- Sounds very similar to DREAM
- The authors listed benchmarking efforts I didn't know about

Radhouane Aniba
