Algo/tools Results

Synthetic datasets
* Craig Venter's genome (HuRef)
* HuRef + variants -> diploid sample genome
* Simulated Illumina reads (simNGS)
* Contaminated genome (HuRef reads + James Watson's genome reads)
* Noiseless ground-truth data (VCF files)

Mouse datasets
* Canonical mouse reference + Mouse Genome Project
* Paired-end reads from the B6 strain, VCF derived from 2 references
* Short reads contaminated with human genome (same sequencing tech)

Sampled Human Dataset
* Well-studied human genomes: European female, Nigerian male, Nigerian female

Evaluation Metrics
* Accuracy
  - recall = probability that a validated variant is called
  - precision = probability that a called variant is correct
* Performance
  - Hours per genome
  - Dollars per genome

Tools Evaluated
* SNV: GATK, mpileup
* Structural variants: Pindel, BreakDancer

Computational Performance
* Amazon AWS platform
* GATK memory usage fluctuates
* GATK requires a large amount of disk space
* BreakDancer's output is more compact than VCF and requires little disk space
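The accuracy and cost metrics listed under Evaluation Metrics can be illustrated with a minimal sketch. It assumes variants are compared by exact match on (chromosome, position, ref, alt), which is a simplification: a real benchmark would also have to reconcile equivalent representations of the same variant. The names `score_calls` and `dollars_per_genome`, the toy variants, and the hourly rate are all hypothetical and are not taken from the slides.

```python
from typing import NamedTuple, Set, Tuple

# A variant keyed by (chromosome, position, ref allele, alt allele).
# Exact-match comparison is an assumption made for this sketch.
Variant = Tuple[str, int, str, str]


class Accuracy(NamedTuple):
    precision: float
    recall: float


def score_calls(called: Set[Variant], validated: Set[Variant]) -> Accuracy:
    """Compare a caller's output against a validated (ground-truth) set."""
    true_positives = len(called & validated)
    # precision: fraction of called variants that are correct
    precision = true_positives / len(called) if called else 0.0
    # recall: fraction of validated variants that were called
    recall = true_positives / len(validated) if validated else 0.0
    return Accuracy(precision, recall)


def dollars_per_genome(hours_per_genome: float, dollars_per_hour: float) -> float:
    """Cost metric: wall-clock hours times a (hypothetical) hourly machine rate."""
    return hours_per_genome * dollars_per_hour


# Toy data, invented for illustration only.
validated = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "A"),
    ("chr2", 300, "T", "C"),
}
called = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr3", 10, "A", "C"),  # false positive: not in the validated set
}

result = score_calls(called, validated)
print(f"precision={result.precision:.3f} recall={result.recall:.3f}")
```

Here two of the three called variants are validated, and two of the four validated variants are recovered, so the caller trades some recall for precision. The same pair of numbers can be computed per tool (for example GATK vs. mpileup) and per dataset to make the comparisons on the slides concrete.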