PAG XXVI: combined presentations from the Genome Informatics section outlining the trio binning strategy for producing complete haplotypes from a single genome.
What is wrong with inbred genomes?
- Incomplete inbreeding
  - Heterozygosity important for fitness
  - Mixture of homozygous and collapsed heterozygous regions
- Incomplete phasing
  - No association of blocks to a haplotype
  - Short phase blocks
  - Missed diverged heterozygous regions
Variant Terminology
https://support.10xgenomics.com/de-novo-assembly/software/pipelines/latest/output/generating
- Megabubbles
  - Variants output separately
  - Phased but short
  - Homozygous regions are single-copy
  - Falcon associated “haplotigs” report only one half of the bubble
- Pseudohaplotypes
  - Random path through variants
  - Not phased but long
  - Falcon primary contigs are an example
- Haplotigs
  - Consistent path through each haplotype
  - Homozygous regions represented twice
  - Each set of haplotigs is a complete representation of a single haplotype
Classification with sequencing error
Pick the minimum k-mer size that, given the genome size, avoids random collisions; keeping k small maximizes the chance that a k-mer survives sequencing error.
- K-mers are sensitive to SVs and SNPs
  - Each SNP == k k-mers
[Figure: axes are marker density (0.001 to 0.100) and read length (10 to 100,000); curves for error rates of 4%, 8%, 12%, and 14%; Human and A. thaliana are marked.]
- Expect
  - 90% confidence that reads ≥ 5 kbp have at least one k-mer
- Observe
  - 87.4% of all bases
  - Average read length 12 kbp
  - 90% of all bases in reads ≥ 5 kbp
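The two quantities on this slide can be roughed out numerically. The sketch below is an illustration under simplifying assumptions that are not stated in the slides: the genome is treated as uniform random sequence (so a given k-mer occurs by chance about G / 4^k times), errors are independent per base, and marker k-mers hit a read as a Poisson process. The function names and the example values for genome size, collision threshold, and marker density are hypothetical.

```python
import math


def min_kmer_size(genome_size: int, max_collision_prob: float) -> int:
    """Smallest k for which a random k-mer is expected to appear by chance
    in the genome with probability at most max_collision_prob.
    Assumes a uniform random genome, so the chance is roughly G / 4**k."""
    k = 1
    while genome_size / 4 ** k > max_collision_prob:
        k += 1
    return k


def kmer_survival(k: int, error_rate: float) -> float:
    """Probability that all k bases of a k-mer are read without error,
    assuming independent per-base errors."""
    return (1.0 - error_rate) ** k


def prob_read_has_marker(read_len: int, marker_density: float,
                         k: int, error_rate: float) -> float:
    """Probability that a read contains at least one error-free marker k-mer.
    marker_density is marker k-mers per base; overlapping k-mers are treated
    as independent (Poisson approximation), which is only a rough model."""
    expected_hits = read_len * marker_density * kmer_survival(k, error_rate)
    return 1.0 - math.exp(-expected_hits)


if __name__ == "__main__":
    k = min_kmer_size(3_100_000_000, 0.001)  # illustrative human-sized genome
    for err in (0.04, 0.08, 0.12, 0.14):     # error rates shown in the figure
        p = prob_read_has_marker(5_000, 0.005, k, err)
        print(f"k={k} error={err:.2f} P(read has marker)={p:.3f}")
```

The trade-off is visible in the two functions: increasing k lowers the random collision rate but also lowers `kmer_survival`, so the smallest k that meets the collision threshold is the natural choice.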
MHC Comparison
- 10X average edit distance: 45.25 bp
- Trio Binning average edit distance: 0.1 bp
[Figure: Supernova pseudohaplotypes (Pseudohap1 (paternal), Pseudohap2) drawn as blocks labeled P, M, or ?, next to the Trio Binning haplotypes labeled Maternal and Paternal.]
What do you miss with a poor reference?
- UMD3 vs Nelore (B. indicus)
  - No variants > 200 bp
- UMD3 vs Brahman (maternal)
  - No variants > 1 kbp
- Father (B. taurus) vs Mother (B. indicus)
  - Complete profile
[Figure labels: LINE, tRNA-Core-RTE (BovA), RTE-BovB]
Errors are common in UMD reference
- Counting variations shared in both Brahman and Angus (< 50 kb)
- 3,178 inversions shared in the Brahman and Angus haplotypes (mean size 9.5 kb)
- 2 to 6% of each chromosome will be lost
- Discrepancy mostly goes away when comparing to the latest ARS19
[Figure: axis label "% of chromosome"]
- Gene annotation
  - Lifted over 28,556 UMD3 RefSeq genes downloaded from BovineGenome.org
  - Genes in Angus assembly
    - 16,434 genes completely lifted over
    - 8,406 / 8,466 genes healed from gaps
    - Genes on chrY not lifted over
  - Genes in Brahman assembly
    - 18,105 genes completely lifted over
    - 9,366 / 9,401 genes healed from gaps

Measuring heterozygosity
- Heterozygosity (%)
  - Measuring SNPs, short INDELs, and SVs when comparing the Brahman and Angus assemblies
  - For each variation called in Brahman (D) and Angus (S):
    Heterozygosity = 100 × Σ max(D, S) / (1M + e)
    where e = max(D, S) - min(D, S), the extra sequence not in the 1M frame
[Figure: Brahman and Angus sequences over a 1M frame, with D, S, and e marked.]
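As a worked illustration of the heterozygosity formula above, the following sketch computes the percentage for one 1 Mbp window. It rests on one reading of the slide that is an assumption: D and S are taken as the number of bases each variant occupies in the Brahman and Angus assemblies, and e is accumulated over every variant in the window. The function name and the example variants are hypothetical, not data from the presentation.

```python
from typing import Iterable, Tuple

WINDOW_BP = 1_000_000  # the "1M" frame from the slide


def heterozygosity_percent(variants: Iterable[Tuple[int, int]],
                           window_bp: int = WINDOW_BP) -> float:
    """Percent heterozygosity for one window.

    variants: (D, S) pairs giving the bases a variant spans in the
    Brahman (D) and Angus (S) assemblies. A SNP is (1, 1); a 300 bp
    insertion present only in Brahman is (300, 0).
    """
    variant_bases = 0  # sum of max(D, S)
    extra_bases = 0    # e: sequence that falls outside the window frame
    for d, s in variants:
        variant_bases += max(d, s)
        extra_bases += max(d, s) - min(d, s)  # assumption: e summed per variant
    return 100.0 * variant_bases / (window_bp + extra_bases)


if __name__ == "__main__":
    snps_only = [(1, 1)] * 1_000
    with_sv = snps_only + [(5_000, 0)]  # hypothetical 5 kbp insertion
    print(heterozygosity_percent(snps_only))
    print(heterozygosity_percent(with_sv))
```

Adding e to the denominator keeps the value below 100% even when large insertions are present, since the inserted bases are counted in both the numerator and the window length.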
A new strategy to generate references?
- No inbreeding is ever perfect
  - Time consuming
  - Wrong strategy
- Select the most outbred individual, along with its parents, to improve haplotype resolution
  - Get two full haplotypes phased across the full genome
  - Given sufficient coverage, greater continuity than assembling without trio information
  - Minimal additional cost of two Illumina libraries
  - Can also work with ancestral/survey data
- Limited in regions where parents and child are all heterozygous (e.g. 0/1 genotype in all)
  - Trio approach cannot resolve these unless they are spanned by reads
    - Select a more outbred individual
    - Sequence with longer reads
- Sequence/assembler agnostic
  - Polish/gap-fill as before using haplotype-assigned sequences
  - Combine with Hi-C to get haplotype-resolved chromosomes
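The binning step at the core of the trio strategy can be sketched in a few lines. This is a minimal illustration, not the Canu implementation: it collects k-mers seen in only one parent, then assigns each read from the child to the parent whose specific k-mers it contains more often. The function names are hypothetical, input is assumed to be uppercase A/C/G/T only, and a practical version would also need to discard low-count parental k-mers that are likely sequencing errors, which is omitted here.

```python
from typing import Iterable, Set, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq: str, k: int) -> Set[str]:
    """All k-mers of seq, each stored as the lexicographically smaller of the
    k-mer and its reverse complement so that strand does not matter."""
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        revcomp = kmer.translate(_COMPLEMENT)[::-1]
        kmers.add(min(kmer, revcomp))
    return kmers


def parent_specific_kmers(maternal_reads: Iterable[str],
                          paternal_reads: Iterable[str],
                          k: int) -> Tuple[Set[str], Set[str]]:
    """K-mers found only in the mother and only in the father."""
    maternal = set().union(*(canonical_kmers(r, k) for r in maternal_reads))
    paternal = set().union(*(canonical_kmers(r, k) for r in paternal_reads))
    return maternal - paternal, paternal - maternal


def classify_read(read: str, maternal_only: Set[str],
                  paternal_only: Set[str], k: int) -> str:
    """Bin a child read by which parent's specific k-mers it shares more of.
    Raw counts are compared; ties and reads with no markers stay unassigned."""
    kmers = canonical_kmers(read, k)
    maternal_hits = len(kmers & maternal_only)
    paternal_hits = len(kmers & paternal_only)
    if maternal_hits > paternal_hits:
        return "maternal"
    if paternal_hits > maternal_hits:
        return "paternal"
    return "unassigned"


if __name__ == "__main__":
    # Toy parental sequences differing by a single SNP (hypothetical data).
    mat_only, pat_only = parent_specific_kmers(["ACGTACGTTAGC"],
                                               ["ACGTACCTTAGC"], k=5)
    for read in ("TACGTTAG", "CTAAGGTA", "ACGTAC"):
        print(read, classify_read(read, mat_only, pat_only, k=5))
```

Reads that land in "unassigned" correspond to the limitation listed above: where a read carries no parent-specific k-mer, the trio information alone cannot place it, and each binned set can then be assembled independently with any assembler.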
Acknowledgements
- genomeinformatics.github.io
  - Adam Phillippy
  - Sergey Koren
  - Arang Rhie
  - Brian Walenz
  - Alexander Dilthey
  - Brian Ondov
- canu.readthedocs.io
  - Adam Phillippy
  - Sergey Koren
  - Brian Walenz
  - Konstantin Berlin
  - Jason Miller
- Cow F1 collaborators
  - Tim Smith
  - John Williams
  - Sarah Kingan