PAGXXVI: TrioBinning

Sergey Koren
January 16, 2018

Combined PAGXXVI presentations from the Genome Informatics Section, outlining the trio binning strategy for producing complete haplotypes from a single genome.

Transcript

  1. TrioBinning: Trio-based assembly
    How I stopped worrying and learned to love the F1
    Genome Informatics Section, NHGRI
  2. What is wrong with inbred genomes?
    - Incomplete inbreeding
    - Heterozygosity important for fitness
    - Mixture of homozygous and collapsed heterozygous regions
    - Incomplete phasing
    - No association of blocks to a haplotype
    - Short phase blocks
    - Missed diverged heterozygous regions
  3. Variant Terminology
    https://support.10xgenomics.com/de-novo-assembly/software/pipelines/latest/output/generating
    - Megabubbles
      - Variants output separately
      - Phased but short
      - Homozygous regions are single-copy
      - Falcon associated “haplotigs” report only one half of bubble
    - Pseudohaplotypes
      - Random path through variants
      - Not phased but long
      - Falcon primary contigs are an example
    - Haplotigs
      - Consistent path through each haplotype
      - Homozygous regions represented twice
      - Each set of haplotigs is a complete representation of a single haplotype
  4. Trio Binning
    - K-mer profiling of each parent (Illumina, 60x): Dam k-mers, Sire k-mers
    - K-mer profiling of the F1 (PacBio, 120x): Angus x Brahman F1
    - Read bins: 49.6% (67.3x), 10.9 kb; 49.3% (66.9x), 11.7 kb; 1.1% (1.4x), avg 1.3 kb
    - canu: Dam (Brahman) haplotigs, Sire (Angus) haplotigs
    [Figure: marker density vs. read length at error rates of 4%, 8%, 12% and 14%, for Human and A. thaliana]
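The binning step on slide 4 can be illustrated with a toy sketch. This is not canu's implementation: the function names (`kmers`, `parent_specific_kmers`, `bin_read`) are hypothetical, only the forward strand is considered, and the "more hits wins" rule is an assumed simplification of whatever scoring the real tool applies.

```python
def kmers(seq, k):
    """Return the set of k-mers in seq (forward strand only, for brevity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(dam_reads, sire_reads, k):
    """Return (dam_only, sire_only): k-mers seen in one parent but not the other."""
    dam = set().union(*(kmers(r, k) for r in dam_reads))
    sire = set().union(*(kmers(r, k) for r in sire_reads))
    return dam - sire, sire - dam


def bin_read(read, dam_only, sire_only, k):
    """Assign an F1 read to 'dam', 'sire' or 'unassigned' by marker counts.

    Assumed rule: the parent with more specific k-mers in the read wins.
    """
    found = kmers(read, k)
    dam_hits = len(found & dam_only)
    sire_hits = len(found & sire_only)
    if dam_hits > sire_hits:
        return "dam"
    if sire_hits > dam_hits:
        return "sire"
    return "unassigned"  # no parent-specific k-mers, or a tie


# Toy data: the two parental sequences differ by a single base.
dam_only, sire_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTTCGTAA"], k=4)
labels = [bin_read(r, dam_only, sire_only, k=4)
          for r in ("GTACGTAA", "CGTTCGTA", "ACGT")]
```

Each bin would then be assembled independently, which is why the approach does not depend on a particular assembler.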
  5. Classification with sequencing error
    Pick the minimum k-mer size for the genome size: large enough to avoid random collisions, small enough to maximize k-mer survival under sequencing error
    - K-mers sensitive to SVs and SNPs
    - Each SNP == k k-mers
    - Expect
      - 90% confidence reads ≥ 5 kbp have at least one k-mer
    - Observe
      - 87.4% of all bases
      - avg read length 12 kbp
      - 90% of all bases ≥ 5 kbp
    [Figure: marker density vs. read length at error rates of 4%, 8%, 12% and 14%, for Human and A. thaliana]
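The reasoning on slide 5 can be made concrete with a rough sketch. The slides do not give the exact model, so everything here is an assumption: the collision threshold `max_collision_rate`, the Poisson approximation in `p_at_least_one_marker`, and the parameter `marker_density` (parent-specific k-mers per base) are all hypothetical choices for illustration.

```python
import math


def min_k(genome_size, max_collision_rate=0.001):
    """Smallest k with genome_size / 4**k <= max_collision_rate.

    The threshold is an assumed value, not one taken from the slides.
    """
    k = 1
    while genome_size > max_collision_rate * 4 ** k:
        k += 1
    return k


def kmers_covering(pos, seq_len, k):
    """Number of k-mer windows in a sequence of seq_len that overlap position pos.

    For a position away from the sequence ends this is k, which is the
    sense in which one isolated SNP changes k k-mers.
    """
    first = max(0, pos - k + 1)
    last = min(pos, seq_len - k)
    return max(0, last - first + 1)


def p_at_least_one_marker(read_len, marker_density, error_rate, k):
    """Probability that a read keeps at least one error-free marker k-mer.

    Assumes independent per-base errors, so a k-mer survives with
    probability (1 - error_rate)**k, and a Poisson number of survivors.
    """
    expected = read_len * marker_density * (1.0 - error_rate) ** k
    return 1.0 - math.exp(-expected)


k = min_k(2_700_000_000)            # a 2.7 Gb genome, as for cattle
covering = kmers_covering(50, 100, k)
p = p_at_least_one_marker(5_000, 0.001, 0.12, k)
```

Under this model, longer reads, denser markers, and lower error rates all raise the chance that a read can be classified, which matches the trade-off shown in the figure.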
  6. A. thaliana: Falcon-unzip vs TrioBinning
    TrioBinning NG50 = 7.8 Mbp, Falcon-unzip = 5.5 Mbp (diploid genome size)
  7. Comparing H. sapiens NA12878: 10X vs TrioBinning
    TrioBinning NG50 = 1.2 Mbp, 10X contig NG50 = 0.1 Mbp
    [Figure panels: (mother), (mother), (father), (father)]
  8. MHC Comparison
    10X average edit distance: 45.25 bp, TrioBinning average edit distance: 0.1 bp
    [Figure: Supernova pseudohaplotypes (Pseudohap1 (paternal), Pseudohap2) with blocks labeled P, M or ?, compared with the Trio Binning maternal and paternal haplotypes]
  9. What do you miss with a poor reference?
    - UMD3 vs Nelore (B. indicus): no variants > 200 bp
    - UMD3 vs Brahman (maternal): no variants > 1 kbp
    - Father (B. taurus) vs Mother (B. indicus): complete profile
    [Figure labels: LINE, tRNA-Core-RTE (BovA), RTE-BovB]
  10. Two cattle genomes
    First haplotig N50 > 20M ever!!

    Assembly        Size (Gb)   # of Contigs (kb)
    UMD3.1.1        2.6         75.4
    BTau 5.0.1      2.7         42.5
    Brahman         2.7         1.6
    Angus           2.6         1.7
    ARS-UCD1.0.19   2.7         2.7

    [Figure: NG50 and Max (Mb) for the Bos taurus ref, trio binning and private new ref assemblies; bar values in extracted order: NG50 0.1, 0.3, 23.4, 26.6, 25.2; Max 1.2, 7.2, 79.2, 85.9, 104.8]
    *NG50: Adjusted N50 for Genome Size 2.7 Gb
    [Figure: BUSCO genes (single-copy, duplicated, fragmented) for Angus, Brahman and ARS19]
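The NG50 statistic used on slide 10 is N50 computed against an assumed genome size rather than the assembly's own total length. A minimal sketch, with hypothetical function names and toy contig lengths that are not taken from the slides:

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which the running total of contigs, sorted
    longest first, reaches half of genome_size. Returns 0 if the assembly
    never covers half of the genome size."""
    half = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def n50(contig_lengths):
    """Plain N50: NG50 with the assembly's own total length as genome size."""
    return ng50(contig_lengths, sum(contig_lengths))


toy = [50, 40, 30, 20, 10]        # toy contig lengths, arbitrary units
toy_n50 = n50(toy)                # half of 150 is reached at the 40 contig
toy_ng50 = ng50(toy, 200)         # half of 200 is reached at the 30 contig
```

Fixing the genome size makes assemblies of different total length comparable, which matters when comparing haplotigs against older references.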
  11. Errors are common in UMD reference
    - Counting variations shared in both Brahman and Angus (<50 kb)
    - 3,178 inversions shared in Brahman and Angus haplotype (mean size 9.5 kb)
    - 2-6% of each chromosome will be lost
    - Discrepancy mostly goes away when comparing to the latest ARS19
    [Figure: % of chromosome]
  12. Measuring heterozygosity
    - Gene annotation
      - Lifted over 28,556 UMD3 RefSeq genes downloaded from BovineGenome.org
      - Genes in Angus assembly
        - 16,434 genes completely lifted over
        - 8,406 / 8,466 genes healed from gaps
        - Genes on chrY not lifted over
      - Genes in Brahman assembly
        - 18,105 genes completely lifted over
        - 9,366 / 9,401 genes healed from gaps
    - Heterozygosity (%)
      - Measuring SNPs, short INDELs, SVs when comparing Brahman and Angus assemblies
      - For each variation called in Brahman (D) and Angus (S):
      - Heterozygosity = 100 × Σ max(D, S) / (1M + e)
      - where e = max(D, S) - min(D, S), extra sequence not in the 1M frame
    [Figure: Brahman and Angus sequences showing the 1M frame, e, D and S]
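The heterozygosity formula on slide 12 can be written out as a short sketch. The slide leaves one detail open, so this is an assumption: `e` is taken here as the sum of max(D, S) - min(D, S) over all variants in the frame. The function name and the `(d, s)` tuple representation (variant length in Brahman, variant length in Angus) are hypothetical.

```python
def heterozygosity(variants, frame=1_000_000):
    """Percent heterozygosity for one frame.

    variants: iterable of (d, s) pairs, the length of each variation in
    the dam (Brahman) and sire (Angus) assembly respectively.
    Assumption: e is summed over every variant in the frame.
    """
    variants = list(variants)
    total = sum(max(d, s) for d, s in variants)
    extra = sum(max(d, s) - min(d, s) for d, s in variants)
    return 100.0 * total / (frame + extra)


# 1,000 SNPs in a 1 Mb frame: each contributes max(1, 1) = 1 and e = 0.
snps_only = heterozygosity([(1, 1)] * 1000)

# One 1 Mb insertion present only in the sire: the frame doubles in size.
big_insertion = heterozygosity([(0, 1_000_000)])
```

Adding `e` to the denominator keeps the value bounded when one haplotype carries large insertions that fall outside the 1M frame.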
  13. MHC Class II of Angus and Brahman
    chr23, 24-26 Mb
    Heterozygosity: 14.26%
    [Figure: Bovine MHC Class II; UMD3 (Hereford) vs Angus and UMD3 (Hereford) vs Brahman; QTL: milk fatty acid, meat fatty acid C:14, C20; ELOVL5 ?]
  14. A new strategy to generate references?
    - No inbreeding is ever perfect
    - Time consuming
    - Wrong strategy
    - Select most outbred individual along with parents to improve haplotype resolution
    - Get two full haplotypes phased across full genome
    - Greater continuity than assembling without trio information with sufficient coverage
    - Minimal additional cost of two Illumina libraries
    - Can also work with ancestral/survey data
    - Limited in regions of parent and child homozygosity (e.g. 0/1 genotype in all)
    - Trio approach cannot resolve unless spanned by reads
      - Select more outbred individual
      - Sequence with longer reads
    - Sequence/assembler agnostic
    - Polish/gap-fill as before using haplotype-assigned sequences
    - Combine with Hi-C to get haplotype resolved chromosomes
  15. Acknowledgements
    - genomeinformatics.github.io: Adam Phillippy, Sergey Koren, Arang Rhie, Brian Walenz, Alexander Dilthey, Brian Ondov
    - canu.readthedocs.io: Adam Phillippy, Sergey Koren, Brian Walenz, Konstantin Berlin, Jason Miller
    - Cow F1 collaborators: Tim Smith, John Williams, Sarah Kingan