PAGXXVI: TrioBinning

Sergey Koren
January 16, 2018

Combined PAG XXVI presentations from the Genome Informatics Section, outlining the trio binning strategy for producing complete haplotypes from a single genome.

Transcript

  1. TrioBinning: Trio-based assembly
    How I stopped worrying and learned to love the F1
    Genome Informatics Section, NHGRI

  2. What is wrong with inbred genomes?
    } Incomplete inbreeding
    } Heterozygosity important for fitness
    } Mixture of homozygous and collapsed heterozygous regions
    } Incomplete phasing
    } No association of blocks to a haplotype
    } Short phase blocks
    } Missed diverged heterozygous regions

  3. Variant Terminology
    https://support.10xgenomics.com/de-novo-assembly/software/pipelines/latest/output/generating
    } Pseudohaplotypes
    } Random path through variants
    } Not phased but long
    } FALCON primary contigs are an example
    } Megabubbles
    } Variants output separately
    } Phased but short
    } Homozygous regions are single-copy
    } FALCON associated “haplotigs” report only one half of each bubble
    } Haplotigs
    } Consistent path through each haplotype
    } Homozygous regions represented twice
    } Each set of haplotigs is a complete representation of a single haplotype

  4. Trio Binning
    • K-mer profiling of each parent (Illumina, 60x): Dam k-mers, Sire k-mers
    • K-mer profiling of the F1 (PacBio, 120x): Angus x Brahman F1
    • Two read bins: 49.6% (67.3x), 10.9 kb and 49.3% (66.9x), 11.7 kb
    • Remaining reads: 1.1% (1.4x), avg 1.3 kb
    • canu: Dam (Brahman) haplotigs, Sire (Angus) haplotigs
    [Plot: marker density (0.001 to 0.100) against read length (10 to 100000) at error rates of 4%, 8%, 12%, and 14%; Human and A. thaliana marked]
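
The binning step on this slide can be sketched in a few lines. This is a minimal illustration, not the canu implementation: the function names (`kmers`, `parent_specific`, `bin_read`) and the toy sequences are hypothetical, only the forward strand is considered, and a real pipeline would also need to handle reverse complements and filter out low-count k-mers caused by sequencing errors.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, for brevity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def parent_specific(dam_reads, sire_reads, k):
    """Return (dam_only, sire_only): k-mers seen in exactly one parent."""
    dam = {km for read in dam_reads for km in kmers(read, k)}
    sire = {km for read in sire_reads for km in kmers(read, k)}
    return dam - sire, sire - dam


def bin_read(read, dam_only, sire_only, k):
    """Assign an F1 read to 'dam', 'sire', or 'unassigned' by marker count."""
    hits = Counter()
    for km in kmers(read, k):
        if km in dam_only:
            hits["dam"] += 1
        elif km in sire_only:
            hits["sire"] += 1
    if hits["dam"] > hits["sire"]:
        return "dam"
    if hits["sire"] > hits["dam"]:
        return "sire"
    return "unassigned"  # no markers, or a tie


# Toy parents that differ by a single SNP (A vs T at the fifth base).
dam_only, sire_only = parent_specific(["GATTACAGGC"], ["GATTTCAGGC"], k=4)
for f1_read in ["ATTACAGG", "TTTCAGGC", "CAGGC"]:
    print(f1_read, bin_read(f1_read, dam_only, sire_only, k=4))
```

In the toy data the single SNP yields four dam-only and four sire-only 4-mers, and the third read carries no marker at all, which is how short reads end up outside both bins.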

  5. Classification with sequencing error
    Pick the smallest k that avoids random collisions for the given genome size, so that as many k-mers as possible survive sequencing errors
    } K-mers sensitive to SVs and SNPs
    } Each SNP == k k-mers
    [Plot: marker density (0.001 to 0.100) against read length (10 to 100000) at error rates of 4%, 8%, 12%, and 14%; Human and A. thaliana marked]
    } Expect
    } 90% confidence that reads ≥ 5 kbp have at least one k-mer
    } Observe
    } 87.4% of all bases
    } avg read length 12 kbp
    } 90% of all bases in reads ≥ 5 kbp
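
The trade-off on this slide (a longer k avoids random collisions, a shorter k survives more sequencing errors) can be explored with a small model. This is a sketch under stated assumptions, not the calculation behind the plot: `min_k`, `kmer_survival`, and `prob_read_has_marker` are hypothetical names, "marker density" is taken to mean the fraction of k-mer start positions that are haplotype-specific, errors are treated as independent per base, and overlapping k-mers are treated as independent.

```python
def min_k(genome_size, max_collision_prob):
    """Smallest k for which genome_size / 4**k <= max_collision_prob."""
    k = 1
    while 4 ** k * max_collision_prob < genome_size:
        k += 1
    return k


def kmer_survival(k, error_rate):
    """Probability that all k bases of a k-mer are read without error."""
    return (1.0 - error_rate) ** k


def prob_read_has_marker(read_len, marker_density, error_rate, k):
    """P(read contains at least one intact marker k-mer).

    Simplification: each k-mer start position is treated as independent.
    """
    positions = max(read_len - k + 1, 0)
    p = marker_density * kmer_survival(k, error_rate)
    return 1.0 - (1.0 - p) ** positions


k = min_k(2_700_000_000, 0.001)  # 2.7 Gb genome, hypothetical 0.1% collision bound
for read_len in (1_000, 5_000, 12_000):
    print(read_len, prob_read_has_marker(read_len, 0.001, 0.12, k))
```

The collision bound and the density and error values in the usage lines are illustrative choices, not figures taken from the slides.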

  6. Validation on A. thaliana COLxCVI F1

  7. A. thaliana FALCON-Unzip vs TrioBinning
    TrioBinning NG50 = 7.8 Mbp, FALCON-Unzip NG50 = 5.5 Mbp (diploid genome size)

  8. Comparing H. sapiens NA12878 10x vs TrioBinning
    TrioBinning NG50 = 1.2 Mbp, 10x contig NG50 = 0.1 Mbp
    [Figure: two panels labeled (mother) and two labeled (father)]

  9. Short reads don’t cut it
    Corrected phase block NG50: TrioBinning: 12.92 Mbp, 10x: 4.26 Mbp
    Alus
    LINEs

  10. MHC Comparison
    10X average edit distance: 45.25 bp, TrioBinning average edit distance: 0.1 bp
    [Figure: MHC comparison. Supernova tracks: Pseudohap1 (paternal) and Pseudohap2, with segments labeled P, M, or ?. Trio Binning tracks: Maternal and Paternal. Additional labels: Hap1, Hap1I]

  11. Class II
    Supernova Trio Binning

  12. B. taurus FALCON-Unzip vs TrioBinning
    #10MbHaplotigClub

  13. What do you miss with a poor reference?
    } UMD3 vs Nelore (B. indicus)
    } No variants > 200 bp
    } UMD3 vs Brahman (maternal)
    } No variants > 1 kbp
    } Father (B. taurus) vs Mother (B. indicus)
    } Complete profile
    [Figure labels: LINE, tRNA-Core-RTE (BovA), RTE-BovB]

  14. Two cattle genomes
    First haplotig N50 > 20 Mb ever!!
    *NG50: Adjusted N50 for Genome Size 2.7 Gb

    Assembly         Size (Gb)   # of Contigs (k)
    UMD3.1.1         2.6         75.4
    BTau 5.0.1       2.7         42.5
    Brahman          2.7         1.6
    Angus            2.6         1.7
    ARS-UCD1.0.19    2.7         2.7

    [Bar chart: NG50 and Max (Mb), axis 0 to 120, for Bos taurus ref, trio binning, and private new ref assemblies; bar labels in extraction order: 0.1, 0.3, 23.4, 26.6, 25.2, 1.2, 7.2, 79.2, 85.9, 104.8]
    [Bar chart: BUSCO Genes (0 to 4,000) for Angus, Brahman, and ARS19; categories Single-copy, Duplicated, Fragmented]
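
NG50, as footnoted on this slide, is N50 measured against a fixed genome size rather than against the assembly's own total length. A minimal sketch of both statistics follows; the function names and the toy contig lengths are illustrative, not taken from the assemblies above.

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which the running total, taken from the
    longest contig down, first reaches half of genome_size.
    Returns 0 if the assembly never covers half the genome."""
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0


def n50(contig_lengths):
    """N50 is NG50 with the assembly's own total length as genome size."""
    return ng50(contig_lengths, sum(contig_lengths))


contigs = [5, 4, 3, 2, 1]      # toy lengths, total 15
print(n50(contigs))            # half of 15 is reached at the contig of length 4
print(ng50(contigs, 20))       # half of 20 is reached at the contig of length 3
```

Because the genome size is fixed, NG50 values are comparable between assemblies of different total size, which N50 values are not.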

  15. Errors are common in UMD reference
    • Counting variations shared in both Brahman and Angus (< 50 kb)
    • 3,178 inversions shared in Brahman and Angus haplotype (mean size 9.5 kb)
    • 2 to 6% of each chromosome will be lost
    • Discrepancy mostly goes away when comparing to the latest ARS19
    [Plot axis: % of chromosome]

  16. Measuring heterozygosity
    } Gene annotation
    } Lifted over 28,556 UMD3 RefSeq genes downloaded from BovineGenome.org
    } Genes in Angus assembly
    } 16,434 genes completely lifted over
    } 8,406 / 8,466 genes healed from gaps
    } Genes on chrY not lifted over
    } Genes in Brahman assembly
    } 18,105 genes completely lifted over
    } 9,366 / 9,401 genes healed from gaps
    } Heterozygosity (%)
    } Measuring SNPs, short INDELs, SVs when comparing Brahman and Angus assemblies
    } For each variation called in Brahman (D) and Angus (S):
    } Heterozygosity = 100 × ∑ max(D, S) / (1 M + e)
    } where e = max(D, S) − min(D, S), extra sequence not in the 1 M frame
    [Diagram: Brahman and Angus sequences over a 1 M frame, marking D, S, and e]
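
As a worked example of the heterozygosity formula, the sketch below evaluates it for one window. It rests on one reading of the slide that the slide itself does not spell out: D and S are taken as the lengths of a variant in Brahman and Angus, and e is summed over all variants in the window. The function name and the sample variants are hypothetical.

```python
def heterozygosity(variants, frame=1_000_000):
    """Percent heterozygosity for one frame.

    variants: list of (D, S) pairs, the variant length in Brahman (D)
    and in Angus (S). Assumption: e is summed over all variants.
    """
    differing = sum(max(d, s) for d, s in variants)
    extra = sum(max(d, s) - min(d, s) for d, s in variants)
    return 100.0 * differing / (frame + extra)


# 1,000 SNPs (D = S = 1) in a 1 M frame: e stays 0.
print(heterozygosity([(1, 1)] * 1000))

# Adding one 5 kb insertion present only in Brahman extends the frame by e = 5000.
print(heterozygosity([(1, 1)] * 1000 + [(5000, 0)]))
```

Adding e to the denominator keeps the value at or below 100% even when large insertions contribute more differing bases than the 1 M frame alone would hold.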

  17. MHC Class II of Angus and Brahman
    chr23 24 - 26 M
    Heterozygosity: 14.26 %
    Bovine MHC Class II
    UMD3 (Hereford)
    Angus
    UMD3 (Hereford)
    Brahman
    QTL:
    Milk fatty acid
    Meat fatty acid C:14, C20
    ELOVL5
    ?

  18. KAT result on B. taurus
    https://kat.readthedocs.io/en/latest/
    Trio Binning
    FALCON-Unzip

  19. A new strategy to generate references?
    } No inbreeding is ever perfect
    } Time consuming
    } Wrong strategy
    } Select most outbred individual along with parents to improve haplotype resolution
    } Get two full haplotypes phased across full genome
    } With sufficient coverage, greater continuity than assembling without trio information
    } Minimal additional cost of two Illumina libraries
    } Can also work with ancestral/survey data
    } Limited where parents and child all share the same heterozygous genotype (e.g. 0/1 genotype in all)
    } Trio approach cannot resolve unless spanned by reads
    ¨ Select more outbred individual
    ¨ Sequence with longer reads
    } Sequence/assembler agnostic
    } Polish/gap-fill as before using haplotype-assigned sequences
    } Combine with Hi-C to get haplotype resolved chromosomes
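
The limitation noted on this slide (the same 0/1 genotype in all three individuals) can be illustrated at the level of a single site. This is a hypothetical sketch, not part of any tool: genotypes are given as allele tuples, and an allele counts as a usable marker only if it appears in exactly one parent.

```python
def parent_specific_alleles(mother, father):
    """Return (mother_only, father_only) allele sets for one site."""
    m, f = set(mother), set(father)
    return m - f, f - m


sites = {
    "fixed difference": ((0, 0), (1, 1)),
    "one heterozygous parent": ((0, 0), (0, 1)),
    "0/1 in both parents": ((0, 1), (0, 1)),
}
for name, (mother, father) in sites.items():
    mother_only, father_only = parent_specific_alleles(mother, father)
    informative = bool(mother_only or father_only)
    print(f"{name}: informative={informative}")
```

When both parents carry both alleles, neither allele is specific to one parent, so the site contributes no marker, and only reads that also span an informative site nearby can be assigned.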

  20. Acknowledgements
    genomeinformatics.github.io
    } Adam Phillippy
    } Sergey Koren
    } Arang Rhie
    } Brian Walenz
    } Alexander Dilthey
    } Brian Ondov
    canu.readthedocs.io
    } Adam Phillippy
    } Sergey Koren
    } Brian Walenz
    } Konstantin Berlin
    } Jason Miller
    } Cow F1 collaborators
    } Tim Smith
    } John Williams
    } Sarah Kingan
