Analysis of the VGP 1st data release assemblies

GenomeArk
January 16, 2019

Arang Rhie, Adam Phillippy, and the VGP Assembly Working Group
NIH, Bethesda, MD, USA

Last year, we evaluated sequencing platforms and assembly methods. A year later, we have made progress and completed our first VGP data release of 15 genomes. In the first part of my presentation, I summarize what happened over the past year. The second part shows some preliminary results from the 1st data release, along with the status of the assemblies. Looking forward, we are now in a position to redefine the VGP assembly quality metrics in addition to 3.4.2.QV40. The following slides are meant to help lead the discussion on defining a ‘high quality genome assembly’. Of note, the last slide was made interactively and is not complete. Any comments or thoughts are welcome!

Transcript

  1. Arang Rhie, Adam Phillippy and the VGP Assembly Working Group
    NHGRI, NIH, Bethesda, MD, USA
    Jan. 16th 2019
    Analysis of the VGP 1st data release assemblies

  2. The Assembly Working Group
    Eric D. Jarvis
    Olivier Fedrigo
    Sadye Paez
    Adam M. Phillippy
    Arang Rhie
    Sergey Koren
    Zemin Ning
    Kerstin Howe
    William Chow
    Harris Lewin
    Joana Damas
    Richard Durbin
    Shane McCarthy
    Gene Myers
    Martin Pippel
    Marcela U-Silva
    Jonas Korlach
    Ivan Sovic
    Christopher Dunn
    Sarah Kingan
    Maria Simbirsky
    Brett Hannigan
    Siddarth Selvaraj
    Guojie Zhang
    Yang Zhou
    Chai Fungtammasan

  3. Contents
    • What has been done over the past year?
    • Remaining challenges for the next pipeline
    • Discussion on ‘high quality genome assembly’

  4. Journey of the VGP assembly working group
    G10K Workshop
    @ PAG Jan, 2018
    G10K Meeting
    @ Rockefeller Sep, 2018
    G10K Workshop
    @ PAG Jan, 2019
    Evaluating platforms & tools for assembly
    Proposed trio binning to resolve phasing

  5. Journey of the VGP assembly working group
    G10K Workshop
    @ PAG Jan, 2018
    G10K Meeting
    @ Rockefeller Sep, 2018
    G10K Workshop
    @ PAG Jan, 2019
    Agreed on a pipeline
    Chose 17 genomes
    Completed contigs
    Haplotig purging
    Contamination found during curation
    Chin et al., Nat. Methods (2016)

  6. Journey of the VGP assembly working group
    G10K Workshop
    @ PAG Jan, 2018
    G10K Meeting
    @ Rockefeller Sep, 2018
    G10K Workshop
    @ PAG Jan, 2019
    Announcing the 1st data release (15 genomes)
    14 scaffolded / 13 lightly curated
    Discussion on defining ‘completeness’

  7. Journey of the VGP assembly working group
    G10K Workshop
    @ PAG Jan, 2018
    G10K Meeting
    @ Rockefeller Sep, 2018
    G10K Workshop
    @ PAG Jan, 2019
    Started training
    Recruited volunteers

  8. Welcome our 1st volunteers!
    Marcela Uliano
    Maxmilian Driller
    Chul Lee
    Giulio Formenti
    Simona Secomandi
    Univ. of Milan
    Freie Univ. Berlin
    Seoul Nat. Univ.
    Calvinna Caswara
    Majid Vafadar
    Chai Fungtammasan
    Nicholas Hill

  9. Journey of the VGP assembly working group
    G10K Workshop
    @ PAG Jan, 2018
    G10K Meeting
    @ Rockefeller Sep, 2018
    G10K Workshop
    @ PAG Jan, 2019
    Where are we now?

  10. The Vertebrate Genomes Project Pipeline
    Rhie and VGP Assembly Working Group, in preparation
    [Figure: pipeline diagram. Labels: PacBio; 10XG; Contigging + Purging; Scaffolding (×3); BioNano; Hi-C; Polishing; Gap-filling & Curation; Final assembly; Primary; Alternate; exon 1, exon 2, exon 3]

  11. Summary Status and Statistics
    • Most of our 1st data release assemblies meet our 3.4.2.QV40 quality goals
    [Figure: Contig N50 and Scaffold N50 for each assembly, plotted on a log-scale axis from 0.1 to 100.0]

  12. How much better are we?

  13. The VGP finch genomes
    [Figure: two panels, bTaeGut1 and bTaeGut2, each comparing the Sanger ref. with the VGP Primary asm. Each box = Chr; Chr Z, Chr 2, and Chr 1+Chr1B are labeled in both panels.]
    Callouts: “I have both Z and W”; “I am the same”
    bTaeGut1
    Contig N50 = 12.0 Mb, Scaffold N50 = 58.4 Mb
    Contig N50 = 4.0 Mb, Scaffold N50 = 67.4 Mb

  14. Hunting down Z and W from trios
    CR1
    Paternal
    Maternal

  15. RNA/Iso-Seq confirms allele specific expression
    [Figure: browser views of Chr. W : 382 – 461 k and Chr. Z : 382 – 461 k (bTaeGut2 W and bTaeGut2 Z), showing TXNL1, ST8SIA3, and WDR7 with Brain and Ovary tracks. Track labels: bTaeGut2 Brain IsoSeq, bTaeGut1 Brain IsoSeq. Coverage labels: ~16x, ~25x, ~100x, ~200x]

  16. Challenges for the next pipeline

  17. The genome assembly problem
    Esperanza
    Molly, yak dam
    Duke, highland sire
    ~1% heterozygosity

  18. Smashed haplotype

  19. Pseudo-haplotype

  20. Complete haplotypes

  21. Non-Trio VGP scaffolds Trio-binning VGP scaffolds
    Z
    W
    2 genomes in 1 genome
    Paternal
    Paternal
    Maternal Maternal

  22. Kronenberg et al., FALCON-Phase: Integrating PacBio and Hi-C data for phased diploid genomes, BioRxiv (2018)
    FALCON-Phase
    Trio-binning
    FALCON-Phase as an alternative?
    • Investigating ways to improve results for less heterozygous genomes
    HG002 (0.17)
    Angus x Brahman (0.93)
    bTaeGut2 (1.2)

  23. Interleaving scaffolding problem
    BioNano
    Scaffolding
    Contig with no label
    Too short to properly orient with Hi-C
    Boundaries too repetitive to place with 10X
    Pairing haplotype

  24. bCalAnn1 v1.h
    10XG
    Longranger
    10XG
    Longranger
    PacBio
    arrow
    maxhits=10
    randombest
    PacBio
    arrow
    maxhits=10
    randombest
    TLK1
    bCalAnn1 v1.p
    Left-out sequences from polishing
    All reads attracted to alts

  25. Summary
    • Most genomes meet the initial quality standard
    • Some genomes far exceed it
    • Challenges remaining
    • What’s the definition of “Chromosome-scale”?
    • Integrated pipeline for scaffolding and polishing

  26. Discussion

  27. Quality Standards - Completeness
    • NG stat
      • Based on what genome size? Haplotigs, contigs, scaffolds.
    • K-mer completeness
      • Spectra-cn: how much of the sequencing set has been seen in our assemblies? How much are we missing? Completeness of the heterozygous region?
    • Completeness of core genes
      • BUSCO
    • Completeness of repeats
      • Estimate through self-alignment? Something like the LTR Assembly Index (https://doi.org/10.1093/nar/gky730)
    • Completeness of chromosomes
      • Telomere / centromere validation? Definition of “chromosome-scale”?
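
The "NG stat" question above (based on what genome size?) can be made concrete with a minimal sketch. The function name `nx` and the toy lengths are illustrative assumptions, not part of the VGP tooling. With no genome size the function reports N50 against the assembly's own total length; with an estimated genome size it reports NG50.

```python
from typing import Iterable, Optional


def nx(lengths: Iterable[int], genome_size: Optional[int] = None,
       fraction: float = 0.5) -> int:
    """Return N50 (genome_size is None) or NG50 (genome_size given).

    The statistic is the length of the sequence at which the running sum of
    lengths, sorted longest first, reaches `fraction` of the reference total.
    """
    sorted_lengths = sorted(lengths, reverse=True)
    total = genome_size if genome_size is not None else sum(sorted_lengths)
    target = total * fraction
    running = 0
    for length in sorted_lengths:
        running += length
        if running >= target:
            return length
    # The assembly is shorter than the requested fraction of genome_size.
    return 0


contigs = [50, 40, 30, 20, 10]      # toy contig lengths
print(nx(contigs))                  # N50 against assembly length (150) -> 40
print(nx(contigs, genome_size=200)) # NG50 against an assumed genome size -> 30
print(nx(contigs, genome_size=400)) # assembly too small to reach 50% -> 0
```

The same lengths give different answers depending on the denominator, which is why the choice of genome size (and of haplotigs, contigs, or scaffolds as input) has to be fixed before values are compared across assemblies.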

  28. K-mers as a measure of completeness
    • K-mers only in assembly (misassembled bps)
    • Haplotype completeness
    • Over-assembled (duplications)
    • Repeat copies ~ exp. copies?
    KAT Spectra-cn plots: https://github.com/TGAC/KAT
    Mapleson et al., Bioinformatics (2016)
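
As a rough illustration of the idea, and not of how KAT itself works, the sketch below counts canonical k-mers in plain Python. The function names, the `min_count` cutoff used to discard likely sequencing-error k-mers, and the toy sequences are all assumptions made for this example. It reports the fraction of reliable read k-mers present in the assembly, and the number of k-mers found only in the assembly.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Yield the canonical form (min of k-mer and reverse complement) of each k-mer."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
            rc = kmer.translate(_COMPLEMENT)[::-1]
            yield min(kmer, rc)


def kmer_completeness(reads, assembly, k=21, min_count=2):
    """Return (completeness, assembly_only_count).

    completeness: fraction of read k-mers seen at least `min_count` times
    that also occur in the assembly.
    assembly_only_count: distinct assembly k-mers never seen in the reads.
    """
    read_counts = Counter(km for r in reads for km in canonical_kmers(r, k))
    reliable = {km for km, c in read_counts.items() if c >= min_count}
    asm_kmers = {km for s in assembly for km in canonical_kmers(s, k)}
    completeness = len(reliable & asm_kmers) / len(reliable) if reliable else 0.0
    assembly_only = len(asm_kmers.difference(read_counts))
    return completeness, assembly_only


reads = ["ACGTAC", "ACGTAC", "GGGTTT", "GGGTTT"]  # toy read set
assembly = ["ACGTACCA"]                            # toy assembly
print(kmer_completeness(reads, assembly, k=4))     # (0.5, 2)
```

In the toy case half of the read k-mers are missing from the assembly, and two assembly k-mers have no read support, which corresponds to the "K-mers only in assembly" category on the slide.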

  29. Quality Standards - Correctness
    • Mapped coverage as a function of number of supporting platforms
      • What fraction of the genome is supported by >1 >2 >3 >4 platforms? Plot with x={>1 >2 >3 >4}, y=cumulative sum of covered genome fraction
      • XX% bases have been polished, XX bases left unpolished (unmappable)
      • Estimate XX bases are QV>40
    • Hi-C maps
      • Look as expected
    • Comparative alignments to related species
      • Agreement between species within the evolutionary divergence time of XX yrs?
    • NGA50 (aligned segment NG50)
    • ML-based quality estimator for both structure and base quality
      • Outputs two probabilities per base: prob that base is accurate and prob that surrounding ~1,000 bp is structurally correct
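
The platform-support plot proposed above could be computed along these lines. This is only a sketch under stated assumptions: the per-base support counts are taken as already available (how they are derived from the alignments is not covered here), the thresholds are read as "at least n platforms", and the function name `support_fractions` is hypothetical.

```python
from collections import Counter


def support_fractions(support_per_base, max_platforms=4):
    """Return {n: fraction of bases supported by at least n platforms}.

    support_per_base: one integer per base, the number of platforms whose
    data support that base (assumed to be precomputed).
    """
    total = len(support_per_base)
    # Clamp counts so that anything above max_platforms lands in the top bin.
    hist = Counter(min(s, max_platforms) for s in support_per_base)
    fractions = {}
    covered = 0
    for n in range(max_platforms, 0, -1):  # accumulate from the top bin down
        covered += hist.get(n, 0)
        fractions[n] = covered / total if total else 0.0
    return dict(sorted(fractions.items()))


toy_support = [0, 1, 1, 2, 3, 4, 4, 4, 2, 3]  # toy per-base support counts
print(support_fractions(toy_support))
# {1: 0.9, 2: 0.7, 3: 0.5, 4: 0.3}
```

Plotting the dictionary keys on the x-axis against the values gives the cumulative curve described on the slide.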

  30. What is “High quality genome assembly”
    Quality | Finished | Reference | HQ-Draft | Draft
    Contig NG50 | = Chr. NG50 | >10 Mb | >1 Mb |
    Scaffold NG50 | = Chr. NG50 | = Chr. NG50 | >10 Mb |
    NGS (Reliable) NG50 | = Chr. NG50 | | |
    Chromosomes | 100% assigned | >95% | >90% |
    k-mer | >99% present | >95% | >90% |
    BUSCO | 100% assigned, Dup ~ haplotypes | >95% | >90% |
    Large genes, diff. to assemble – TITAN, ... | | | |
    Core genes | 100% complete | >90% | >90%, <10% misassembled | >90% found, some truncated / misassembled
    Phased? | 100% of the whole genome | >95% | >90% of haploid predicted region |
    MT | All alleles present | 1 Major allele | 1 allele |
    Sex chromosomes | Present, right order, no gaps | Present, localized hom pairs | At least 1 longer chr present (X or Z) |
    Base QV | 50 | 45 | 40 | 30
    Gaps – 2 types of gaps
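
For the Base QV values in this table, the standard Phred relation QV = -10 · log10(error rate) converts between a quality value and an expected per-base error rate. The helper names below are illustrative only, and the sketch assumes the error count has already been estimated by some other method.

```python
import math


def qv_to_error_rate(qv):
    """Expected per-base error probability for a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def errors_to_qv(errors, total_bases):
    """Phred-scaled QV from an estimated error count over total_bases."""
    if errors == 0:
        return float("inf")  # no observed errors: QV is unbounded
    return -10 * math.log10(errors / total_bases)


for qv in (50, 45, 40, 30):
    # One expected error per 1 / error_rate bases.
    print(qv, round(1 / qv_to_error_rate(qv)))

print(round(errors_to_qv(100, 1_000_000), 2))  # 100 errors per Mb -> 40.0
```

QV40 corresponds to roughly one error per 10,000 bases and QV30 to one per 1,000, so each step of 10 in the table is a tenfold change in error rate.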

  32. VGP standard 1.5 pipeline
    https://github.com/VGP/vgp-assembly
    Assembling / scaffolding
    • Pacbio: FALCON + Unzip + Arrow
      • Primary contigs: c1
      • Alternate haplotigs: c2
    • purge_haplotigs
      • Curated primary: p1
      • Curated haplotigs: p2
      • Alternate combined: q2
    • 10X: scaff10x (2 rounds)
      • Scaffolds: s1
    • Bionano: Solve pipeline (cmaps), Solve hybrid scaffold
      • Scaffolds: s2
    • HiC: Salsa2
      • Scaffolds: s3
    Gap filling / Polishing
    • s4: s3 + q2
    • Polishing: t1~t3
      • t1: Arrow
      • t2: freebayes
      • t3: freebayes
    • t3.p, pri.asm, alt.asm
    Curation
    • gEval + Evol. Highway
    • additional haplotig purging, decontamination, etc.
    • pri.cur, alt.cur
