Analysis of the VGP 1st data release assemblies

GenomeArk
January 16, 2019

Arang Rhie, Adam Phillippy, and the VGP Assembly Working Group
NIH, Bethesda, MD, USA

Last year, we evaluated sequencing platforms and assembly methods. A year on, we have made progress and had our first VGP data release of 15 genomes. In the first part of my presentation, I summarize what happened over the past year. The second part shows some preliminary results from the 1st data release, along with the status of the assemblies. Looking forward, we are now in a position to redefine the VGP assembly quality metrics in addition to 3.4.2.QV40. The following slides are meant to help lead the discussion on defining a ‘high quality genome assembly’. Of note, the last slide was made interactively and is not complete. Any comments or thoughts are welcome!

Transcript

  1. Analysis of the VGP 1st data release assemblies

    Arang Rhie, Adam Phillippy and the VGP Assembly Working Group. NHGRI, NIH, Bethesda, MD, USA. Jan. 16th 2019
  2. The Assembly Working Group

    Eric D. Jarvis, Olivier Fedrigo, Sadye Paez, Adam M. Phillippy, Arang Rhie, Sergey Koren, Zemin Ning, Kerstin Howe, William Chow, Harris Lewin, Joana Damas, Richard Durbin, Shane McCarthy, Gene Myers, Martin Pippel, Marcela U-Silva, Jonas Korlach, Ivan Sovic, Christopher Dunn, Sarah Kingan, Maria Simbirsky, Brett Hannigan, Siddarth Selvaraj, Guojie Zhang, Yang Zhou, Chai Fungtammasan
  3. Contents

    • What has been done over the past year?
    • Remaining challenges for the next pipeline
    • Discussion on ‘high quality genome assembly’
  4. Journey of the VGP assembly working group

    G10K Workshop @ PAG, Jan 2018 | G10K Meeting @ Rockefeller, Sep 2018 | G10K Workshop @ PAG, Jan 2019
    • Evaluating platforms & tools for assembly
    • Proposed trio binning to resolve phasing
  5. Journey of the VGP assembly working group

    G10K Workshop @ PAG, Jan 2018 | G10K Meeting @ Rockefeller, Sep 2018 | G10K Workshop @ PAG, Jan 2019
    • Agreed on a pipeline
    • Chose 17 genomes
    • Completed contigs
    • Haplotig purging
    • Contamination found during curation
    Chin et al., Nat Met. (2016)
  6. Journey of the VGP assembly working group

    G10K Workshop @ PAG, Jan 2018 | G10K Meeting @ Rockefeller, Sep 2018 | G10K Workshop @ PAG, Jan 2019
    • Announcing the 1st data release (15 genomes): 14 scaffolded / 13 lightly curated
    • Discussion on defining ‘completeness’
  7. Journey of the VGP assembly working group

    G10K Workshop @ PAG, Jan 2018 | G10K Meeting @ Rockefeller, Sep 2018 | G10K Workshop @ PAG, Jan 2019
    • Started training
    • Recruited volunteers
  8. Welcome our 1st volunteers!

    Marcela Uliano, Maxmilian Driller, Chul Lee, Giulio Formenti, Simona Secomandi, Calvinna Caswara, Majid Vafadar, Chai Fungtammasan, Nicholas Hill
    Affiliations listed on the slide: Univ. of Milan, Freie Univ. Berlin, Seoul Nat. Univ.
  9. Journey of the VGP assembly working group

    G10K Workshop @ PAG, Jan 2018 | G10K Meeting @ Rockefeller, Sep 2018 | G10K Workshop @ PAG, Jan 2019
    • Where are we now?
  10. The Vertebrate Genomes Project Pipeline (Rhie and VGP Assembly Working Group, in preparation)

    [Figure: pipeline diagram. Data types: PacBio, 10XG, BioNano, Hi-C. Stage labels: Contigging + Purging, Scaffolding, Gap-filling & Polishing, Curation, Final assembly. Output labels: Primary, Alternate.]
  11. Summary Status and Statistics

    • Most of our 1st data release assemblies meet our 3.4.2QV40 quality goals
    [Figure: Contig N50 and Scaffold N50 per assembly; axis ticks at 0.1, 1.0, 10.0, 100.0]
  12. How much better are we?

  13. The VGP finch genomes

    [Figure: Sanger ref. compared with the VGP primary asm. for bTaeGut1 and bTaeGut2; each box = Chr; highlighted: Chr Z, Chr 2, Chr 1+Chr1B. Callouts: “I have both Z and W”, “I am the same”. Stats shown: Contig N50 = 12.0 Mb, Scaffold N50 = 58.4 Mb; Contig N50 = 4.0 Mb, Scaffold N50 = 67.4 Mb.]
  14. Hunting down Z and W from trios CR1 Paternal Maternal

  15. RNA/Iso-Seq confirms allele specific expression

    [Figure: bTaeGut2 W and bTaeGut2 Z, region 382 – 461 k on Chr. W and Chr. Z; genes TXNL1, ST8SIA3, WDR7; tracks: Brain, Ovary, bTaeGut1 Brain IsoSeq, bTaeGut2 Brain IsoSeq; coverage labels ~16x, ~25x, ~100x, ~200x]
  16. Challenges for the next pipeline

  17. The genomes assembly problem

    [Figure labels: Esperanza; Molly, yak dam; Duke, highland sire; ~1% heterozygosity]
  18. Smashed haplotype

  19. Pseudo-haplotype

  20. Complete haplotypes

  21. Non-Trio VGP scaffolds vs. Trio-binning VGP scaffolds

    [Figure labels: Z, W; “2 genomes in 1 genome”; Paternal, Maternal]
  22. FALCON-Phase as an alternative?

    • Investigating ways to improve for less het. genomes
    [Figure: FALCON-Phase vs. Trio-binning for HG002 (0.17), Angus x Brahman (0.93), bTaeGut2 (1.2)]
    Kronenberg et al., FALCON-Phase: Integrating PacBio and Hi-C data for phased diploid genomes, bioRxiv (2018)
  23. Interleaving scaffolding problem

    • BioNano scaffolding
    • Contig with no label
    • Too short to properly orient with Hi-C
    • Boundaries too repetitive to place with 10X
    • Pairing haplotype
  24. Left-out sequences from polishing

    [Figure: TLK1 region on bCalAnn1 v1.p and bCalAnn1 v1.h; tracks: 10XG Longranger; PacBio arrow maxhits=10 randombest. Callout: all reads attracted to alts.]
  25. Summary

    • Most genomes meet the initial quality standard
    • Some genomes far exceed it
    • Challenges remaining
      • What’s the definition of “Chromosome-scale”?
      • Integrated pipeline for scaffolding and polishing
  26. Discussion

  27. Quality Standards - Completeness

    • NG stat
      • Based on what genome size? Haplotigs, contigs, scaffolds.
    • K-mer completeness
      • Spectra-cn: how much of the sequencing set has been seen in our assemblies? How much are we missing? Completeness of the heterozygous regions?
    • Completeness of core genes
      • BUSCO
    • Completeness of repeats
      • Estimate through self-alignment? Something like the LTR Assembly Index (https://doi.org/10.1093/nar/gky730)
    • Completeness of chromosomes
      • Telomere / centromere validation? Definition of “chromosome-scale”?
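To make the "based on what genome size?" question concrete, here is a minimal sketch of N50 versus NG50 computed from a list of sequence lengths. The function names and the toy lengths are illustrative assumptions, not part of the VGP pipeline; the only point is that the same assembly yields different NG50 values depending on the genome size chosen as the denominator.

```python
def ng50(lengths, genome_size):
    """Length L such that sequences of length >= L sum to at least
    half of genome_size. Returns 0 if the assembly never reaches half."""
    half = genome_size / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def n50(lengths):
    """N50 is NG50 with the assembly's own total length as genome size."""
    return ng50(lengths, sum(lengths))


# Toy contig lengths (arbitrary units), hypothetical data.
contigs = [50, 40, 30, 20, 10]

print(n50(contigs))        # assembly size (150) as the denominator
print(ng50(contigs, 200))  # a larger estimated genome size lowers the value
print(ng50(contigs, 400))  # assembly covers less than half: reported as 0
```

The same function applies unchanged to haplotig, contig, or scaffold lengths; only the input list and the chosen genome size differ.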
  28. K-mers as a measure of completeness

    • K-mers only in assembly (misassembled bps)
    • Haplotype completeness
    • Over-assembled (duplications)
    • Repeat copies ~ exp. copies?
    KAT Spectra-cn plots: https://github.com/TGAC/KAT (Mapleson et al., Bioinformatics (2016))
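The idea behind these k-mer checks can be sketched on toy data. The example below is not KAT and does not reproduce its spectra-cn plots; it is a simplified illustration under stated assumptions: k-mers are counted on the forward strand only (no canonical/reverse-complement handling), a read k-mer counts as "solid" if seen at least `min_count` times, and all names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_completeness(reads, assembly, k, min_count=2):
    """Return (completeness, assembly_only_kmers).

    completeness: fraction of solid read k-mers found in the assembly.
    assembly_only_kmers: k-mers in the assembly never seen in any read,
    which are candidates for assembly errors.
    """
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    solid = {km for km, count in read_counts.items() if count >= min_count}
    asm = {km for contig in assembly for km in kmers(contig, k)}
    completeness = len(solid & asm) / len(solid) if solid else 0.0
    asm_only = asm - set(read_counts)
    return completeness, asm_only


# Toy reads and a one-contig assembly, hypothetical data.
reads = ["ACGTAC", "ACGTAC", "GTACGG"]
assembly = ["ACGTAA"]
completeness, asm_only = kmer_completeness(reads, assembly, k=3)
print(completeness, asm_only)
```

A real analysis would also compare the copy number of each k-mer in the assembly with its read multiplicity, which is what distinguishes collapsed repeats and over-assembled duplications.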
  29. Quality Standards - Correctness

    • Mapped coverage as a function of number of supporting platforms
      • What fraction of the genome is supported by >1 >2 >3 >4 platforms? Plot with x={>1 >2 >3 >4}, y=cumulative sum of covered genome fraction
    • XX% bases have been polished, XX bases left unpolished (unmappable)
    • Estimate XX bases are QV>40
    • Hi-C maps
      • Look as expected
    • Comparative alignments to related species
      • Agreement between species within the evolutionary divergence time of XX yrs?
    • NGA50 (aligned segment NG50)
    • ML-based quality estimator for both structure and base quality
      • Outputs two probabilities per base: prob. that the base is accurate and prob. that the surrounding ~1,000 bp is structurally correct
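One possible reading of the platform-support plot is sketched below. It assumes the input is a per-base count of supporting platforms (how that count is derived from alignments is left open), and it interprets the x-axis thresholds as "at least n platforms", which is an assumption since the slide writes them as >1 to >4. The function name is hypothetical.

```python
def support_fractions(support_per_base, max_platforms=4):
    """Fraction of bases supported by at least n platforms, for n = 1..max.

    support_per_base: one integer per base giving how many platforms
    (e.g. PacBio, 10XG, BioNano, Hi-C) support that base.
    """
    total = len(support_per_base)
    fractions = {}
    for n in range(1, max_platforms + 1):
        covered = sum(1 for s in support_per_base if s >= n)
        fractions[n] = covered / total if total else 0.0
    return fractions


# Toy per-base support counts for an 8 bp "genome", hypothetical data.
support = [4, 4, 3, 2, 2, 1, 0, 0]
for n, fraction in support_fractions(support).items():
    print(f"supported by >= {n} platforms: {fraction:.3f}")
```

The resulting values are non-increasing in n, so plotting them gives the cumulative curve described on the slide.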
  30. What is “High quality genome assembly”

    Columns: Quality | Finished | Reference | HQ-Draft | Draft (table made interactively; some cells are missing)
    • Contig NG50: = Chr. NG50 | >10 Mb | >1 Mb
    • Scaffold NG50: = Chr. NG50 | = Chr. NG50 | >10 Mb
    • NGS (Reliable) NG50: = Chr. NG50
    • Chromosomes: 100% assigned | >95% | >90%
    • k-mer: >99% present | >95% | >90%
    • BUSCO: 100% assigned, Dup ~ haplotypes | >95% | >90%
    • Large genes, diff. to assemble – TITAN, ...
    • Core genes: 100% complete | >90% | >90%, <10% misassembled | >90% found, some truncated / misassembled
    • Phased?: 100% of the whole genome | >95% | >90% of haploid predicted region
    • MT: All alleles present | 1 major allele | 1 allele
    • Sex chromosomes: Present, right order, no gaps | Present, localized hom pairs | At least 1 longer chr present (X or Z)
    • Base QV: 50 | 45 | 40 | 30
    • Gaps – 2 types of gaps
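For the Base QV row, it may help to see what the values mean as error rates. The sketch below uses the standard Phred definition, QV = -10 · log10(p), where p is the per-base error probability; the function names are illustrative only.

```python
import math


def qv_to_error_rate(qv):
    """Per-base error probability for a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(p):
    """Phred-scaled quality value for a per-base error probability p > 0."""
    return -10 * math.log10(p)


def expected_errors_per_mb(qv):
    """Expected number of erroneous bases per 1,000,000 bases at this QV."""
    return qv_to_error_rate(qv) * 1_000_000


# QV thresholds taken from the Base QV row of the table.
for qv in (50, 45, 40, 30):
    print(f"QV{qv}: ~{expected_errors_per_mb(qv):.1f} expected errors per Mb")
```

Each step of 10 in QV is a tenfold change in error rate, so QV40 corresponds to roughly one error per 10 kb and QV50 to roughly one per 100 kb.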
  32. VGP standard 1.5 pipeline (https://github.com/VGP/vgp-assembly)

    [Figure: pipeline diagram. Labels as shown:]
    • Data and tools: Pacbio (FALCON + Unzip + Arrow); 10X (scaff10x, 2 rounds); Bionano (cmaps, Solve pipeline, Solve hybrid scaffold); HiC (Salsa2)
    • Assembling / scaffolding: Primary contigs: c1; Alternate haplotigs: c2; purge_haplotigs; Curated primary: p1; Curated haplotigs: p2; Alternate combined: q2; Scaffolds: s1; Scaffolds: s2; Scaffolds: s3; s4: s3 + q2
    • Gap filling / Polishing (t1~t3): t1: Arrow; t2: freebayes; t3: freebayes
    • Curation: gEval + Evol. Highway; additional haplotig purging, decontamination, etc.
    • Outputs: pri.asm, alt.asm, t3.p, pri.cur, alt.cur