Slide 1

Slide 1 text

Haplotype resolved human genomes
Arang Rhie (@ArangRhie)
Adam Phillippy’s Group
Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI

Slide 2

Slide 2 text

The genome assembly problem

Slide 3

Slide 3 text

The diploid genome assembly problem
De novo: from scratch, without looking at the original picture (reference)
Diagram labels: Diploid genome; sequencing; Sequenced reads; assembling; Smashed assembly; Pseudo-haplotype + alts; phasing?; Phased (haploid) assembly

Slide 4

Slide 4 text

Why assemble genomes again, de novo?

Slide 5

Slide 5 text

Under-Represented Variations in GRCh38
Asian-specific insertions and their frequency, found in AK1
Seo, Rhie, Kim, and Lee et al., De novo assembly and phasing of a Korean human genome, Nature (2016)

Slide 6

Slide 6 text

Identify haplotype differences
Haplotypes A and B, Chr. 22
• CYP2D6 is involved in metabolizing >50% of available drugs
• Genetic variation and copy number affect drug efficacy
CYP2D6*10: intermediate to poor metabolizer
CYP2D6*2: extensive metabolizer
Seo, Rhie, Kim, and Lee et al., De novo assembly and phasing of a Korean human genome, Nature (2016)

Slide 7

Slide 7 text

Can we phase across whole chromosomes?
Seo, Rhie, Kim, and Lee et al., De novo assembly and phasing of a Korean human genome, Nature (2016)

Slide 8

Slide 8 text

Complete haplotype-resolved assemblies with trio binning

Slide 9

Slide 9 text

The diploid genome assembly problem
De novo: from scratch, without looking at the original picture (reference)
Diagram labels: Diploid genome; sequencing; Sequenced reads; assembling; Smashed assembly; Complete haplotypes; phasing?; Phased (haploid) assembly

Slide 10

Slide 10 text

The diploid genome assembly problem
De novo: from scratch, without looking at the original picture (reference)
Diagram labels: Diploid genome; sequencing; Phased reads (one set per parent); assembling; Paternal assembly; Maternal assembly

Slide 11

Slide 11 text

Trio binning with parental k-mers
• K-mer profiling of each parent (Illumina, 60x): paternal k-mers and maternal k-mers
• K-mer profiling of the child (PacBio, 120x)
• Child's read binning and assembly (canu): paternal reads and maternal reads are assembled separately into paternal haplotigs and maternal haplotigs
Binned child reads (figure labels: Child, Paternal, Maternal): 49.6% (67.3x), 10.9 kb; 49.3% (66.9x), 11.7 kb; 1.1% (1.4x), avg 1.3 kb
Koren and Rhie et al, De novo assembly of haplotype-resolved genomes with trio binning, Nat. Biotech (2018)
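The binning step on this slide can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the canu implementation: the function names (`kmers`, `haplotype_specific`, `bin_read`) are hypothetical, k-mers are taken from the forward strand only, and reads are assigned by a raw count of parent-specific k-mers. A production method would also need to handle sequencing errors, reverse complements, and differences in the number of specific k-mers per parent.

```python
def kmers(seq, k):
    """Return the set of k-mers in seq (forward strand only, for brevity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def haplotype_specific(paternal_reads, maternal_reads, k):
    """Return (paternal-only, maternal-only) k-mer sets from parental reads."""
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    return pat - mat, mat - pat


def bin_read(read, pat_only, mat_only, k):
    """Assign a child read to the parent whose specific k-mers it shares most."""
    ks = kmers(read, k)
    p = len(ks & pat_only)
    m = len(ks & mat_only)
    if p > m:
        return "paternal"
    if m > p:
        return "maternal"
    return "unassigned"  # no informative k-mers, or a tie


# Toy data: the two parents differ at a single base.
pat_only, mat_only = haplotype_specific(["ACGTACGTAA"], ["ACGTTCGTAA"], k=4)
for child_read in ["CGTACG", "GTTCGT", "ACGT"]:
    print(child_read, bin_read(child_read, pat_only, mat_only, k=4))
```

Reads that carry no parent-specific k-mer fall into the unassigned bin, which matches the small, short-read third bin reported on the slide.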

Slide 12

Slide 12 text

Robust for a wide range of heterozygosity
Heterozygosity values shown in the figure: 0.8%, 1.2%, 1.6%, 0.9%, 1.5%, 0.12%, 0.20%, 0.29%, 0.17%
*Heterozygosity level estimated with GenomeScope
Sample: Platform; Haplotype (Cov.), NG50 (Mb)
NA12878 (CEU) F: PacBio (WashU); Maternal (32+9x), 1.2; Paternal (31+9x), 1.2
HG00733 (PUR) F: PacBio 60kb (20kb); Maternal (44.6x), 19.1; Paternal (43.6x), 23.9
NA19240 (YRI) F: PacBio (WashU); Maternal (37x), 9.0; Paternal (31x), 3.0
HG002 (Ashkenazi) M: PacBio 15kb CCS; Maternal (11+8x), 17.0; Paternal (11+8x), 14.9
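For readers unfamiliar with the NG50 column, here is a minimal Python sketch of how the statistic is commonly computed. NG50 is the length of the contig at which the cumulative length of contigs, sorted longest first, reaches half of the estimated genome size (N50 uses half of the assembly size instead). The function name and the toy numbers are illustrative assumptions, not values from the talk.

```python
def ng50(contig_lengths, genome_size):
    """Return NG50, or None if the assembly covers less than half the genome."""
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return None


# Toy example: four contigs against an assumed genome size of 120.
print(ng50([50, 30, 20, 10], genome_size=120))
```

Because the denominator is the genome size rather than the assembly size, NG50 values from assemblies of the same genome can be compared directly, as in the per-haplotype values listed above.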

Slide 13

Slide 13 text

The HG002 CCS Assembly
HG002 (Ashkenazi) M
Platform: PacBio 15kb CCS
Haplotype (Cov.), NG50 (Mb), Size (Gb)
Maternal (11+8x), 17.0, 3.04
Paternal (11+8x), 14.9, 2.96

Slide 14

Slide 14 text

A nearly perfect diploid genome
125x PacBio coverage (~60x per haplotype), TrioCanu haplotig NG50 ~70 Mbp, BUSCOs 94%
Figure: chromosomes 1 to 29 and X for the Maternal (yak) and Paternal (highland) haplotypes of Esperanza, shown with chromosomes 1 to 22 and X of GRCh38

Slide 15

Slide 15 text

Human Pan-Genome Project
Initiative to collect diverse, high-quality haplotypes with trio binning
• Illumina WGS for the parents, PacBio and Nanopore for the child
• Pilot 10 trios selected to maximize non-ref haplotype AF
Trios by population: 2 PUR, 1 KHV, 3 ACB, 1 MSL, 1 PJL, 1 GWD, 1 CLM
Trios by superpopulation: 5 African, 3 American, 1 East Asian, 1 South Asian
Population: http://www.internationalgenome.org/

Slide 16

Slide 16 text

What can you see from a phased assembly?
Koren and Rhie et al, De novo assembly of haplotype-resolved genomes with trio binning, Nat. Biotech (2018)

Slide 17

Slide 17 text

Phasing the MHC region Koren and Rhie et al, De novo assembly of haplotype-resolved genomes with trio binning, Nat. Biotech (2018)

Slide 18

Slide 18 text

No more short-read polishing!
• Adding coverage hurts quality!
• 4QV drop
• Quality estimate for haplotypes >99.99% (QV45)
• Indels are haplotype mixing, not tech
• Need a trio
• Illumina-polishing not necessary (and can hurt)
• Current diploid Arrow not enough (3QV drop)
Figure: Frameshift corrected protein-coding genes in bovids, for assemblies from 2013 to 2018 built from combinations of Illumina, 454, SOLiD, and PacBio data, ending with PacBio Trio
Annotation data courtesy of NCBI (Francoise Thibaud-Nissen)
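The QV figures on this slide use the standard Phred scale, QV = -10 · log10(error rate). The short Python sketch below converts between QV and per-base accuracy so the "4QV drop" and "QV45" statements can be read as error rates. The function names are hypothetical helpers, not part of any tool named in the talk.

```python
import math


def qv_to_error(qv):
    """Per-base error probability for a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_to_qv(error_rate):
    """Phred-scaled quality value for a per-base error probability."""
    return -10 * math.log10(error_rate)


# QV45 corresponds to roughly 3.2 errors per 100,000 bases.
print(f"QV45 accuracy: {1 - qv_to_error(45):.6%}")

# A 4 QV drop multiplies the error rate by 10 ** 0.4, about 2.5x.
print(f"Error ratio, QV41 vs QV45: {qv_to_error(41) / qv_to_error(45):.2f}")
```

On this scale, QV40 is exactly 99.99% accuracy, so the slide's ">99.99% (QV45)" is consistent: QV45 sits above that threshold.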

Slide 19

Slide 19 text

Summary
• Diploid assembly is solved by trios
  Trio binning is current best practice
  All levels of assembly quality improved
  Complete haplotypes will become the new norm
• A human pan-genome reference
  A collection of diverse, high-quality haplotypes
  Including complex heterozygous SVs

Slide 20

Slide 20 text

FALCON-Phase as an alternative?
• Investigating ways to improve for less het. genomes
Figure: FALCON-Phase compared with Trio-binning for HG002 (0.17), Angus x Brahman (0.93), and bTaeGut2 (1.2)
Kronenberg et al., FALCON-Phase: Integrating PacBio and Hi-C data for phased diploid genomes, BioRxiv (2018)

Slide 21

Slide 21 text

VGP GenomeArk: 1st data release
https://vgp.github.io/genomeark
Photo caption: Jennifer Vashon of the Maine Department of Inland Fisheries and Wildlife, left, and UMass lynx team coordinator Tanya Lama, with an adult male lynx from northern Maine whose DNA was used to create the first-ever whole genome for the species. The lynx has since been released to the wild. (MassWildlife photo / Bill Byrne)

Slide 22

Slide 22 text

Acknowledgements
genomeinformatics.github.io
• Adam Phillippy
• Sergey Koren
• Brian Walenz
• Alexander Dilthey
• Brian Ondov
• Jay Ghurye
Korean (AK1): Jeong-Sun Seo, Changhoon Kim, Junsoo Kim, Sangjin Lee
Cattle/pigs: Tim Smith, John Williams
Pan-Genome: Karen Miga, Benedict Paten
NIH NHGRI NISC
VGP Assembly Working Group: Erich Jarvis, Richard Durbin, Gene Myers, Kerstin Howe, Harris Lewin, Olivier Fedrigo, Shane McCarthy, Martin Pippel, Will Chow, Joana Damas
PacBio CCS: Michael Hunkapiller, Paul Peluso, David Rank
Trio binning is available in https://github.com/marbl/canu

Slide 24

Slide 24 text

Figure labels: Assembly Graph; Smashed haplotypes; Pseudo-haplotype + alts; Complete haplotypes
Koren and Rhie et al, De novo assembly of haplotype-resolved genomes with trio binning, Nat. Biotech (2018)

Slide 25

Slide 25 text

Trio-binning outperforms FALCON-Unzip
Primary = Longest path in the graph (pseudo-hap)
Alternate haplotigs = Alternate path in the bubble
Haplotigs = Contigs in each assembly agree with parental haplotypes (Phased)
Figure: two panels, TrioCanu and FALCON-Unzip, each plotting Angus specific k-mer counts against Brahman specific k-mer counts
Koren and Rhie et al, De novo assembly of haplotype-resolved genomes with trio binning, Nat. Biotech (2018)
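The axes of this comparison can be reproduced in principle by counting, for each contig, how many k-mers specific to each parental breed it contains. The Python sketch below is a simplified illustration with hypothetical names (`marker_counts`, and the toy marker sets), not the evaluation code used in the paper. A well-phased haplotig should score high on one axis and near zero on the other, while a contig that mixes haplotypes scores on both.

```python
def marker_counts(contig, hap_a_only, hap_b_only, k):
    """Count occurrences of haplotype-A-only and haplotype-B-only k-mers in a contig."""
    a = b = 0
    for i in range(len(contig) - k + 1):
        kmer = contig[i:i + k]
        if kmer in hap_a_only:
            a += 1
        elif kmer in hap_b_only:
            b += 1
    return a, b


# Toy marker sets standing in for Angus-specific and Brahman-specific k-mers.
angus_only = {"GTAC", "TACG"}
brahman_only = {"CGTT", "GTTC"}

for name, contig in [("phased", "CGTACGTAC"), ("mixed", "CGTACGTTC")]:
    print(name, marker_counts(contig, angus_only, brahman_only, k=4))
```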

Slide 26

Slide 26 text

Phasing NA12878
Figure: three panels, TrioCanu, FALCON-Unzip, and Supernova
Koren and Rhie et al, De novo assembly of haplotype-resolved genomes with trio binning, Nat. Biotech (2018)