
NovoGraph presentation at BioIT19

Evan Biederstedt
April 17, 2019


Transcript

  1. Evan Biederstedt, Kravis Center for Molecular Oncology, MSKCC
     https://github.com/NCBI-Hackathons/NovoGraph
     @EBiederstedt #BioIT19
     NovoGraph: Genome graph construction from multiple long-read de novo assemblies
     A genome graph representation of seven ethnically diverse whole human genomes
  2. Standard NGS workflow: linear alignment, variant calling
     There are known problems with this approach.
     Eggertsson et al., Nat Genet. 2017; 49(11): 1654–1660
  3. Limitations of the “Classic” Linear Reference Genome
     • This approach fails for reads from hyperpolymorphic regions (e.g. MHC) or regions with large and/or complex SVs
     • An estimated >1% of the human genome is inaccessible with the classic approach. Dilthey et al., Nat Genet. 2015; 47(6): 682–688
     • The linear reference (GRCh38) is a single consensus haploid genome. This biases alignment towards the reference allele, results in inaccurate alignments, and skews genotyping accuracy.
     • Alignment is not aware of sequence variation between/within subpopulations
     • An African pan-genome of 910 individuals contains ~10% more DNA than the current human reference genome. Sherman et al., Nat Genet. 2019; 51: 30–35
  4. Genome Graph
     Nodes: sequences
     Edges: adjacencies between the sequences
     Paths: genomes, all haplotypes in the graph
     Ideally, the sequence of every genotyped genome is a traversal of the graph.
     Unified representation of multiple genomes from the same species
     Improved alignment and variant calling
     SequenceTubeMap: https://github.com/vgteam/sequenceTubeMap
     Garrison, 2018: https://github.com/ekg/thesis/releases/tag/v1.0.0
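The node/edge/path vocabulary on this slide can be made concrete with a minimal sketch. The `GenomeGraph` class, its method names, and the toy sequences below are illustrative assumptions, not part of NovoGraph or vg.

```python
from dataclasses import dataclass, field


@dataclass
class GenomeGraph:
    """Toy genome graph (hypothetical class, for illustration only)."""
    nodes: dict = field(default_factory=dict)  # node id -> sequence
    edges: set = field(default_factory=set)    # (from_id, to_id) adjacencies
    paths: dict = field(default_factory=dict)  # genome name -> list of node ids

    def add_path(self, name, node_ids):
        # A genome is only a valid traversal if every step follows an edge.
        for a, b in zip(node_ids, node_ids[1:]):
            if (a, b) not in self.edges:
                raise ValueError(f"no edge {a} -> {b}")
        self.paths[name] = list(node_ids)

    def spell(self, name):
        # Concatenate node sequences along the path to recover the genome.
        return "".join(self.nodes[n] for n in self.paths[name])


# A single SNP "bubble": nodes 2 and 3 are alternative alleles.
graph = GenomeGraph()
graph.nodes = {1: "ACG", 2: "T", 3: "C", 4: "GGA"}
graph.edges = {(1, 2), (1, 3), (2, 4), (3, 4)}
graph.add_path("ref", [1, 2, 4])
graph.add_path("alt", [1, 3, 4])

ref_seq = graph.spell("ref")
alt_seq = graph.spell("alt")
print(ref_seq, alt_seq)
```

Both haplotypes share nodes 1 and 4 and differ only in the middle node, which is the sense in which one graph gives a unified representation of several genomes.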
  5. Success! Contributions of Genome Graphs
     • Improved alignment for hyperpolymorphic regions
     • Reduction of missed SNP calls by 5-fold
     • Improved genotyping accuracy in MHC and HLA
     • Enable genotyping of 1000s of additional variants >50 bp
     Tools: Graphtyper (Eggertsson et al., 2017), BayesTyper (Sibbesen et al., 2018)
     Figures: Sequence homology between HLA-A, -B, & -C (Dilthey et al., PLoS Comput Biol. 2016, https://doi.org/10.1371/journal.pcbi.1005151); Spatial recovery of kmers within MHC class II region
     Eggertsson et al., Nat Genet. 2017; 49(11): 1654–1660
     Sibbesen et al., Nature Genetics 50, 1054–1059 (2018)
     Dilthey et al., Nature Genetics 47, 682–688 (2015)
  6. Success! Contributions of Genome Graphs
     • Improved mapping accuracy against human genome
     • Accurate SV haplotyping at scale
     vg (variation graphs): Garrison et al., Nature Biotechnology 36, 875–879 (2018)
     Graph Genome Suite: Rakocevic et al., Nature Genetics 51, 354–362 (2019)
  7. NovoGraph: Motivation
     “Ultra-long reads enabled assembly and phasing of the 4-Mb major histocompatibility complex (MHC) locus in its entirety, measurement of telomere repeat length, and closure of gaps in the reference human genome assembly GRCh38.” Jain et al., Nat Biotechnol. 2018; 36: 338–345
     • Graph genome construction normally relies on VCFs derived from short-read data
     • But short-read sequencing has limited sensitivity for hyper-variable and SV-rich regions
     • Long-read de novo assemblies enable the assembly of complex sequences (Jain et al., Nat Biotechnol. 2018) in a reference-bias-free way and detection of SVs at high sensitivity (Sedlazeck et al., Nat Methods 2018)
     • Use these assemblies to construct a graph genome!
     NovoGraph encompasses a wide spectrum of genetic variation accessible to long-read-based de novo assemblies, e.g. divergent haplotypes and large-scale SVs (at base-pair resolution)
  8. Graph genome: seven ethnically diverse human whole-genome assemblies
     • AK1, Korean
     • CHM1, European
     • CHM13, European
     • HG003, Ashkenazi
     • HG004, Ashkenazi
     • HX1, Han Chinese
     • NA19240, Yoruba
     Long-read de novo assemblies (Thank you GIAB and others!)
  9. NovoGraph pipeline
     1. For each input contig, perform global pairwise alignment
     2. Compute global MSA between all input contigs and the reference
     3. Generate graph from global MSA, connecting contigs at homologous-identical positions
     Output: VCF (graph topology)
     • Overlapping variant alleles
     • Non-overlapping variant alleles
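Step 3 of the pipeline above (graph from a global MSA) can be illustrated on a toy alignment. This is a simplified sketch, not NovoGraph's implementation: the function `msa_to_graph`, the `(column, base)` node encoding, and the three aligned rows are all assumptions made for the example.

```python
def msa_to_graph(msa):
    """Collapse a global MSA into a graph.

    msa: dict of name -> aligned string of equal length, '-' marks a gap.
    Rows that carry the same base in the same column share one node, so
    sequences are connected at homologous, identical positions.
    """
    if len({len(row) for row in msa.values()}) != 1:
        raise ValueError("all MSA rows must have the same length")

    nodes, edges = set(), set()
    for row in msa.values():
        prev = None
        for col, base in enumerate(row):
            if base == "-":
                continue  # gaps contribute no node; the edge skips over them
            node = (col, base)
            nodes.add(node)
            if prev is not None:
                edges.add((prev, node))
            prev = node
    return nodes, edges


# Hypothetical toy alignment: one SNP/deletion site and one insertion site.
toy_msa = {
    "ref":     "ACGT-GA",
    "contig1": "ACCTTGA",
    "contig2": "AC-T-GA",
}
nodes, edges = msa_to_graph(toy_msa)
print(len(nodes), "nodes,", len(edges), "edges")
```

The deletion in `contig2` shows up as an edge that jumps from column 1 straight to column 3, while the two alleles in column 2 become parallel nodes.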
  10. Genome Graph generation
      NovoGraph-Universal: genome graph constructed with non-overlapping variant alleles
      From the global MSA, all unique paths of the genome graph are enumerated and written to output.
      “Extend” paths until the deviation from the reference (gray) terminates, then “flush”, i.e. write to output.
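One possible reading of the extend-and-flush idea, sketched on a toy MSA: scan the columns, keep extending a block while any sequence deviates from the reference, and write the block out once all sequences agree again. The function name, the tuple layout, and the input rows are assumptions for illustration; the actual NovoGraph code may differ substantially.

```python
def flush_variant_blocks(ref_row, other_rows):
    """Return non-overlapping variant blocks as (start_col, ref_allele, alt_alleles).

    Assumes all rows are aligned strings of equal length with '-' as the gap.
    An empty-string allele means the sequence has no bases in that block.
    """
    if any(len(r) != len(ref_row) for r in other_rows):
        raise ValueError("all rows must have the same length")

    blocks = []
    start = None
    # Iterate one column past the end so a trailing block is flushed too.
    for col in range(len(ref_row) + 1):
        differs = col < len(ref_row) and any(
            r[col] != ref_row[col] for r in other_rows
        )
        if differs and start is None:
            start = col                      # deviation begins: start extending
        elif not differs and start is not None:
            ref_allele = ref_row[start:col].replace("-", "")
            alts = {r[start:col].replace("-", "") for r in other_rows}
            blocks.append((start, ref_allele, sorted(alts - {ref_allele})))
            start = None                     # deviation ended: block flushed
    return blocks


blocks = flush_variant_blocks("ACGT-GA", ["ACCTTGA", "AC-T-GA"])
print(blocks)
```

A real VCF writer would also anchor insertions and deletions on the preceding reference base; that step is left out here to keep the sketch short.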
  11. Results
      • NovoGraph-Simple: genome graph constructed with overlapping variant alleles. VCF contains 33,309,666 "bubbles" (i.e. sites with multiple alternative alleles) representing 34,519,145 variant alleles.
      • NovoGraph-Universal: genome graph constructed with non-overlapping variant alleles. VCF contains 23,478,835 bubbles representing 30,582,795 variant alleles.
      IGV screenshot of HLA-B: high rates of polymorphism are observed around the peptide-binding-site-encoding exons 2 and 3.
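The bubble and allele counts reported above can be turned into an average number of variant alleles per bubble. The short calculation below uses only the figures from this slide; the dictionary layout and variable names are illustrative.

```python
# Counts copied from the slide: (bubbles, variant alleles) per graph.
results = {
    "NovoGraph-Simple": (33_309_666, 34_519_145),
    "NovoGraph-Universal": (23_478_835, 30_582_795),
}

alleles_per_bubble = {
    name: alleles / bubbles for name, (bubbles, alleles) in results.items()
}

for name, ratio in alleles_per_bubble.items():
    print(f"{name}: {ratio:.3f} variant alleles per bubble")
```

The ratios come out at roughly 1.04 for NovoGraph-Simple and 1.30 for NovoGraph-Universal, so the Universal VCF has fewer sites but more alleles at each site on average.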
  12. Performance: vg mapping experiment (https://github.com/vgteam/vg)
      • Mapped subsampled reads to our 7-human genome graph vs. a genome graph constructed from GRCh38
      • Assessed alignment metrics: mean alignment scores, alignment identity scores, number of mapped reads
      • Improvement!
      As expected, mapping against the genome graph increases mean alignment scores and alignment identities, albeit at a small reduction in the number of mapped reads.

                            Genome Graph   Reference Genome
      Mean Scores           108.859        108.100
      Mean Identity Value   0.9913         0.9891
      Total Mapped Reads    31125004       31138410
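To put the "small reduction" in mapped reads in proportion, the differences between the two columns can be computed directly. The values are those reported on the slide; the variable names are illustrative.

```python
# Metrics copied from the slide.
graph_run = {"mean_score": 108.859, "mean_identity": 0.9913, "mapped_reads": 31125004}
linear_run = {"mean_score": 108.100, "mean_identity": 0.9891, "mapped_reads": 31138410}

score_gain = graph_run["mean_score"] - linear_run["mean_score"]
identity_gain = graph_run["mean_identity"] - linear_run["mean_identity"]
reads_lost = linear_run["mapped_reads"] - graph_run["mapped_reads"]
reads_lost_fraction = reads_lost / linear_run["mapped_reads"]

print(f"mean score gain:    {score_gain:.3f}")
print(f"mean identity gain: {identity_gain:.4f}")
print(f"mapped reads lost:  {reads_lost} ({reads_lost_fraction:.4%})")
```

The genome graph maps 13,406 fewer reads, about 0.04% of the reference-genome total, while the mean score rises by about 0.76 and the mean identity by about 0.002.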
  13. • Jeff Oliver (Arizona)
      • Nancy Hansen (NHGRI)
      • Aarti Jajoo (Baylor)
      • Nathan Dunn (LBNL)
      • Andrew Olson (CSHL)
      Ben Busby (NCBI)
      Alexander Dilthey (NHGRI, HHU/UKD)