
NovoGraph slides DataScience2018 CSHL

Evan Biederstedt
November 09, 2018

Lightning talk, under 60 seconds
(I went over)


Transcript

  1. NovoGraph: Genome graph construction from multiple long-read de novo assemblies

    A genome graph representation of seven ethnically diverse whole human genomes
    Evan Biederstedt (@EBiederstedt) https://github.com/NCBI-Hackathons/NovoGraph
  2. Contributions of Genome Graphs: Success!

    • Improved alignment for hyperpolymorphic regions
    • Reduction of missed SNP calls by 5-fold
    • Improved genotyping accuracy in MHC and HLA
    • Enable genotyping of 1000s of additional variants, >50bp

    Tools: vg (variation graphs), Garrison et al, Nature Biotechnology 36 875–879 (2018); BayesTyper, Sibbesen et al, Nature Genetics 50 1054–105 (2018)

    Figure: Sequence homology between HLA-A, -B, & -C. Dilthey et al, PLoS Comput Biol. (2016) https://doi.org/10.1371/journal.pcbi.1005151

    Figure: Spatial recovery of kmers within MHC class II region. Dilthey et al, Nature Genetics 47 682–688 (2015)

    Also cited: Sibbesen et al (2018); Dilthey et al (2015); Dilthey et al (2016); Eggertsson et al (2017)
  3. Motivation

    “Ultra-long reads enabled assembly and phasing of the 4-Mb major histocompatibility complex (MHC) locus in its entirety, measurement of telomere repeat length, and closure of gaps in the reference human genome assembly GRCh38.” Jain et al, Nat Biotechnol. 2018 338–345.

    • Graph genome construction normally relies on VCFs derived from short-read data
    • But short-read sequencing has limited sensitivity for hypervariable and SV-rich regions
    • Long-read de novo assembly enables the assembly of complex sequences (Jain et al 2018) in a reference-bias-free way and the detection of SVs at high sensitivity (Sedlazeck et al 2018)

    Let’s use these assemblies to construct a graph genome!!!
  4. NovoGraph pipeline

    1. For each input contig, perform global pairwise alignment
    2. Compute global MSA between all input contigs and the reference
    3. Generate graph from global MSA, connecting contigs at homologous-identical positions

    Output VCF (graph topology):
    • Overlapping variant alleles
    • Non-overlapping variant alleles
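To make step 3 of the pipeline concrete, here is a minimal Python sketch of how a global MSA can be collapsed into "bubbles". It is not the NovoGraph implementation. The function name `msa_to_bubbles`, the toy sequences, and the rule that a column is an anchor only when every row carries the same non-gap base are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Bubble = Tuple[int, str, List[str]]  # (0-based ref position, ref allele, alt alleles)


def msa_to_bubbles(msa: Dict[str, str], ref_name: str) -> List[Bubble]:
    """Collapse a global MSA into bubbles between identical columns.

    Hypothetical helper: columns where all rows agree on a non-gap base
    are treated as anchors; each run of columns between two anchors
    becomes one bubble. Alleles may be empty strings here, whereas a real
    VCF would pad indels with a neighbouring anchor base.
    """
    rows = list(msa.values())
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all MSA rows must have the same length")
    ref = msa[ref_name]

    def is_anchor(col: int) -> bool:
        bases = {row[col] for row in rows}
        return len(bases) == 1 and "-" not in bases

    bubbles: List[Bubble] = []
    ref_pos = 0          # reference bases consumed before the current column
    start = None         # first column of the currently open bubble
    start_ref_pos = 0
    for col in range(width + 1):
        # The position just past the last column closes any open bubble.
        anchor = col == width or is_anchor(col)
        if not anchor and start is None:
            start, start_ref_pos = col, ref_pos
        elif anchor and start is not None:
            alleles = {row[start:col].replace("-", "") for row in rows}
            ref_allele = ref[start:col].replace("-", "")
            bubbles.append((start_ref_pos, ref_allele, sorted(alleles - {ref_allele})))
            start = None
        if col < width and ref[col] != "-":
            ref_pos += 1
    return bubbles


# Toy alignment: one SNP, one insertion and one deletion relative to "ref".
example = {
    "ref": "ACGT-ACGT",
    "c1":  "ACTTGACGT",
    "c2":  "ACGT-AC-T",
}
bubbles = msa_to_bubbles(example, "ref")
```

On the toy alignment this yields three bubbles: the G/T substitution, the inserted G, and the deleted G.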
  5. Graph genome: seven ethnically diverse human whole-genome assemblies

    • AK1, Korean
    • CHM1, European
    • CHM13, European
    • HG003, Ashkenazi
    • HG004, Ashkenazi
    • HX1, Han Chinese
    • NA19240

    • NovoGraph-Simple: genome graph constructed with overlapping variant alleles. VCF contains 33,309,666 "bubbles" (i.e. sites with multiple alternative alleles) representing 34,519,145 variant alleles.
    • NovoGraph-Universal: genome graph constructed with non-overlapping variant alleles. VCF contains 23,478,835 bubbles representing 30,582,795 variant alleles.
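As a quick reading aid for the two VCFs, the bubble and allele counts quoted on the slide can be turned into an average number of variant alleles per bubble. The short Python sketch below only does that arithmetic. The dictionary layout and variable names are illustrative, and the counts are copied from the slide.

```python
# Counts copied from the slide: (bubbles, variant alleles) per VCF.
counts = {
    "NovoGraph-Simple": (33_309_666, 34_519_145),
    "NovoGraph-Universal": (23_478_835, 30_582_795),
}

# Average number of variant alleles carried by each bubble.
alleles_per_bubble = {
    name: alleles / bubbles for name, (bubbles, alleles) in counts.items()
}

for name, ratio in alleles_per_bubble.items():
    print(f"{name}: {ratio:.3f} variant alleles per bubble")
```

The Universal VCF has fewer bubbles but more alleles per bubble (about 1.30 versus about 1.04).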
  6. Performance: vg mapping experiment

    • Mapped subsampled reads to our seven-genome graph vs. a genome graph constructed from GRCh38
    • Assessed alignment metrics: mean alignment scores, alignment identity scores, number of mapped reads
    • Improvement!

    Figure: IGV screenshot of HLA-B. High rates of polymorphism are observed around peptide-binding-site encoding exons 2 and 3.

    (Ask me how I indexed this…) https://github.com/vgteam/vg
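The three metrics named on this slide (mean alignment score, alignment identity, number of mapped reads) could be summarised per graph along the following lines. This is a sketch under stated assumptions, not the analysis behind the talk. The record layout (`mapped`, `score`, `identity` keys), the `summarize` helper, and the toy values are all hypothetical, and converting real vg output into such records is left out.

```python
from statistics import mean
from typing import Dict, List


def summarize(alignments: List[dict]) -> Dict[str, float]:
    """Summarise per-read alignment records (hypothetical layout).

    Each record is assumed to be a dict with a boolean "mapped" flag and,
    for mapped reads, a numeric "score" and an "identity" in [0, 1].
    """
    mapped = [a for a in alignments if a.get("mapped")]
    return {
        "mapped_reads": len(mapped),
        "mean_score": mean(a["score"] for a in mapped) if mapped else 0.0,
        "mean_identity": mean(a["identity"] for a in mapped) if mapped else 0.0,
    }


# Toy records only; these are not results from the talk.
toy = [
    {"mapped": True, "score": 100, "identity": 1.0},
    {"mapped": True, "score": 90, "identity": 0.9},
    {"mapped": False},
]
summary = summarize(toy)
```

Running the same summary on reads mapped to each graph would give the side-by-side comparison described above.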
  7. Come with questions to Poster Session I after these talks!

    https://f1000research.com/articles/7-1391/v1
  8. • Jeff Oliver (Arizona) • Nancy Hansen (NHGRI) • Aarti Jajoo (Baylor) • Nathan Dunn (LBNL) • Andrew Olson (CSHL)

    Ben Busby (NCBI); Alexander Dilthey (NHGRI, HHU/UKD)