The Graph Positional Burrows-Wheeler Transform

adamnovak
June 12, 2016

I describe a novel approach to storing large numbers of haplotypes in a graph genome reference.

Transcript

  1. Graph Genomes (Recap)

    Represent a collection of genomes as a graph of nodes, each with two sides connected by edges.
  2. Human Genome Variation Map

    A next generation genomic reference that includes known variation from all human populations and provides consistent methods to represent complex genetic variation. (Gil McVean, Oxford)
  3. Problem: Threads are Unwieldy

    Need to store 2 threads per sample, per chromosome: 115k threads in 1000 Genomes. Each thread visits ~1.3 million nodes on average. Impractical to scan ~150 billion visits for queries.
  4. The Positional Burrows-Wheeler Transform

    Mechanism for storing large numbers of haplotypes in small numbers of bits. Represent haplotypes as sequences of bits, one per site: 0 = ref allele, 1 = alt allele. At each site, stably re-order haplotypes by allele at the previous site.
  5. Group Visits by Node

    [Figure: example graph with node labels A T C CC G C AT A, with 0/1 haplotype bits grouped at the nodes they visit]
  6. Store Edge to Next Node: gPBWT

    [Figure: example graph with node labels A T C CC G C AT A; each visit is annotated with the edge it takes next (#1, #2, or end)]
  7. The Graph Positional Burrows-Wheeler Transform

    Store visits at nodes. Order visits by the order of their incoming edges. Each entry says what edge to take next.
  8. gPBWT Performance

    Chr22, 1KG VCF-derived graph: 50,818,468 bp. All 5008 haplotypes stored. 573 MB gPBWT: 0.018 bits per haplotype base. Sub-haplotype search on any graph path is linear in query size.

    [Chart: space used by gPBWT data vs. space used by graph itself]
  9. First Application: Haplotype-Aware Read Mapping

    ~1% of primary / 2.5% of secondary mappings are not consistent with any 1KG haplotype.
  10. Open Questions/Hackathon Projects

    How do we efficiently build this data structure in a cyclic graph? How do we actually use haplotype path queries in alignment? How do we attach this to a linked-data framework or RDF interface?
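
The re-ordering step on slide 4 can be illustrated with a small Python sketch. This is not the author's implementation: the function names `pbwt_orders` and `pbwt_columns` and the toy panel are assumptions made for illustration. Haplotypes are lists of 0/1 alleles, and at each site the current order is stably partitioned by the allele at the previous site, so haplotypes that agree on recent sites end up next to each other.

```python
def pbwt_orders(haplotypes):
    """Return, for each site, the haplotype indices in PBWT order.

    haplotypes: list of equal-length lists of 0/1 alleles (hypothetical input).
    orders[k] is the order used at site k; orders[0] is the input order.
    """
    if not haplotypes:
        return []
    n_sites = len(haplotypes[0])
    order = list(range(len(haplotypes)))
    orders = [order]
    for site in range(1, n_sites):
        prev = site - 1
        # Stable partition by the allele at the previous site:
        # haplotypes with 0 keep their relative order, then those with 1.
        zeros = [h for h in order if haplotypes[h][prev] == 0]
        ones = [h for h in order if haplotypes[h][prev] == 1]
        order = zeros + ones
        orders.append(order)
    return orders


def pbwt_columns(haplotypes):
    """Return the allele column stored at each site, read in PBWT order."""
    orders = pbwt_orders(haplotypes)
    return [[haplotypes[h][k] for h in order] for k, order in enumerate(orders)]


# Toy panel: 4 haplotypes, 3 sites.
panel = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
]
orders = pbwt_orders(panel)
columns = pbwt_columns(panel)
```

The re-ordered columns tend to contain long runs of identical bits when haplotypes are similar, which is what makes them cheap to compress.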
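
The idea on slide 7 (visits stored at nodes, ordered by incoming edge, each entry naming the next step) can be sketched in a heavily simplified form. The sketch below makes several assumptions that are not in the slides: the graph is a directed acyclic graph, whole nodes are used instead of node sides, incoming edges are ordered by the topological rank of the node they come from, and plain list scans stand in for the rank queries a compressed structure would use. The names `build_gpbwt` and `trace_thread` are hypothetical.

```python
from collections import defaultdict


def build_gpbwt(threads, topo_order):
    """Build per-node 'next node' arrays for threads on a DAG.

    threads: list of non-empty paths, each a list of node ids.
    topo_order: all node ids in a topological order of the graph.
    Returns (next_arrays, start_counts).
    """
    visits = defaultdict(list)       # node -> (thread id, step) in final order
    start_counts = defaultdict(int)  # node -> number of threads starting there
    # Threads that start at a node are placed first, in thread-id order.
    for tid, path in enumerate(threads):
        visits[path[0]].append((tid, 0))
        start_counts[path[0]] += 1
    next_arrays = {}
    for node in topo_order:
        entries = []
        for tid, step in visits[node]:
            path = threads[tid]
            if step + 1 < len(path):
                nxt = path[step + 1]
                # Arrivals keep the relative order they had at this node.
                visits[nxt].append((tid, step + 1))
                entries.append(nxt)
            else:
                entries.append(None)  # the thread ends at this node
        next_arrays[node] = entries
    return next_arrays, dict(start_counts)


def trace_thread(next_arrays, start_counts, topo_order, node, index):
    """Follow the visit at position `index` of `node` to the end of its thread."""
    rank = {n: r for r, n in enumerate(topo_order)}
    path = [node]
    while True:
        nxt = next_arrays[node][index]
        if nxt is None:
            return path
        # Position at nxt = threads starting at nxt
        #   + arrivals from predecessors ordered before this node
        #   + earlier visits at this node that also go to nxt.
        offset = start_counts.get(nxt, 0)
        for pred, entries in next_arrays.items():
            if rank[pred] < rank[node]:
                offset += entries.count(nxt)
        offset += next_arrays[node][:index].count(nxt)
        node, index = nxt, offset
        path.append(node)


# Toy DAG with edges 1->2, 1->3, 2->4, 3->4 and four threads.
topo = [1, 2, 3, 4]
threads = [[1, 2, 4], [1, 3, 4], [1, 2, 4], [3, 4]]
next_arrays, start_counts = build_gpbwt(threads, topo)
second_thread = trace_thread(next_arrays, start_counts, topo, 1, 1)
```

No thread identifiers are stored in `next_arrays`; a thread is recovered only from its starting node and its position in that node's array.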
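
The figures quoted on slides 3 and 8 can be checked with a few lines of arithmetic. The only assumption here is that "MB" means decimal megabytes (10^6 bytes); the variable names are illustrative.

```python
# Slide 3: total node visits if every thread is stored explicitly.
n_threads = 115_000           # "115k threads in 1000 Genomes"
visits_per_thread = 1.3e6     # "~1.3 million nodes on average"
total_visits = n_threads * visits_per_thread  # roughly 1.5e11, i.e. ~150 billion

# Slide 8: bits per haplotype base for the chr22 gPBWT.
n_haplotypes = 5008
graph_bases = 50_818_468
gpbwt_bytes = 573e6           # assumption: decimal megabytes
bits_per_base = gpbwt_bytes * 8 / (n_haplotypes * graph_bases)
```

Under that assumption, `bits_per_base` rounds to the 0.018 quoted on the slide.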