The Graph Positional Burrows-Wheeler Transform

Adam Novak, Erik Garrison, Benedict Paten

Graph Genomes (Recap)
Represent a collection of genomes as a graph of nodes, each with two sides connected by edges
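
A minimal sketch of the two-sided node model described on this slide. The names (`Side`, `SequenceGraph`, `LEFT`, `RIGHT`) and the small SNP "bubble" are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from dataclasses import dataclass, field

LEFT, RIGHT = 0, 1  # the two sides of every node


@dataclass(frozen=True)
class Side:
    node_id: int
    end: int  # LEFT or RIGHT


@dataclass
class SequenceGraph:
    sequences: dict = field(default_factory=dict)  # node_id -> DNA string
    edges: set = field(default_factory=set)        # each edge joins two sides

    def add_node(self, node_id, sequence):
        self.sequences[node_id] = sequence

    def add_edge(self, a, b):
        # Edges are undirected pairs of sides, so store them as frozensets.
        self.edges.add(frozenset((a, b)))

    def neighbors(self, side):
        """Return the sides attached to `side` by an edge."""
        result = []
        for edge in self.edges:
            if side in edge:
                others = [s for s in edge if s != side]
                # An edge from a side to itself has no "other" side.
                result.append(others[0] if others else side)
        return result


# Hypothetical SNP bubble: G, then A or T, then C.
graph = SequenceGraph()
for node_id, seq in [(1, "G"), (2, "A"), (3, "T"), (4, "C")]:
    graph.add_node(node_id, seq)
graph.add_edge(Side(1, RIGHT), Side(2, LEFT))
graph.add_edge(Side(1, RIGHT), Side(3, LEFT))
graph.add_edge(Side(2, RIGHT), Side(4, LEFT))
graph.add_edge(Side(3, RIGHT), Side(4, LEFT))
```

Attaching edges to sides rather than to whole nodes lets the same structure express joins that enter a node from either end.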

Human Genome Variation Map
A next generation genomic reference that includes known variation from all human populations and provides consistent methods to represent complex genetic variation.
Gil McVean, Oxford

Storing Haplotypes in Genome Graphs
Track individual samples as threads through the graph reference

Problem: Threads are Unwieldy
Need to store 2 threads per sample, per chromosome: 115k threads in 1000 Genomes
Each thread visits ~1.3 million nodes on average
Impractical to scan ~150 billion visits for queries
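
The ~150 billion figure follows from the two numbers on the slide. A quick check, using the rounded values as given (the variable names are illustrative):

```python
threads = 115_000          # "115k threads in 1000 Genomes"
visits_per_thread = 1.3e6  # "~1.3 million nodes on average"

total_visits = threads * visits_per_thread
total_visits_billions = total_visits / 1e9  # roughly 150 billion visits
```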

The Positional Burrows-Wheeler Transform
Mechanism for storing large numbers of haplotypes in small numbers of bits
Represent haplotypes as sequences of bits, one per site: 0 = ref allele, 1 = alt allele
At each site, stably re-order haplotypes by allele at the previous site
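
The reordering rule above can be sketched in a few lines. This is a simplified illustration, not the authors' code. The function name `build_pbwt` and the three example haplotypes are made up. It builds only the reordered bit columns and leaves out the compression and auxiliary indexes a practical PBWT would add.

```python
def build_pbwt(haplotypes):
    """Return one reordered bit column per site.

    haplotypes: list of equal-length lists of 0/1 alleles
                (0 = ref allele, 1 = alt allele).
    """
    if not haplotypes:
        return []
    n_sites = len(haplotypes[0])
    order = list(range(len(haplotypes)))  # current haplotype ordering
    columns = []
    for site in range(n_sites):
        # Record the alleles at this site in the current order.
        column = [haplotypes[h][site] for h in order]
        columns.append(column)
        # Stable partition for the next site: haplotypes that had 0 first,
        # then those that had 1, keeping relative order within each group.
        zeros = [h for h, allele in zip(order, column) if allele == 0]
        ones = [h for h, allele in zip(order, column) if allele == 1]
        order = zeros + ones
    return columns


# Hypothetical input: three haplotypes over four sites.
example_haplotypes = [
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
]
example_columns = build_pbwt(example_haplotypes)
```

Because the partition is stable, haplotypes that agree over recent sites end up adjacent, which tends to produce long runs of identical bits in each column.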

PBWT Example
A->T  C->CC  G->C  AT->A
1
0
1

Put the haplotypes that had 0
A->T  C->CC  G->C  AT->A
1
0
1

Then the haplotypes that had 1
A->T  C->CC  G->C  AT->A
1
0
1

Fill in values at next site
A->T  C->CC  G->C  AT->A
1     0
0     1
1     1

Again, haplotypes that had 0
A->T  C->CC  G->C  AT->A
1     0
0     1
1     1

Then those that had 1
A->T  C->CC  G->C  AT->A
1     0
0     1
1     1

Fill in values again
A->T  C->CC  G->C  AT->A
1     0      1
0     1      1
1     1      0

Continue thusly
A->T  C->CC  G->C  AT->A
1     0      1
0     1      1
1     1      0

Complete PBWT
A->T  C->CC  G->C  AT->A
1     0      1     1
0     1      1     1
1     1      0     0

Group Visits by Node
Figure: the PBWT values regrouped under the node each haplotype visits. Nodes: A, T, C, CC, G, C, AT, A.

Store Edge to Next Node: gPBWT
Figure: at each node (A, T, C, CC, G, C, AT, A), each visit is labeled with the edge it takes next (#1 or #2), or "end" where the thread stops.

The Graph Positional Burrows-Wheeler Transform
Store visits at nodes
Order by order of incoming edges
Each entry says what edge to take next
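
The three rules above can be illustrated with a toy index. This sketch makes several simplifying assumptions that are not from the slides. Nodes are plain integers with a single orientation (no sides), an edge is identified by its destination node, threads that start at a node sort first, and incoming edges are ordered by predecessor node ID. The names `build_gpbwt`, `step`, and `extract` are hypothetical, and the quadratic-time construction is for clarity only.

```python
from collections import defaultdict

END = None  # marks a visit where the thread stops


def build_gpbwt(threads):
    """Map each node to its ordered list of 'next node' entries."""
    visits = defaultdict(list)
    for thread_id, path in enumerate(threads):
        for i, node in enumerate(path):
            # Sort key: nodes visited before this one, most recent first.
            # Sorting by it orders visits by incoming edge, then by their
            # order at the previous node; starting threads (empty key) lead.
            history = tuple(reversed(path[:i]))
            next_node = path[i + 1] if i + 1 < len(path) else END
            visits[node].append((history, thread_id, next_node))
    index = {}
    for node, entries in visits.items():
        entries.sort(key=lambda e: (e[0], e[1]))
        index[node] = [next_node for _, _, next_node in entries]
    return index


def step(index, node, i):
    """Follow visit i at `node`; return (next_node, visit_index) or None."""
    next_node = index[node][i]
    if next_node is END:
        return None
    total_arrivals = sum(col.count(next_node) for col in index.values())
    starts = len(index[next_node]) - total_arrivals
    # Visits arriving over incoming edges that sort before this one.
    earlier = sum(col.count(next_node)
                  for pred, col in index.items() if pred < node)
    # Rank of this visit among visits here that take the same edge.
    rank = index[node][:i].count(next_node)
    return next_node, starts + earlier + rank


def extract(index, node, i):
    """Recover the rest of a thread from visit i at `node`."""
    path = [node]
    position = step(index, node, i)
    while position is not None:
        node, i = position
        path.append(node)
        position = step(index, node, i)
    return path


# Hypothetical threads through a four-node bubble graph.
example_threads = [[1, 2, 4], [1, 3, 4], [1, 2, 4], [3, 4]]
example_index = build_gpbwt(example_threads)
```

Here `extract(example_index, 1, 1)` walks the second thread back out of the index using only the per-node entries, so the threads never need to be stored as separate lists.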

gPBWT Performance
Chr22, 1KG VCF-derived graph: 50,818,468 bp
All 5008 haplotypes stored
573 MB gPBWT: 0.018 bits per haplotype base
Sub-haplotype search on any graph path is linear in query size.
Figure legend: space used by gPBWT data; space used by graph itself
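
The bits-per-base figure can be reproduced from the other numbers on this slide. The check below assumes "MB" means 10^6 bytes, which the slide does not state (the variable names are illustrative):

```python
gpbwt_bytes = 573e6          # "573 MB", assumed to be decimal megabytes
haplotypes = 5008            # haplotypes stored
reference_bases = 50_818_468 # chr22 graph size in bp

haplotype_bases = haplotypes * reference_bases
bits_per_haplotype_base = gpbwt_bytes * 8 / haplotype_bases  # about 0.018
```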

First Application: Haplotype-Aware Read Mapping
~1% of primary / 2.5% of secondary mappings are not consistent with any 1KG haplotype

Open Questions/Hackathon Projects
How do we efficiently build this data structure in a cyclic graph?
How do we actually use haplotype path queries in alignment?
How do we attach this to a linked-data framework or RDF interface?

Thank you!
Hackathon Organizers, Glenn Hickey, Sean Blum, Maciek Smuga-Otto, David Haussler, GA4GH, 1000 Genomes Project
