GiST: A Generalized Search Tree for Database Systems

Slide 1

Slide 1 text

jmh - GiST 1/19/96, p 1 GiST: A Generalized Search Tree for Database Systems Joe Hellerstein UC Berkeley

Slide 2

Slide 2 text

Road Map
• Motivation
• Intuition on Generalized Search Trees
• Overview of GiST ADT
• Example indices: integers, polygons & sets
• Implementation challenges
• Open problems in indexing research

Slide 3

Slide 3 text

Indexing in OO/OR Systems
• Quick access to user-defined objects
• Support queries natural to the objects
• Two previous approaches
  – Specialized Indices (“ABCDEFG-trees”)
    » redundant code: most trees are very similar
    » concurrency control, etc. tricky!
  – Extensible B-trees & R-trees (Postgres/Illustra)
    » B-tree or R-tree lookups only!
    » E.g. WHERE movie.video < ‘Terminator 2’

Slide 4

Slide 4 text

A Third Approach
• A generalized search tree. Must be:
• Extensible in terms of queries
• General (B+-tree, R-tree, etc.)
• Easy to extend
• Efficient (match specialized trees)
• Highly concurrent, recoverable, etc.

Slide 5

Slide 5 text

Uses for GiSTs
• New indexes needed for new apps...
  – find all supersets of S
  – find all molecules that bind to M
  – your favorite query here (multimedia?)
• ...and for new queries over old domains:
  – find all points in region from 12 to 2 o’clock
  – find all strings that match R. E.

Slide 6

Slide 6 text

Database Search Trees from 50,000 feet

Slide 7

Slide 7 text

Database Search Trees from 50,000 feet

Slide 8

Slide 8 text

Database Search Trees from 40,000 feet
[Figure labels: Internal Nodes (directory); Leaf Nodes (linked list)]

Slide 9

Slide 9 text

Database Search Trees from 30,000 feet
[Figure labels: Internal Nodes (directory); Leaf Nodes (linked list); key1 key2 ...]

Slide 10

Slide 10 text

GiST: Generalized Search Tree
• Structure: balanced tree of (p, ptr) pairs
  – p is a key “predicate”
  – p holds for all objects below ptr
  – keys on a page may overlap
• Key predicates: a user-defined class
  – This is the only extensibility required!
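
As a minimal sketch of the structure on this slide, a node can be modelled as a list of (p, ptr) pairs. The names `Entry`, `Node`, `leaf_depths` and `is_balanced` are hypothetical and chosen only for illustration. The helper checks the "balanced" property by confirming that every leaf sits at the same depth.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Entry:
    p: Any    # key predicate (an instance of the user-defined key class)
    ptr: Any  # child Node in an internal node, or a data object in a leaf


@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)


def leaf_depths(node: Node, depth: int = 0) -> set:
    """Return the set of depths at which leaves occur below `node`."""
    if node.is_leaf:
        return {depth}
    depths = set()
    for e in node.entries:
        depths |= leaf_depths(e.ptr, depth + 1)
    return depths


def is_balanced(root: Node) -> bool:
    """A balanced tree has all of its leaves at a single depth."""
    return len(leaf_depths(root)) == 1
```

Nothing in `Node` or `Entry` depends on what a key looks like, which mirrors the slide's point that the key class is the only part a user has to supply.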

Slide 11

Slide 11 text

Key Methods
• Search:
  – Consistent(E,q): E.p ∧ q? (no/maybe)
• Characterization
  – Union(P): new key that holds for all tuples in P
• Categorization
  – Penalty(E1, E2): penalty of inserting E2 in subtree E1
  – PickSplit(P): split P into two groups of entries
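
One way to picture the four key methods is as an abstract base class that each user-defined key class fills in. The class name `GiSTKey` and the Python method names are assumptions made for this sketch; the slides only name the methods Consistent, Union, Penalty and PickSplit.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class GiSTKey(ABC):
    """Hypothetical base class for a user-defined key predicate."""

    @abstractmethod
    def consistent(self, q) -> bool:
        """Return False only when self AND q cannot both hold ("no").

        True means "maybe": the subtree must still be searched.
        """

    @classmethod
    @abstractmethod
    def union(cls, keys: List["GiSTKey"]) -> "GiSTKey":
        """Return a new key that holds for everything covered by `keys`."""

    @abstractmethod
    def penalty(self, new_key: "GiSTKey") -> float:
        """Return the cost of inserting `new_key` into the subtree under self."""

    @classmethod
    @abstractmethod
    def pick_split(
        cls, keys: List["GiSTKey"]
    ) -> Tuple[List["GiSTKey"], List["GiSTKey"]]:
        """Split `keys` into two groups of entries."""
```

Because every method is abstract, Python refuses to instantiate a key class until all four are defined.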

Slide 12

Slide 12 text

Search
• General technique:
  – traverse tree where Consistent is TRUE
• For range predicates on ordered domain:
  – user specifies IsOrdered
  – user registers Compare(p1, p2) operator
  – methods ensure ordered, non-overlapping keys
  – traverse leftmost Consistent branch
  – scan right across bottom.
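
The general technique can be sketched as a recursive descent that follows every entry whose key is Consistent with the query. The dict-based node layout and the `overlaps` predicate for half-open ranges are assumptions for illustration. The ordered-domain variant (leftmost branch, then scan right) is not shown.

```python
def search(node, q, consistent):
    """Return all leaf pointers whose key is Consistent with query q.

    `node` is a dict {"leaf": bool, "entries": [(p, ptr), ...]}
    (a hypothetical layout, not taken from the slides).
    """
    results = []
    for p, ptr in node["entries"]:
        if not consistent(p, q):
            continue  # "no": nothing below ptr can match
        if node["leaf"]:
            results.append(ptr)
        else:
            results.extend(search(ptr, q, consistent))
    return results


def overlaps(p, q):
    """Consistent for half-open ranges [x, y): True if p and q can intersect."""
    return p[0] < q[1] and p[1] > q[0]


leaf_a = {"leaf": True, "entries": [((1, 2), "r1"), ((5, 6), "r5")]}
leaf_b = {"leaf": True, "entries": [((10, 11), "r10"), ((20, 21), "r20")]}
root = {"leaf": False, "entries": [((1, 6), leaf_a), ((10, 21), leaf_b)]}

hits = search(root, (5, 11), overlaps)  # visits both leaves
```

Since keys on a page may overlap, the loop does not stop at the first consistent entry; it may descend into several subtrees for one query.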

Slide 13

Slide 13 text

Insert
• descend tree along least increase in Penalty
• if there’s room at leaf, insert there
• else split according to PickSplit
• propagate changes using Union
• Notes:
  – on overflow, can do R*-tree style reinsert
  – for ordered keys, Penalty needs to keep order
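
The insert steps can be sketched for half-open range keys as follows. The page capacity `MAX_ENTRIES`, the `Node` class, and the particular `union`, `penalty` and `pick_split` functions are assumptions for illustration. The sketch always splits on overflow and does not attempt the R*-tree style reinsert mentioned in the notes.

```python
MAX_ENTRIES = 4  # hypothetical page capacity


class Node:
    def __init__(self, leaf, entries=None):
        self.leaf = leaf
        self.entries = entries or []  # list of [key, ptr]


def union(keys):
    return (min(k[0] for k in keys), max(k[1] for k in keys))


def penalty(key, new):
    # How far `key` must grow on either side to cover `new`.
    return max(new[1] - key[1], 0) + max(key[0] - new[0], 0)


def pick_split(entries):
    ordered = sorted(entries, key=lambda e: e[0])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def _insert(node, key, ptr):
    """Insert below `node`; return two replacement nodes if it split."""
    if node.leaf:
        node.entries.append([key, ptr])
    else:
        # Descend along the entry with the least Penalty.
        best = min(node.entries, key=lambda e: penalty(e[0], key))
        split = _insert(best[1], key, ptr)
        if split is None:
            best[0] = union([best[0], key])  # propagate change using Union
        else:
            node.entries.remove(best)
            for n in split:
                node.entries.append([union([e[0] for e in n.entries]), n])
    if len(node.entries) <= MAX_ENTRIES:
        return None
    a, b = pick_split(node.entries)
    return Node(node.leaf, a), Node(node.leaf, b)


def insert(root, key, ptr):
    """Insert (key, ptr) and return the (possibly new) root."""
    split = _insert(root, key, ptr)
    if split is None:
        return root
    # The root itself split: grow the tree by one level.
    return Node(False, [[union([e[0] for e in n.entries]), n] for n in split])
```

A tree is built by starting from `Node(True)` and reassigning the root on each call, for example `root = insert(root, (3, 4), "r3")`. Splits create siblings at the same level and only a root split adds height, so the tree stays balanced.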

Slide 14

Slide 14 text

Delete
• find the entry via Search, and delete it
• propagate changes using Union
• on underflow:
  – if ordered keys, do B+-tree style borrow/coalesce
  – else reinsert stuff on page and delete page
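
A simplified sketch of the first two delete steps for half-open range keys is given below. The dict-based node layout and the helper names are assumptions for illustration. Underflow handling is deliberately reduced to dropping a page once it becomes empty; neither borrow/coalesce nor reinsertion is implemented here.

```python
def union(keys):
    return (min(k[0] for k in keys), max(k[1] for k in keys))


def consistent(p, q):
    """True if half-open ranges p and q can intersect."""
    return p[0] < q[1] and p[1] > q[0]


def delete(node, key, ptr):
    """Remove the leaf entry [key, ptr] below `node`; return True if found.

    `node` is a dict {"leaf": bool, "entries": [[p, ptr], ...]}
    (a hypothetical layout, not taken from the slides).
    """
    if node["leaf"]:
        for i, (k, p) in enumerate(node["entries"]):
            if k == key and p == ptr:
                del node["entries"][i]
                return True
        return False
    for i, entry in enumerate(node["entries"]):
        child = entry[1]
        # Only subtrees whose key is Consistent with `key` can hold the entry.
        if consistent(entry[0], key) and delete(child, key, ptr):
            if child["entries"]:
                # Propagate the change: tighten the parent key using Union.
                entry[0] = union([e[0] for e in child["entries"]])
            else:
                del node["entries"][i]  # simplification: drop the emptied page
            return True
    return False
```

Recomputing the parent key from the remaining children keeps the invariant that each key holds for everything below its pointer, and it can shrink keys that had become looser than necessary.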

Slide 15

Slide 15 text

GiSTs over ℤ (B+-trees)
• Logically, keys represent ranges [x,y)
• Queries: Contains([a,b), v)
• Consistent(E,q): (x < b) ∧ (y > a)
• Union(P): [MIN(xi), MAX(yi))
• Penalty(E1, E2):
  – return MAX(y2 - y1, 0) + MAX(x1 - x2, 0)
  – if E1 is leftmost or rightmost, drop a term
• PickSplit(P): split evenly in order
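
The B+-tree key methods can be written out directly for keys stored as `(x, y)` tuples meaning [x, y). The function names and the `leftmost`/`rightmost` flags are assumptions for this sketch. Which term is dropped is also an assumption: the leftmost key is treated as extending to -∞ (so growth on the left is free) and the rightmost key as extending to +∞.

```python
def consistent(key, query):
    """[x, y) and [a, b) intersect exactly when x < b and y > a."""
    (x, y), (a, b) = key, query
    return x < b and y > a


def union(keys):
    """[MIN(xi), MAX(yi)) over all keys."""
    return (min(x for x, _ in keys), max(y for _, y in keys))


def penalty(e1, e2, leftmost=False, rightmost=False):
    """How far e1 = [x1, y1) must grow to cover e2 = [x2, y2)."""
    (x1, y1), (x2, y2) = e1, e2
    grow_right = max(y2 - y1, 0)
    grow_left = max(x1 - x2, 0)
    if leftmost:
        return grow_right  # assumption: left bound is logically -infinity
    if rightmost:
        return grow_left   # assumption: right bound is logically +infinity
    return grow_right + grow_left


def pick_split(keys):
    """Split evenly in order."""
    ordered = sorted(keys)
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]
```

For example, `penalty((40, 60), (55, 70))` is 10 because only the upper bound has to move, from 60 to 70.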

Slide 16

Slide 16 text

Key Compression
• Keys may take up too much room on a page
• Two extra key methods:
  – Compress(E)/Decompress(E)
• Compression can be lossy: over-generalization OK

Slide 17

Slide 17 text

A B+-tree Page
Logical Representation: [-∞, 40) [40, 60) [60, 137) [137, 201) [201, ∞)
Physical Representation (compressed): 40 60 137 201

Slide 18

Slide 18 text

B+-tree Compression
• Compress(E=([x,y), ptr)):
  – if E is leftmost return NULL, else return x
• Decompress(E=(π, ptr)):
  – if E is leftmost, let x = -∞, else let x = π.
  – if E is rightmost, let y = ∞, else let y be the value stored in the next key on the right.
  – if E is rightmost on a leaf page, let y = x+1.
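
The rules above can be sketched at the level of a whole page, since Decompress needs to see the neighbouring key. The function names `compress_page` and `decompress_page` and the `is_leaf` flag are assumptions for illustration, and NULL is represented by `None`. The rules are applied literally as written on the slide, and the round trip is shown for an internal page.

```python
import math


def compress_page(keys):
    """keys: ordered list of (x, y) ranges. Store only x; leftmost becomes None."""
    return [None if i == 0 else x for i, (x, _) in enumerate(keys)]


def decompress_page(stored, is_leaf=False):
    """Rebuild the logical [x, y) ranges from the stored values."""
    keys = []
    last = len(stored) - 1
    for i, pi in enumerate(stored):
        x = -math.inf if i == 0 else pi
        if i < last:
            y = stored[i + 1]  # value stored in the next key on the right
        elif is_leaf:
            y = x + 1          # rightmost on a leaf page
        else:
            y = math.inf       # rightmost on an internal page
        keys.append((x, y))
    return keys


logical = [(-math.inf, 40), (40, 60), (60, 137), (137, 201), (201, math.inf)]
physical = compress_page(logical)
restored = decompress_page(physical)
```

Each upper bound is recovered from the next key's lower bound, which is why only one value per key has to be stored.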

Slide 19

Slide 19 text

GiSTs over ℝ² (R-tree)
• Logically, keys represent bounding boxes
• Queries: Contains, Overlaps, Equals
• Consistent(E,q): does E.p overlap q?
• Union(P): bounding box of all entries
• Compress(E): form bounding box
• Decompress(E): identity function
• Penalty(E,F): size(Union({E,F})) - size(E)
• PickSplit(P): R-tree or R*-tree methods
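
A minimal sketch of the R-tree key methods follows, assuming boxes are stored as `(xmin, ymin, xmax, ymax)` tuples and that `size` means area. Both the representation and the function names are illustrative choices. PickSplit is omitted because the slide defers it to the existing R-tree or R*-tree algorithms.

```python
def consistent(box, query):
    """True if the two bounding boxes overlap (touching edges count)."""
    return (
        box[0] <= query[2]
        and query[0] <= box[2]
        and box[1] <= query[3]
        and query[1] <= box[3]
    )


def union(boxes):
    """Smallest box that encloses every box in `boxes`."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def size(box):
    """Area of the box."""
    return (box[2] - box[0]) * (box[3] - box[1])


def penalty(e, f):
    """Area that e must gain in order to enclose f."""
    return size(union([e, f])) - size(e)
```

Inserting a box that already lies inside `e` costs nothing, so insertion prefers subtrees that do not need to grow.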

Slide 20

Slide 20 text

GiSTs over P(ℤ) (RD-tree)
• Logically, keys represent bounding sets
• Queries: Contains, Overlaps, Equals
• Consistent(E,q): does E.p ∩ q ≠ ∅?
• Union(P): set-union of keys
• Compress(E): Bloom filters, rangesets, etc.
• Decompress(E): match compress
• Penalty(E,F): |E.p ∪ F.p| - |E.p|
• PickSplit(P): R-tree algorithms
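
The set-valued methods map closely onto Python's `frozenset`. The sketch below covers Consistent for an Overlaps query, Union and Penalty; the function names and the sample course sets are illustrative assumptions. Compress/Decompress and PickSplit are left out.

```python
def consistent(key, query):
    """Overlaps query: True if the bounding set shares an element with q."""
    return bool(key & query)


def union(keys):
    """Set-union of all keys."""
    result = frozenset()
    for k in keys:
        result |= k
    return result


def penalty(e, f):
    """Number of elements e must gain to cover f: |e ∪ f| - |e|."""
    return len(e | f) - len(e)


children = [
    frozenset({"CS1", "CS11", "Math221"}),
    frozenset({"Music1", "Music2", "CS1"}),
]
parent = union(children)  # bounding set for both children
```

The bounding set plays the same role as a bounding box in an R-tree: it over-approximates what lies below, so a non-empty intersection only means the subtree might contain a match.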

Slide 21

Slide 21 text

An RD-tree
[Figure: RD-tree; each bounding set is shown with the sets it covers indented beneath it]
{CS1, CS11, Music1, Music2, Math221, Math22, Math223}
    {CS1, CS11, Math221}
    {Music1, Music2, CS1}
    {CS1, Math221, Math22, Math223}
{CS1, Bus101, Bus102, Bus103, Ec121, Ec122, Ec123}
    {Bus101, Bus102, Bus103, CS1}
    {Bus101, Ec121, Ec122, Ec123}
    {CS1, Bus101, Ec121}
{CS1, CS786, CS888, Math221, Music1, Music788}
    {Music1, CS1, Math221}
    {Music788, CS888, CS786}
    {CS1}

Slide 22

Slide 22 text

Implementation Issues
• In-memory efficiency: Node subclass
• Concurrency, Recovery, Consistency
  – Kornacker & Banks, VLDB95
• Variable-Length Keys
• Bulk Loading
• Optimizer Integration
• Extensibility & Efficiency

Slide 23

Slide 23 text

GiST Performance
• B+-trees have O(log n) performance
• R-trees, RD-trees have no such guarantee
  – search may have to traverse multiple paths
  – worst-case O(2n) to traverse entire tree
  – aggravated by random I/O: much worse than scan!
SO: when does it pay to build/use an index?

Slide 24

Slide 24 text

GiST Performance, cont.
• As a first cut, look at 2 parameters:
  – data overlap & compression loss
• Experiment with Illustra’s R-trees
  – Comb sets: {[1,10], [10001,10010], ...}
  – 30 data sets, each of 10,000 combs
  – vary data overlap, numranges (compression)
  – 5 queries per dataset, searching for comb teeth

Slide 25

Slide 25 text

GiST Performance, cont.
[Figure: Avg. Number of I/Os plotted against Data Overlap and Compression Loss]

Slide 26

Slide 26 text

Future Directions in Indexing
• Indexability theory:
  – when is an index useful? Papadimitriou?
• New things to index! Queries over:
  – sets, sequences/text (REs), graphs, multimedia, molecular structures...
• Lossy compression techniques
• Algorithmic improvements?
  – (R*-tree techniques?)

Slide 27

Slide 27 text

The Gist of the GiST
• Boil search trees down to their essence.
• Unify B+-tree, R-tree, etc. in one ADT.
• Extensible in terms of data and queries.
• Opens research on indexability.

Slide 28

Slide 28 text

Status
• Prototype implementation in Postgres95
  – currently no variable-length keys, concurrency
• Illustra/Informix port?
• General purpose C++ library planned
• Papers, etc. at:
  – http://www.cs.berkeley.edu/~jmh/