GiST: A Generalized Search Tree for Database Systems

jmh - GiST 1/19/96, p 1
GiST: A Generalized Search Tree for Database Systems
Joe Hellerstein, UC Berkeley

Road Map
• Motivation
• Intuition on Generalized Search Trees
• Overview of GiST ADT
• Example indices: integers, polygons & sets
• Implementation challenges
• Open problems in indexing research

Indexing in OO/OR Systems
• Quick access to user-defined objects
• Support queries natural to the objects
• Two previous approaches
  – Specialized Indices (“ABCDEFG-trees”)
    » redundant code: most trees are very similar
    » concurrency control, etc. tricky!
  – Extensible B-trees & R-trees (Postgres/Illustra)
    » B-tree or R-tree lookups only!
    » E.g. WHERE movie.video < ‘Terminator 2’

A Third Approach
• A generalized search tree. Must be:
• Extensible in terms of queries
• General (B+-tree, R-tree, etc.)
• Easy to extend
• Efficient (match specialized trees)
• Highly concurrent, recoverable, etc.

Uses for GiSTs
• New indexes needed for new apps...
  – find all supersets of S
  – find all molecules that bind to M
  – your favorite query here (multimedia?)
• ...and for new queries over old domains:
  – find all points in region from 12 to 2 o’clock
  – find all strings that match R.E.

Database Search Trees from 50,000 feet
[Figures: the same search tree drawn at 50,000, 40,000, and 30,000 feet]
• 40,000 feet: Internal Nodes (directory), Leaf Nodes (linked list)
• 30,000 feet: Internal Nodes (directory), Leaf Nodes (linked list), node contents key1 key2 ...

GiST: Generalized Search Tree
• Structure: balanced tree of (p, ptr) pairs
  – p is a key “predicate”
  – p holds for all objects below ptr
  – keys on a page may overlap
• Key predicates: a user-defined class
  – This is the only extensibility required!

Key Methods
• Search:
  – Consistent(E, q): E.p ∧ q? (no/maybe)
• Characterization:
  – Union(P): new key that holds for all tuples in P
• Categorization:
  – Penalty(E1, E2): penalty of inserting E2 in subtree E1
  – PickSplit(P): split P into two groups of entries
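The four key methods can be read as an abstract class that each index type subclasses. The following is only a sketch in Python, not the author's implementation: the names `GiSTKey` and `AnyKey` are hypothetical, and `AnyKey` is a deliberately degenerate demo key (the predicate TRUE) that is correct but never prunes anything.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Tuple


class GiSTKey(ABC):
    """Hypothetical base class for a key predicate p."""

    @abstractmethod
    def consistent(self, q: Any) -> bool:
        """Search: False means 'no', True only means 'maybe'."""

    @classmethod
    @abstractmethod
    def union(cls, keys: List["GiSTKey"]) -> "GiSTKey":
        """Characterization: a key that holds for everything below `keys`."""

    @abstractmethod
    def penalty(self, new: "GiSTKey") -> float:
        """Categorization: cost of inserting `new` in the subtree under self."""

    @classmethod
    @abstractmethod
    def pick_split(cls, keys: List["GiSTKey"]) -> Tuple[List["GiSTKey"], List["GiSTKey"]]:
        """Categorization: split the keys of a full page into two groups."""


class AnyKey(GiSTKey):
    """Degenerate demo key: the predicate TRUE, which holds for every object."""

    def consistent(self, q: Any) -> bool:
        return True  # never answers 'no', so every branch gets visited

    @classmethod
    def union(cls, keys):
        return cls()  # TRUE covers any group of TRUE keys

    def penalty(self, new):
        return 0.0  # inserting anywhere costs the same

    @classmethod
    def pick_split(cls, keys):
        mid = len(keys) // 2
        return keys[:mid], keys[mid:]
```

A real key class sits between this extreme and an exact description of the data: the tighter the predicate, the more often Consistent can answer "no".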

Search
• General technique:
  – traverse tree where Consistent is TRUE
• For range predicates on ordered domain:
  – user specifies IsOrdered
  – user registers Compare(p1, p2) operator
  – methods ensure ordered, non-overlapping keys
  – traverse leftmost Consistent branch
  – scan right across bottom.
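The general technique (follow every branch where Consistent says "maybe") can be sketched in a few lines of Python. The `Node` layout, the use of closed integer ranges as predicates, and the record labels are assumptions made for the demo; the ordered-domain shortcut is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Range = Tuple[int, int]  # closed interval [lo, hi], chosen only for the demo


@dataclass
class Node:
    is_leaf: bool
    # Each entry is (predicate, ptr); ptr is a child Node, or a record at a leaf.
    entries: List[Tuple[Range, Union["Node", str]]] = field(default_factory=list)


def consistent(p: Range, q: Range) -> bool:
    """'Maybe' when the two ranges overlap, 'no' otherwise."""
    return p[0] <= q[1] and q[0] <= p[1]


def search(node: Node, q: Range) -> List[str]:
    """Visit every subtree whose key is consistent with the query q."""
    hits: List[str] = []
    for p, ptr in node.entries:
        if not consistent(p, q):
            continue  # a 'no' prunes the whole subtree
        if node.is_leaf:
            hits.append(ptr)
        else:
            hits.extend(search(ptr, q))
    return hits


leaf_a = Node(True, [((1, 1), "r1"), ((5, 5), "r5")])
leaf_b = Node(True, [((4, 4), "r4"), ((9, 9), "r9")])
root = Node(False, [((1, 5), leaf_a), ((4, 9), leaf_b)])  # keys on a page may overlap

print(search(root, (4, 6)))  # both branches are visited: ['r5', 'r4']
print(search(root, (6, 8)))  # one 'maybe' branch is visited but yields nothing: []
```

The second query shows why Consistent only promises "maybe": the key (4, 9) overlaps the query, yet no record below it does.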

Insert
• descend tree along least increase in Penalty
• if there’s room at leaf, insert there
• else split according to PickSplit
• propagate changes using Union
• Notes:
  – on overflow, can do R*-tree style reinsert
  – for ordered keys, Penalty needs to keep order
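The insert steps above might look roughly as follows in Python. This is a sketch under stated assumptions, not the GiST implementation: keys are closed integer ranges, `MAX_ENTRIES` is a made-up page capacity, PickSplit simply sorts and halves, and R*-tree style reinsertion is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # closed interval, an assumption for the demo
MAX_ENTRIES = 4          # hypothetical page capacity


@dataclass
class Node:
    is_leaf: bool
    entries: list = field(default_factory=list)  # (key, child-or-record) pairs


def union(keys: List[Range]) -> Range:
    return (min(k[0] for k in keys), max(k[1] for k in keys))


def penalty(key: Range, new: Range) -> int:
    """How much `key` must grow in length to cover `new`."""
    u = union([key, new])
    return (u[1] - u[0]) - (key[1] - key[0])


def pick_split(entries: list) -> Tuple[list, list]:
    """Simplest possible policy: order by key and cut in half."""
    s = sorted(entries, key=lambda e: e[0])
    return s[: len(s) // 2], s[len(s) // 2:]


def insert(node: Node, key: Range, record) -> Optional[Node]:
    """Insert below `node`; return a new sibling if `node` had to split."""
    if node.is_leaf:
        node.entries.append((key, record))
    else:
        # Descend along the entry with the least penalty.
        i = min(range(len(node.entries)),
                key=lambda j: penalty(node.entries[j][0], key))
        child = node.entries[i][1]
        sibling = insert(child, key, record)
        # Propagate the change upward using Union.
        node.entries[i] = (union([k for k, _ in child.entries]), child)
        if sibling is not None:
            node.entries.append((union([k for k, _ in sibling.entries]), sibling))
    if len(node.entries) <= MAX_ENTRIES:
        return None
    left, right = pick_split(node.entries)
    node.entries = left
    return Node(node.is_leaf, right)


def insert_root(root: Node, key: Range, record) -> Node:
    """Add a level when the root itself splits, keeping the tree balanced."""
    sibling = insert(root, key, record)
    if sibling is None:
        return root
    return Node(False, [(union([k for k, _ in n.entries]), n) for n in (root, sibling)])


root = Node(is_leaf=True)
for v in [5, 1, 9, 3, 7, 2, 8]:
    root = insert_root(root, (v, v), f"r{v}")
print([k for k, _ in root.entries])  # [(1, 3), (5, 9)]
```

Because splits propagate upward and only the root adds a level, every leaf stays at the same depth.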

Delete
• find the entry via Search, and delete it
• propagate changes using Union
• on underflow:
  – if ordered keys, do B+-tree style borrow/coalesce
  – else reinsert stuff on page and delete page

GiSTs over ℤ (B+-trees)
• Logically, keys represent ranges [x, y)
• Queries: Contains([a, b), v)
• Consistent(E, q): (x < b) ∧ (y > a)
• Union(P): [MIN(x_i), MAX(y_i))
• Penalty(E1, E2):
  – return MAX(y2 - y1, 0) + MAX(x1 - x2, 0)
  – if E1 is leftmost or rightmost, drop a term
• PickSplit(P): split evenly in order
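The B+-tree key methods translate almost directly into code. In this Python sketch the function names and the `leftmost`/`rightmost` flags are assumptions; in particular, which term is dropped follows the reading that the leftmost key logically extends to -∞ (so growth on the left costs nothing) and the rightmost to +∞.

```python
from typing import List, Tuple

Range = Tuple[float, float]  # half-open range [x, y)


def consistent(key: Range, query: Range) -> bool:
    """Contains([a,b), v) may hold below [x,y) iff (x < b) and (y > a)."""
    (x, y), (a, b) = key, query
    return x < b and y > a


def union(keys: List[Range]) -> Range:
    """[MIN(x_i), MAX(y_i))"""
    return (min(x for x, _ in keys), max(y for _, y in keys))


def penalty(e1: Range, e2: Range, leftmost: bool = False, rightmost: bool = False) -> float:
    (x1, y1), (x2, y2) = e1, e2
    grow_right = max(y2 - y1, 0)
    grow_left = max(x1 - x2, 0)
    if leftmost:
        grow_left = 0   # assumed: leftmost key already reaches -infinity
    if rightmost:
        grow_right = 0  # assumed: rightmost key already reaches +infinity
    return grow_right + grow_left


def pick_split(keys: List[Range]) -> Tuple[List[Range], List[Range]]:
    """Split evenly in order."""
    ordered = sorted(keys)
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


page = [(60, 137), (40, 60), (137, 201)]
print(consistent((40, 60), (60, 70)))      # False: 60 is outside [40, 60)
print(consistent((60, 137), (100, 150)))   # True
print(union(page))                         # (40, 201)
print(penalty((60, 137), (130, 150)))      # 13
print(pick_split(page))                    # ([(40, 60)], [(60, 137), (137, 201)])
```

The half-open convention is what keeps adjacent keys such as [40, 60) and [60, 137) from both answering "maybe" for the same value.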

Key Compression
• Keys may take up too much room on a page
• Two extra key methods:
  – Compress(E)/Decompress(E)
• Compression can be lossy: over-generalization OK
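One way to see why lossy compression is acceptable: an over-general key can only turn a "no" into a "maybe", never the reverse. The Python sketch below illustrates this with a made-up scheme (not one from the talk) that compresses a set of integers down to its minimum and maximum; all function names are hypothetical.

```python
from typing import FrozenSet, Tuple


def compress(key: FrozenSet[int]) -> Tuple[int, int]:
    """Lossy: keep only the smallest and largest member of the set."""
    return (min(key), max(key))


def consistent_exact(key: FrozenSet[int], v: int) -> bool:
    """Membership test against the uncompressed key."""
    return v in key


def consistent_lossy(stored: Tuple[int, int], v: int) -> bool:
    """Membership test against the over-general range predicate."""
    lo, hi = stored
    return lo <= v <= hi


key = frozenset({3, 7, 20})
stored = compress(key)  # (3, 20)

# Every true hit is still a 'maybe' after compression ...
print(all(consistent_lossy(stored, v) for v in key))            # True
# ... but some misses become 'maybe' too, which costs extra I/O, not correctness.
print(consistent_exact(key, 10), consistent_lossy(stored, 10))  # False True
print(consistent_lossy(stored, 25))                             # False
```

The price of the smaller key is the false "maybe" for 10, which is the compression loss that the performance slides measure.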

A B+-tree Page
Logical Representation:               [-∞, 40)  [40, 60)  [60, 137)  [137, 201)  [201, ∞)
Physical Representation (compressed): <null>    40        60         137         201

B+-tree Compression
• Compress(E = ([x, y), ptr)):
  – if E is leftmost return NULL, else return x
• Decompress(E = (π, ptr)):
  – if E is leftmost, let x = -∞, else let x = π.
  – if E is rightmost, let y = ∞, else let y be the value stored in the next key on the right.
  – if E is rightmost on a leaf page, let y = x+1.
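A round trip through these rules can be sketched for a single internal page. The page-level functions `compress_page` and `decompress_page` are assumptions made for brevity (the slide defines the methods per entry), Python's `None` stands in for NULL, and the leaf-page rule y = x+1 is not covered.

```python
import math
from typing import List, Optional, Tuple

Range = Tuple[float, float]  # half-open range [x, y)


def compress_page(ranges: List[Range]) -> List[Optional[float]]:
    """Leftmost entry stores NULL (None); every other entry stores only x."""
    return [None if i == 0 else x for i, (x, _) in enumerate(ranges)]


def decompress_page(stored: List[Optional[float]]) -> List[Range]:
    """Rebuild [x, y) for each entry of an internal page."""
    out: List[Range] = []
    last = len(stored) - 1
    for i, pi in enumerate(stored):
        x = -math.inf if i == 0 else pi
        # y is the next key on the right, or infinity for the rightmost entry.
        y = math.inf if i == last else stored[i + 1]
        out.append((x, y))
    return out


logical = [(-math.inf, 40), (40, 60), (60, 137), (137, 201), (201, math.inf)]
physical = compress_page(logical)
print(physical)                              # [None, 40, 60, 137, 201]
print(decompress_page(physical) == logical)  # True
```

For ordered, non-overlapping keys this compression loses nothing: each upper bound is recoverable from the neighbouring entry.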

GiSTs over ℝ² (R-tree)
• Logically, keys represent bounding boxes
• Queries: Contains, Overlaps, Equals
• Consistent(E, q): does E.p overlap q?
• Union(P): bounding box of all entries
• Compress(E): form bounding box
• Decompress(E): identity function
• Penalty(E, F): size(Union({E, F})) - size(E)
• PickSplit(P): R-tree or R*-tree methods
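For bounding boxes, Consistent, Union and Penalty are each a line or two. The Python sketch below assumes axis-aligned boxes stored as `(xmin, ymin, xmax, ymax)` tuples and takes "size" to mean area; PickSplit is omitted because the R-tree and R*-tree split algorithms are considerably longer.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), an assumed layout


def consistent(key: Box, q: Box) -> bool:
    """Does the key's bounding box overlap the query box?"""
    return (key[0] <= q[2] and q[0] <= key[2] and
            key[1] <= q[3] and q[1] <= key[3])


def union(boxes: List[Box]) -> Box:
    """Bounding box of all entries."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def size(b: Box) -> float:
    """Area of the box (the assumed meaning of 'size')."""
    return (b[2] - b[0]) * (b[3] - b[1])


def penalty(e: Box, f: Box) -> float:
    """size(Union({E, F})) - size(E): how much E must grow to cover F."""
    return size(union([e, f])) - size(e)


e = (0, 0, 2, 2)
print(consistent(e, (1, 1, 3, 3)))  # True
print(consistent(e, (5, 5, 6, 6)))  # False
print(union([e, (1, 1, 3, 3)]))     # (0, 0, 3, 3)
print(penalty(e, (1, 1, 3, 3)))     # 5
print(penalty(e, (1, 1, 2, 2)))     # 0
```

A penalty of zero means the new box already fits inside the existing key, so the insert does not widen any predicate on the way down.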

GiSTs over P(ℤ) (RD-tree)
• Logically, keys represent bounding sets
• Queries: Contains, Overlaps, Equals
• Consistent(E, q): does E.p ∩ q ≠ ∅?
• Union(P): set-union of keys
• Compress(E): Bloom filters, rangesets, etc.
• Decompress(E): match compress
• Penalty(E, F): |E.p ∪ F.p| - |E.p|
• PickSplit(P): R-tree algorithms

An RD-tree
{CS1, CS11, Music1, Music2, Math221, Math22, Math223}
  {CS1, CS11, Math221}
  {Music1, Music2, CS1}
  {CS1, Math221, Math22, Math223}
{CS1, Bus101, Bus102, Bus103, Ec121, Ec122, Ec123}
  {Bus101, Bus102, Bus103, CS1}
  {Bus101, Ec121, Ec122, Ec123}
  {CS1, Bus101, Ec121}
{CS1, CS786, CS888, Math221, Music1, Music788}
  {Music1, CS1, Math221}
  {Music788, CS888, CS786}
  {CS1}
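The RD-tree key methods map onto ordinary set operations. The Python sketch below uses `frozenset` and three of the course sets from the tree above; the function names are assumptions, Consistent is written for an Overlaps query (a non-empty intersection means "maybe"), and compression and PickSplit are left out.

```python
from typing import FrozenSet, List

Key = FrozenSet[str]


def consistent(key: Key, q: Key) -> bool:
    """'Maybe' for an Overlaps query when the bounding set shares a member with q."""
    return not key.isdisjoint(q)


def union(keys: List[Key]) -> Key:
    """Set-union of keys."""
    return frozenset().union(*keys)


def penalty(e: Key, f: Key) -> int:
    """|E.p ∪ F.p| - |E.p|: how many new elements E must absorb."""
    return len(e | f) - len(e)


leaves = [
    frozenset({"CS1", "CS11", "Math221"}),
    frozenset({"Music1", "Music2", "CS1"}),
    frozenset({"CS1", "Math221", "Math22", "Math223"}),
]
directory_key = union(leaves)
print(sorted(directory_key))
print(consistent(directory_key, frozenset({"Bus101"})))           # False: prune this subtree
print(consistent(directory_key, frozenset({"Music2", "Ec121"})))  # True: descend
print(penalty(leaves[0], frozenset({"CS1", "Music1"})))           # 1
```

The union of the three leaf sets reproduces the seven-course bounding set drawn above them.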

Implementation Issues
• In-memory efficiency: Node subclass
• Concurrency, Recovery, Consistency
  – Kornacker & Banks, VLDB95
• Variable-Length Keys
• Bulk Loading
• Optimizer Integration
• Extensibility & Efficiency

GiST Performance
• B+-trees have O(log n) performance
• R-trees, RD-trees have no such guarantee
  – search may have to traverse multiple paths
  – worst-case O(2n) to traverse entire tree
  – aggravated by random I/O: much worse than scan!
SO: when does it pay to build/use an index?

GiST Performance, cont.
• As a first cut, look at 2 parameters:
  – data overlap & compression loss
• Experiment with Illustra’s R-trees
  – Comb sets: {[1,10], [10001,10010], ...}
  – 30 data sets, each of 10,000 combs
  – vary data overlap, numranges (compression)
  – 5 queries per dataset, searching for comb teeth

GiST Performance, cont.
[Figure: Avg. Number of I/Os (axis up to 5000) plotted against Data Overlap (0 to 1) and Compression Loss (0 to 0.5)]

Future Directions in Indexing
• Indexability theory:
  – when is an index useful? Papadimitriou?
• New things to index! Queries over:
  – sets, sequences/text (REs), graphs, multimedia, molecular structures...
• Lossy compression techniques
• Algorithmic improvements?
  – (R*-tree techniques?)

The Gist of the GiST
• Boil search trees down to their essence.
• Unify B+-tree, R-tree, etc. in one ADT.
• Extensible in terms of data and queries.
• Opens research on indexability.

Status
• Prototype implementation in Postgres95
  – currently no variable-length keys, concurrency
• Illustra/Informix port?
• General purpose C++ library planned
• Papers, etc. at:
  – http://www.cs.berkeley.edu/~jmh/
