GiST: A Generalized Search Tree for Database Systems

A talk given at Hebrew University in Jerusalem, Tel Aviv University, UC Berkeley, Brown University, and IBM Almaden Research Center. An extended version of a talk given at VLDB95.

Joe Hellerstein

September 01, 1995

Transcript

  1. jmh - GiST 1/19/96, p 1 GiST: A Generalized Search Tree for Database Systems

    Joe Hellerstein, UC Berkeley
  2. Road Map

    • Motivation
    • Intuition on Generalized Search Trees
    • Overview of GiST ADT
    • Example indices: integers, polygons & sets
    • Implementation challenges
    • Open problems in indexing research
  3. Indexing in OO/OR Systems

    • Quick access to user-defined objects
    • Support queries natural to the objects
    • Two previous approaches
      – Specialized Indices (“ABCDEFG-trees”)
        » redundant code: most trees are very similar
        » concurrency control, etc. tricky!
      – Extensible B-trees & R-trees (Postgres/Illustra)
        » B-tree or R-tree lookups only!
        » E.g. WHERE movie.video < 'Terminator 2'
  4. A Third Approach

    • A generalized search tree. Must be:
    • Extensible in terms of queries
    • General (B+-tree, R-tree, etc.)
    • Easy to extend
    • Efficient (match specialized trees)
    • Highly concurrent, recoverable, etc.
  5. Uses for GiSTs

    • New indexes needed for new apps...
      – find all supersets of S
      – find all molecules that bind to M
      – your favorite query here (multimedia?)
    • ...and for new queries over old domains:
      – find all points in region from 12 to 2 o’clock
      – find all strings that match R.E.
  6. Database Search Trees from 40,000 feet

    [Figure labels: Internal Nodes (directory); Leaf Nodes (linked list)]
  7. Database Search Trees from 30,000 feet

    [Figure labels: Internal Nodes (directory); Leaf Nodes (linked list); key1 key2 ...]
  8. GiST: Generalized Search Tree

    • Structure: balanced tree of (p, ptr) pairs
      – p is a key “predicate”
      – p holds for all objects below ptr
      – keys on a page may overlap
    • Key predicates: a user-defined class
      – This is the only extensibility required!
  9. Key Methods

    • Search:
      – Consistent(E, q): E.p ∧ q? (no/maybe)
    • Characterization:
      – Union(P): new key that holds for all tuples in P
    • Categorization:
      – Penalty(E1, E2): penalty of inserting E2 in subtree E1
      – PickSplit(P): split P into two groups of entries
  10. Search

    • General technique:
      – traverse tree where Consistent is TRUE
    • For range predicates on ordered domain:
      – user specifies IsOrdered
      – user registers Compare(p1, p2) operator
      – methods ensure ordered, non-overlapping keys
      – traverse leftmost Consistent branch
      – scan right across bottom.
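The general search technique can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the talk's implementation: the `Entry` and `Node` classes, the `search` function, and the closed-integer-range keys with their `overlaps` test are all hypothetical names chosen here. It covers only the general technique (follow every entry where Consistent answers "maybe"), not the ordered-domain scan.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Entry:
    p: Any    # key predicate
    ptr: Any  # child Node on internal pages, data item on leaf pages


@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)


def search(node: Node, q: Any, consistent: Callable[[Entry, Any], bool]) -> list:
    """Collect every leaf item whose key may satisfy the query q."""
    results = []
    for e in node.entries:
        if not consistent(e, q):
            continue  # "no": nothing below this entry can match
        if node.is_leaf:
            results.append(e.ptr)
        else:
            results.extend(search(e.ptr, q, consistent))
    return results


# Toy key class (an assumption for this example): closed integer ranges (lo, hi).
def overlaps(e: Entry, q: tuple) -> bool:
    return e.p[0] <= q[1] and q[0] <= e.p[1]


leaf1 = Node(True, [Entry((1, 1), "a"), Entry((4, 4), "b")])
leaf2 = Node(True, [Entry((7, 7), "c"), Entry((9, 9), "d")])
root = Node(False, [Entry((1, 4), leaf1), Entry((7, 9), leaf2)])

print(search(root, (4, 7), overlaps))  # ['b', 'c']
```

Because keys on a page may overlap, the loop may descend into several children of the same node; that is the difference from a plain B+-tree lookup.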
  11. Insert

    • descend tree along least increase in Penalty
    • if there’s room at leaf, insert there
    • else split according to PickSplit
    • propagate changes using Union
    • Notes:
      – on overflow, can do R*-tree style reinsert
      – for ordered keys, Penalty needs to keep order
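The insertion steps can be sketched in Python as follows. This is a minimal sketch under assumptions, not the talk's code: the page capacity `MAX_ENTRIES`, the `Entry`/`Node` classes, the helper `insert_root`, and the closed-range key methods (`union`, `penalty`, `pick_split`) are hypothetical stand-ins for user-supplied methods. R*-tree style reinsertion is not shown.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

MAX_ENTRIES = 3  # assumed tiny page capacity so that a split is easy to trigger


@dataclass
class Entry:
    p: Any
    ptr: Any


@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)


# Stand-in key methods for closed numeric ranges (lo, hi).
def union(keys):
    return (min(k[0] for k in keys), max(k[1] for k in keys))


def penalty(p1, p2):
    u = union([p1, p2])
    return (u[1] - u[0]) - (p1[1] - p1[0])  # how much p1 must grow to cover p2


def pick_split(entries):
    ordered = sorted(entries, key=lambda e: e.p)
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def insert(node: Node, new: Entry) -> Optional[Node]:
    """Insert below node; return a new sibling if node had to split."""
    if node.is_leaf:
        node.entries.append(new)
    else:
        # Descend along the entry with the least penalty.
        best = min(node.entries, key=lambda e: penalty(e.p, new.p))
        sibling = insert(best.ptr, new)
        # Propagate the change upward with Union.
        best.p = union([e.p for e in best.ptr.entries])
        if sibling is not None:
            node.entries.append(Entry(union([e.p for e in sibling.entries]), sibling))
    if len(node.entries) <= MAX_ENTRIES:
        return None
    left, right = pick_split(node.entries)
    node.entries = left
    return Node(node.is_leaf, right)


def insert_root(root: Node, new: Entry) -> Node:
    """Insert and grow a new root if the old root split."""
    sibling = insert(root, new)
    if sibling is None:
        return root
    return Node(False, [Entry(union([e.p for e in n.entries]), n) for n in (root, sibling)])


root = Node(True)
for v in [5, 1, 9, 3, 7]:
    root = insert_root(root, Entry((v, v), f"item{v}"))

print([e.p for e in root.entries])  # [(1, 3), (5, 9)]
```

The fourth insert overflows the single leaf, so PickSplit divides it and a new root is built from the Union of each half; the fifth insert then descends into the `(5, 9)` subtree because its penalty is zero.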
  12. Delete

    • find the entry via Search, and delete it
    • propagate changes using Union
    • on underflow:
      – if ordered keys, do B+-tree style borrow/coalesce
      – else reinsert stuff on page and delete page
  13. GiSTs over ℤ (B+-trees)

    • Logically, keys represent ranges [x, y)
    • Queries: Contains([a, b), v)
    • Consistent(E, q): (x < b) ∧ (y > a)
    • Union(P): [MIN(x_i), MAX(y_i))
    • Penalty(E1, E2):
      – return MAX(y2 - y1, 0) + MAX(x1 - x2, 0)
      – if E1 is leftmost or rightmost, drop a term
    • PickSplit(P): split evenly in order
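The four B+-tree key methods translate almost directly into Python. The sketch below is illustrative only: keys are represented as `(x, y)` tuples standing for the half-open range [x, y), and the `leftmost`/`rightmost` flags are a hypothetical way of passing position information. Which term is dropped is an assumption here: the sketch drops the left-growth term for the leftmost entry and the right-growth term for the rightmost entry, since those sides are logically unbounded.

```python
def consistent(key, query):
    """[x, y) may contain a value from [a, b) iff (x < b) and (y > a)."""
    (x, y), (a, b) = key, query
    return x < b and y > a


def union(keys):
    """Smallest range covering all keys: [MIN(x_i), MAX(y_i))."""
    return (min(x for x, _ in keys), max(y for _, y in keys))


def penalty(e1, e2, leftmost=False, rightmost=False):
    """How far e1 = [x1, y1) must stretch to cover e2 = [x2, y2)."""
    (x1, y1), (x2, y2) = e1, e2
    grow_right = 0 if rightmost else max(y2 - y1, 0)  # assumed: rightmost ignores right growth
    grow_left = 0 if leftmost else max(x1 - x2, 0)    # assumed: leftmost ignores left growth
    return grow_right + grow_left


def pick_split(keys):
    """Split evenly in key order."""
    ordered = sorted(keys)
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


print(consistent((10, 20), (15, 25)))  # True
print(consistent((10, 20), (20, 30)))  # False: 20 is outside [10, 20)
print(union([(10, 20), (5, 12)]))      # (5, 20)
print(penalty((10, 20), (18, 25)))     # 5
```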
  14. Key Compression

    • Keys may take up too much room on a page
    • Two extra key methods:
      – Compress(E)/Decompress(E)
    • Compression can be lossy: over-generalization OK
  15. A B+-tree Page

    Logical Representation:               [-∞, 40)  [40, 60)  [60, 137)  [137, 201)  [201, ∞)
    Physical Representation (compressed): <null>    40        60         137         201
  16. B+-tree Compression

    • Compress(E=([x,y), ptr)):
      – if E is leftmost return NULL, else return x
    • Decompress(E=(π, ptr)):
      – if E is leftmost, let x = -∞, else let x = π.
      – if E is rightmost, let y = ∞, else let y be the value stored in the next key on the right.
      – if E is rightmost on a leaf page, let y = x+1.
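As a rough illustration of these rules, the Python sketch below compresses and decompresses a whole page at once. The page-level functions `compress_page` and `decompress_page` and the `is_leaf` flag are hypothetical conveniences, because Decompress needs to see the neighbouring key and the entry's position; the slide's per-entry methods would get that context from the tree. The rules are applied literally as stated above.

```python
import math


def compress_page(ranges):
    """ranges: ordered, non-overlapping [x, y) keys on one page, left to right.

    The leftmost entry stores nothing (None stands in for NULL);
    every other entry stores only its lower bound x.
    """
    return [None if i == 0 else x for i, (x, _) in enumerate(ranges)]


def decompress_page(stored, is_leaf=False):
    """Rebuild [x, y) ranges from the stored lower bounds."""
    last = len(stored) - 1
    ranges = []
    for i, pi in enumerate(stored):
        x = -math.inf if i == 0 else pi
        if i == last:
            # Rightmost: unbounded on an internal page, x + 1 on a leaf page.
            y = x + 1 if is_leaf else math.inf
        else:
            y = stored[i + 1]  # value stored in the next key on the right
        ranges.append((x, y))
    return ranges


page = [(-math.inf, 40), (40, 60), (60, 137), (137, 201), (201, math.inf)]
stored = compress_page(page)
print(stored)                           # [None, 40, 60, 137, 201]
print(decompress_page(stored) == page)  # True
```

On this internal page the round trip is lossless, and each key shrinks from two bounds to at most one stored value.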
  17. GiSTs over ℝ² (R-tree)

    • Logically, keys represent bounding boxes
    • Queries: Contains, Overlaps, Equals
    • Consistent(E, q): does E.p overlap q?
    • Union(P): bounding box of all entries
    • Compress(E): form bounding box
    • Decompress(E): identity function
    • Penalty(E, F): size(Union({E, F})) - size(E)
    • PickSplit(P): R-tree or R*-tree methods
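A small Python sketch of the R-tree key methods follows. It rests on assumptions not stated on the slide: boxes are `(xmin, ymin, xmax, ymax)` tuples, `size` is taken to mean area, and boxes that merely touch count as overlapping. PickSplit is omitted because the slide defers to the R-tree and R*-tree algorithms.

```python
def consistent(box, q):
    """Overlap test between a key's bounding box and a query box."""
    return (box[0] <= q[2] and q[0] <= box[2] and
            box[1] <= q[3] and q[1] <= box[3])


def union(boxes):
    """Bounding box of all entries."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def size(box):
    """Area of a box (assumed meaning of 'size')."""
    return (box[2] - box[0]) * (box[3] - box[1])


def penalty(e, f):
    """Extra area needed for e to also cover f."""
    return size(union([e, f])) - size(e)


e, f = (0, 0, 2, 2), (1, 1, 4, 3)
print(union([e, f]))                 # (0, 0, 4, 3)
print(penalty(e, f))                 # 8
print(consistent(e, (3, 3, 4, 4)))   # False
```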
  18. GiSTs over P(ℤ) (RD-tree)

    • Logically, keys represent bounding sets
    • Queries: Contains, Overlaps, Equals
    • Consistent(E, q): is E.p ∩ q ≠ ∅?
    • Union(P): set-union of keys
    • Compress(E): Bloom filters, rangesets, etc.
    • Decompress(E): match compress
    • Penalty(E, F): |E.p ∪ F.p| - |E.p|
    • PickSplit(P): R-tree algorithms
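The RD-tree key methods map naturally onto Python's `frozenset`. The sketch below is illustrative, assuming uncompressed keys; it uses a few of the course sets from the RD-tree example in this deck as data, and the function names are hypothetical. Compress/Decompress (Bloom filters, rangesets) and PickSplit are not shown.

```python
def consistent(p, q):
    """Overlap query: the key may match when p and q share an element."""
    return not p.isdisjoint(q)


def union(keys):
    """Bounding set: the set-union of all keys."""
    return frozenset().union(*keys)


def penalty(e, f):
    """Number of elements e must gain to cover f: |e ∪ f| - |e|."""
    return len(e | f) - len(e)


leaves = [
    frozenset({"CS1", "CS11", "Math221"}),
    frozenset({"Music1", "Music2", "CS1"}),
    frozenset({"CS1", "Math221", "Math22", "Math223"}),
]
key = union(leaves)  # bounding set stored in the parent entry

print(penalty(key, frozenset({"CS1", "Bus101"})))  # 1: only Bus101 is new
print(consistent(key, frozenset({"Bus101"})))      # False: this subtree can be skipped
```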
  19. An RD-tree

    [Figure: bounding sets (keys), each listed with the data sets whose union it equals]
    • {CS1, CS11, Music1, Music2, Math221, Math22, Math223}
      – {CS1, CS11, Math221}
      – {Music1, Music2, CS1}
      – {CS1, Math221, Math22, Math223}
    • {CS1, Bus101, Bus102, Bus103, Ec121, Ec122, Ec123}
      – {Bus101, Bus102, Bus103, CS1}
      – {Bus101, Ec121, Ec122, Ec123}
      – {CS1, Bus101, Ec121}
    • {CS1, CS786, CS888, Math221, Music1, Music788}
      – {Music1, CS1, Math221}
      – {Music788, CS888, CS786}
      – {CS1}
  20. Implementation Issues

    • In-memory efficiency: Node subclass
    • Concurrency, Recovery, Consistency
      – Kornacker & Banks, VLDB95
    • Variable-Length Keys
    • Bulk Loading
    • Optimizer Integration
    • Extensibility & Efficiency
  21. GiST Performance

    • B+-trees have O(log n) performance
    • R-trees, RD-trees have no such guarantee
      – search may have to traverse multiple paths
      – worst-case O(2n) to traverse entire tree
      – aggravated by random I/O: much worse than scan!
    • So: when does it pay to build/use an index?
  22. GiST Performance, cont.

    • As a first cut, look at 2 parameters:
      – data overlap & compression loss
    • Experiment with Illustra’s R-trees
      – Comb sets: {[1,10], [10001,10010], ...}
      – 30 data sets, each of 10,000 combs
      – vary data overlap, numranges (compression)
      – 5 queries per dataset, searching for comb teeth
  23. GiST Performance, cont.

    [Figure: Avg. Number of I/Os plotted against Data Overlap and Compression Loss]
  24. Future Directions in Indexing

    • Indexability theory:
      – when is an index useful? Papadimitriou?
    • New things to index! Queries over:
      – sets, sequences/text (REs), graphs, multimedia, molecular structures...
    • Lossy compression techniques
    • Algorithmic improvements?
      – (R*-tree techniques?)
  25. The Gist of the GiST

    • Boil search trees down to their essence.
    • Unify B+-tree, R-tree, etc. in one ADT.
    • Extensible in terms of data and queries.
    • Opens research on indexability.
  26. Status

    • Prototype implementation in Postgres95
      – currently no variable-length keys, concurrency
    • Illustra/Informix port?
    • General purpose C++ library planned
    • Papers, etc. at:
      – http://www.cs.berkeley.edu/~jmh/