On the Analysis of Indexing Schemes

PODS '97 talk on Indexability Theory. Elderly PowerPoint slides auto-upgraded; fidelity may not be perfect.

Joe Hellerstein

June 02, 1997

Transcript

  1. On The Analysis of Indexing Schemes

    Joe Hellerstein (Berkeley), Elias Koutsoupias (UCLA), Christos Papadimitriou (Berkeley)
  2. Background: GiSTs

    • Generalized Search Tree (GiST)
      – an extensible index structure
      – generalizes B+-tree, R-tree, TV-tree, many others…
      – see Kornacker’s SIGMOD talk for more details
    [Figure: tree with internal nodes (directory) holding predicates (pred, pred, ...) above leaf nodes (linked list)]
  3. The Big Picture

    • GiST: Turing Machine of indexing
      – can index any data type
      – can support any set of queries!!
    • So what is an index?
      – A clustering scheme for data...
      – ...with a “directory”
    • For 2-ary storage:
      – cluster size = disk block
      – directory: high-fanout, balanced tree
  4. Wanted: Theory of Indexability

    • Systems solution with a (big!) theory problem
      – some things are more “indexable” than others
      – how to characterize this?
    • Keep faithful to the systems problem
      – cost metric is #I/Os
      – block size is a fundamental parameter!
  5. Outline

    • Formalizing indexability
    • Lower bounds for some typical workloads
      – range queries on n-d points
      – set/subset queries
    • Space/time tradeoffs
      – replicating items in the index
    • Future Work (lots!)
  6. Framework

    • Indexing workload: $(D, I, Q)$
      – $D$: a domain (e.g. $\mathbb{Z}$, $\mathbb{R}^2$, $\mathcal{P}(\mathbb{Z})$)
      – $I \subset D$: a (finite) instance
      – $Q \subset \mathcal{P}(I)$: a set of queries
      – workload : indexability ≈ language : complexity
    • Indexing scheme
      – collection $S = \{S_1, \dots, S_n\}$ of blocks, with $\bigcup_i S_i = I$
      – $|S_i| = B$ (200 or so)
      – scheme performance ≈ algorithm perf.
  7. Performance Measures

    • Access Overhead (time)
      – (# blocks to cover $Q$) / $\lceil |Q|/B \rceil$
      – Note: worst possible performance = $B$
      – access overhead of an indexing scheme is max access overhead over all queries
    [Figure: ideal performance vs. actual performance]
  8. Performance measures, cont.

    • Storage Redundancy (space)
      – max # of blocks containing an element of $I$
      – Average redundancy: $|S| / (|I|/B)$
    [Figure: ideal storage vs. actual storage]
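
As an illustration of the measures on slides 7 and 8, here is a minimal Python sketch that is not part of the talk. The function names are hypothetical, blocks and queries are modelled as plain sets, queries are assumed non-empty, and the minimum cover is found by brute force, so it only suits toy instances.

```python
from itertools import combinations
from math import ceil


def min_cover_size(query, blocks):
    """Smallest number of blocks whose union contains `query` (brute force)."""
    query = frozenset(query)
    # Only the part of each block that overlaps the query can help cover it.
    useful = [frozenset(b) & query for b in blocks if frozenset(b) & query]
    for k in range(1, len(useful) + 1):
        for combo in combinations(useful, k):
            if frozenset().union(*combo) == query:
                return k
    raise ValueError("query is not covered by the scheme")


def access_overhead(blocks, queries, B):
    """Max over queries of (# blocks to cover Q) / ceil(|Q| / B)."""
    return max(min_cover_size(q, blocks) / ceil(len(q) / B) for q in queries)


def redundancy(blocks, instance):
    """Max number of blocks containing any single element of the instance."""
    return max(sum(x in b for b in blocks) for x in instance)


def average_redundancy(blocks, instance, B):
    """|S| / (|I| / B): blocks used relative to the fewest possible."""
    return len(blocks) / (len(instance) / B)
```

For example, with `B = 4`, blocks `{0,1,2,3}` and `{4,5,6,7}`, and queries `{0,1,2,3}` and `{2,3,4,5}`, the second query straddles both blocks, so the scheme's access overhead is 2 while its redundancy is 1.
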
  9. Range queries

    • Theorem:
      – Any indexing scheme of redundancy 1 for 2-d range queries has access overhead at least $B^{1/2}$.
      – For d-d queries, lower bound is $B^{1-1/d}$
  10. The 2-d case

    – all $n^2$ “grid” points
    – $1 \times B$, $B \times 1$ queries ($2n^2/B$ of them)
    – block $S$ intersects $x$ horizontal, $y$ vertical “lines”
    – $xy \ge B$, so $x + y \ge 2B^{1/2}$
    – so $S$ intersects at least $2B^{1/2}$ of the queries
    – # of intersecting query/block pairs $\ge 2B^{1/2}|S| = 2B^{1/2} n^2/B$
    – avg. # of blocks per query: $B^{1/2}$. Q.E.D.
    – Note: worst & random access overhead = $B^{1/2}$
    – n-d case a straightforward extension
    [Figure: $n \times n$ grid with a block $S$]
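
The counting argument on slide 10 can be checked on a small grid. The sketch below is an illustration rather than anything from the talk: the function names are hypothetical, and it assumes that `B` is a perfect square, that `n` is a multiple of `B`, and that the queries are the aligned 1×B and B×1 strips that tile the grid. With redundancy 1 every point lies in exactly one block, so the number of blocks needed for a query is simply the number of distinct blocks it touches.

```python
from math import isqrt


def grid_queries(n, B):
    """Aligned 1xB and Bx1 queries tiling an n x n grid (2 * n^2 / B in total)."""
    horizontal = [[(r, c + i) for i in range(B)]
                  for r in range(n) for c in range(0, n, B)]
    vertical = [[(r + i, c) for i in range(B)]
                for c in range(n) for r in range(0, n, B)]
    return horizontal + vertical


def overhead(block_of, n, B):
    """Access overhead of a partition of the grid into blocks of B points.

    Each query holds exactly B points, so the ideal cost is one block and the
    overhead equals the largest number of distinct blocks any query touches.
    """
    return max(len({block_of(p) for p in q}) for q in grid_queries(n, B))


def row_major(B):
    # Blocks are 1 x B strips along each row.
    return lambda p: (p[0], p[1] // B)


def square_tiles(B):
    # Blocks are sqrt(B) x sqrt(B) tiles; assumes B is a perfect square.
    s = isqrt(B)
    return lambda p: (p[0] // s, p[1] // s)
```

With `n = 16` and `B = 16`, `overhead(row_major(16), 16, 16)` is 16 (the worst case, B, caused by the vertical queries), while `overhead(square_tiles(16), 16, 16)` is 4, which meets the $B^{1/2}$ lower bound.
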
  11. Same problem, with redundancy

    • Theorem:
      – The access overhead $a$ and redundancy $r$ must satisfy $r a^2 \log(2a^2) \ge (\log B)/2$.
    • Proof
      – uses result from extremal set theory
        • (a/k/a Johnson’s Lemma, coding theory)
      – assumes $n = \Omega(B^2)$
    • Conjecture
      – $r \ge \log B / \log a$ ?
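
To get a feel for the tradeoff on slide 11, the inequality can be solved for the redundancy. This is a small sketch under an assumption: it reads the slide's (extraction-damaged) formula as r · a² · log(2a²) ≥ (log B)/2, and the function name is hypothetical. The base of the logarithm cancels, so natural logs are used.

```python
from math import log


def min_redundancy(a, B):
    """Smallest r permitted by r * a^2 * log(2 a^2) >= (log B) / 2, for a >= 1."""
    return log(B) / (2 * a**2 * log(2 * a**2))
```

For `B = 256`, an access overhead of `a = 1` requires redundancy of at least about 4, whereas for `a = 2` the bound drops below 1 and says nothing beyond the trivial r ≥ 1.
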
  12. Set/subset workloads

    – Domain is $\mathcal{P}(\{1, \dots, n\})$, $n > rB^2$
    – Query: find all sets contained in $s$
    – Theorem: For each redundancy $r$, there exists a set workload with access overhead $B$.
    – Proof: instance is singleton sets $\{1\}, \dots, \{n\}$, query is subset of $B$ items from $\{1, \dots, n\}$.
      • Each element can be in same block with at most $rB$ elements
      • So there are $n/(rB) \ge B$ elements such that no 2 of them are in the same block. Query with $B$ of these takes $B$ blocks.
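
The proof on slide 12 is constructive, and the bad query can be found greedily. The following is a hedged sketch rather than the authors' procedure: the function name is hypothetical, and each singleton set {i} is represented by the integer i.

```python
def spread_out_elements(blocks, elements, B):
    """Greedily pick B elements, no two of which share a block."""
    # For each element, collect everything that sits in some block with it.
    neighbours = {x: set() for x in elements}
    for block in blocks:
        for x in block:
            neighbours[x] |= set(block)
    chosen, excluded = [], set()
    for x in elements:
        if x not in excluded:
            chosen.append(x)
            # Rule out every element that shares a block with x.
            excluded |= neighbours[x]
            if len(chosen) == B:
                return chosen
    raise ValueError("too few elements: the argument needs n > r * B^2")
```

Each pick rules out at most rB elements, so with n > rB² the loop always reaches B picks. For `B = 3` and blocks `{1,2,3}`, `{4,5,6}`, `{7,8,9}`, `{10}`, it returns `[1, 4, 7]`; a query for those three items needs 3 blocks where 1 would be ideal, giving access overhead B.
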
  13. Related work

    – Main memory structures
    – The Brown connection
      • upper bounds and structures for 2-d range queries [Kanellakis, Ramaswamy, Subramanian, Vengroff, Vitter, et al.]
      • special case of Thm 1 by Kanellakis, et al. in a recent version of PODS ‘93 paper
      • “additive” lower bound redundancy result in SODA ‘95 paper (2-d range queries)
    – Empirical/statistical studies
  14. Future work

    • Improve results for range queries
      – prove conjecture
      – non-grid-point workloads
      – Restrict aspect ratio of queries
      – upper bounds? Simple ones are easy, but…
        • 2-d range queries: $2B^{1/(2r)} + 2$
    • Set inclusion workloads
      – shed light on “anomalous” workloads
      – isomorphisms between workloads
  15. More future work

    • Dynamic/online version of problem
      – insertion/deletion to index
      – growing query set (learning theory?)
    • Complexity of indexing schemes
      – How hard is it to find a “good” indexing scheme for a workload?
      – How hard is it to find a covering set of blocks for a query?
  16. More applied future work

    • Indexability and constructive results for natural workloads
      – R.E. queries over strings
      – spatial layout queries over images
      – near-neighbor queries in n-d
      – harmonic-progression queries over MIDI.
      – Etc.
    • Use performance measures in empirical analysis
      – spatial index benchmarking
  17. Wanted: Theory!

    • This is a genuine systems challenge
      – amenable to interesting theory
      – framework exists for implementing/testing results/conjectures in commercial systems!
    • Opportunity for a nice theory/systems feedback loop
    • More info?
      – http://gist.cs.berkeley.edu