On the Analysis of Indexing Schemes

Joe Hellerstein (Berkeley), Elias Koutsoupias (UCLA), Christos Papadimitriou (Berkeley)

Background: GiSTs
• Generalized Search Tree (GiST)
  – an extensible index structure
  – generalizes B+-tree, R-tree, TV-tree, many others…
  – see Kornacker’s SIGMOD talk for more details
[Figure: internal nodes (directory) holding predicates (pred), above leaf nodes (linked list)]

The Big Picture
• GiST: Turing Machine of indexing
  – can index any data type
  – can support any set of queries!!
• So what is an index?
  – A clustering scheme for data...
  – ...with a “directory”
• For secondary storage:
  – cluster size = disk block
  – directory: high-fanout, balanced tree

Wanted: Theory of Indexability
• Systems solution with a (big!) theory problem
  – some things are more “indexable” than others
  – how to characterize this?
• Keep faithful to the systems problem
  – cost metric is # of I/Os
  – block size is a fundamental parameter!

Outline
• Formalizing indexability
• Lower bounds for some typical workloads
  – range queries on n-d points
  – set/subset queries
• Space/time tradeoffs
  – replicating items in the index
• Future Work (lots!)

Framework
• Indexing workload: $(D, I, Q)$
  – $D$: a domain (e.g. $\mathbb{Z}$, $\mathbb{R}^2$, $\mathcal{P}(\mathbb{Z})$)
  – $I \subset D$: a (finite) instance
  – $Q \subset \mathcal{P}(I)$: a set of queries
  – workload : indexability $\approx$ language : complexity
• Indexing scheme
  – collection $S = \{S_1, \dots, S_n\}$ of blocks, with $\bigcup_i S_i = I$
  – $|S_i| = B$ (200 or so)
  – scheme performance $\approx$ algorithm performance
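The definition of an indexing scheme can be checked mechanically. The following is a minimal Python sketch, not part of the original talk; the function name `is_indexing_scheme` and the toy instance are assumptions made for illustration.

```python
def is_indexing_scheme(blocks, instance, B):
    """Return True if `blocks` is an indexing scheme for `instance`:
    every block is a B-element subset of the instance, and the union
    of all blocks equals the instance.

    Hypothetical helper for illustration only.
    """
    instance = set(instance)
    blocks = [set(b) for b in blocks]
    # Each block S_i must satisfy |S_i| = B and S_i is a subset of I.
    sizes_ok = all(len(b) == B and b <= instance for b in blocks)
    # The blocks together must cover the whole instance.
    covered = set().union(*blocks) == instance
    return sizes_ok and covered


# Toy instance I = {0, ..., 7} with block size B = 4.
print(is_indexing_scheme([{0, 1, 2, 3}, {4, 5, 6, 7}], range(8), 4))
print(is_indexing_scheme([{0, 1, 2, 3}], range(8), 4))
```

The first call describes a valid scheme; the second leaves elements 4 through 7 uncovered and is rejected.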

Performance Measures
• Access Overhead (time)
  – for a query $Q$: (# of blocks needed to cover $Q$) / $\lceil |Q|/B \rceil$
  – Note: worst possible performance = $B$
  – access overhead of an indexing scheme is the max access overhead over all queries
[Figure: ideal performance vs. actual performance]
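As a rough illustration of the access-overhead measure, here is a small Python sketch. It is not from the talk: the function names are hypothetical, and the minimum cover is found by brute force over subsets of blocks, which is only practical for toy inputs.

```python
from itertools import combinations
from math import ceil


def min_cover_size(blocks, query):
    """Smallest number of blocks whose union contains `query`.

    Brute-force search (assumption: inputs are tiny); hypothetical helper.
    """
    query = set(query)
    # Only the part of each block that overlaps the query matters.
    useful = [set(b) & query for b in blocks if set(b) & query]
    for k in range(len(useful) + 1):
        for combo in combinations(useful, k):
            if set().union(*combo) == query:
                return k
    raise ValueError("query is not covered by the scheme")


def access_overhead(blocks, queries, B):
    """Max over non-empty queries of (blocks needed) / ceil(|Q| / B)."""
    return max(
        min_cover_size(blocks, q) / ceil(len(q) / B)
        for q in queries
        if q
    )


blocks = [{0, 1, 2, 3}, {4, 5, 6, 7}]
queries = [{0, 1, 2, 3}, {2, 3, 4, 5}]
print(access_overhead(blocks, queries, B=4))  # 2.0
```

The first query fits in one block (overhead 1), while the second straddles both blocks even though it would ideally need only one, so the scheme's access overhead is 2.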

Performance measures, cont.
• Storage Redundancy (space)
  – redundancy: max # of blocks containing an element of $I$
  – average redundancy: $|S| / (|I|/B)$
[Figure: ideal storage vs. actual storage]
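Both storage measures are simple counts. A minimal Python sketch follows, with hypothetical function names and a toy scheme chosen for illustration.

```python
from collections import Counter


def redundancy(blocks):
    """Max number of blocks that contain any single element."""
    counts = Counter(x for b in blocks for x in set(b))
    return max(counts.values())


def average_redundancy(blocks, instance, B):
    """Number of blocks used, |S|, relative to the ideal |I| / B."""
    return len(blocks) / (len(set(instance)) / B)


# Toy scheme over I = {0, ..., 7}, B = 4: a partition plus one extra block.
blocks = [{0, 1, 2, 3}, {4, 5, 6, 7}, {2, 3, 4, 5}]
print(redundancy(blocks))                       # 2
print(average_redundancy(blocks, range(8), 4))  # 1.5
```

Elements 2 through 5 appear in two blocks each, giving redundancy 2, while three blocks are used where two would suffice, giving average redundancy 1.5.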

Range queries
• Theorem:
  – Any indexing scheme of redundancy 1 for 2-d range queries has access overhead at least $B^{1/2}$.
  – For $d$-dimensional queries, the lower bound is $B^{1-1/d}$.
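To get a feel for the size of these bounds, one can plug in the block size $B = 200$ mentioned in the framework. The values below are rounded, and the choice $d = 3$ is only an example.

```latex
\begin{align*}
d = 2:\quad & B^{1/2} = 200^{1/2} \approx 14.1 \\
d = 3:\quad & B^{1 - 1/3} = 200^{2/3} \approx 34.2 \\
d \to \infty:\quad & B^{1 - 1/d} \to B = 200
\end{align*}
```

So without replication, some 2-d range query costs roughly 14 times the ideal number of block reads, and the bound approaches the worst possible overhead $B$ as the dimension grows.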

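The 2-d case
– instance: all $n^2$ “grid” points
– queries: $1 \times B$ and $B \times 1$ ($2n^2/B$ of them)
– a block $S$ intersects $x$ horizontal and $y$ vertical “lines”
– $xy \ge B$, so $x + y \ge 2B^{1/2}$
– so each block intersects at least $2B^{1/2}$ of the queries
– # of intersecting query/block pairs $\ge 2B^{1/2}\,|S| = 2B^{1/2}\, n^2/B$, where $|S| = n^2/B$ is the number of blocks at redundancy 1
– dividing by the $2n^2/B$ queries, avg. # of blocks per query $\ge B^{1/2}$. Q.E.D.
– Note: worst & random access overhead = $B^{1/2}$
– the $d$-dimensional case is a straightforward extension
[Figure: $n \times n$ grid with a block $S$]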
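The counting argument can be checked on a small grid. The Python sketch below is an illustration, not part of the talk: the layouts `square_tiling` and `row_major`, the helper names, and the parameters $n = 32$, $B = 16$ are all assumptions. Because both layouts are partitions (redundancy 1), the number of blocks a query touches is exactly the number needed to cover it.

```python
from math import ceil, isqrt


def square_tiling(n, B):
    """Partition the n x n grid into sqrt(B) x sqrt(B) square blocks."""
    s = isqrt(B)
    assert s * s == B and n % s == 0, "sketch assumes B is a square dividing n evenly"
    tiles = {}
    for x in range(n):
        for y in range(n):
            tiles.setdefault((x // s, y // s), set()).add((x, y))
    return list(tiles.values())


def row_major(n, B):
    """Partition the grid into 1 x B horizontal strips."""
    assert n % B == 0, "sketch assumes B divides n"
    return [{(x, y) for y in range(y0, y0 + B)}
            for x in range(n) for y0 in range(0, n, B)]


def line_queries(n, B):
    """All 1 x B and B x 1 queries aligned to multiples of B."""
    horiz = [{(x, y) for y in range(y0, y0 + B)}
             for x in range(n) for y0 in range(0, n, B)]
    vert = [{(x, y) for x in range(x0, x0 + B)}
            for y in range(n) for x0 in range(0, n, B)]
    return horiz + vert


def worst_overhead(blocks, queries, B):
    """Max over queries of (blocks touched) / ceil(|Q| / B)."""
    return max(sum(1 for b in blocks if b & q) / ceil(len(q) / B)
               for q in queries)


n, B = 32, 16
queries = line_queries(n, B)
print(len(queries))                                     # 2 * n^2 / B = 128
print(worst_overhead(square_tiling(n, B), queries, B))  # 4.0 = B ** 0.5
print(worst_overhead(row_major(n, B), queries, B))      # 16.0 = B
```

The square tiling meets the $B^{1/2}$ bound on every query, while the row-major layout is ideal for horizontal queries but hits the worst possible overhead $B$ on vertical ones.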

Same problem, with redundancy
• Theorem:
  – The access overhead $a$ and redundancy $r$ must satisfy $r\,a^2 \log(2a^2) \ge (\log B)/2$.
• Proof
  – uses a result from extremal set theory (a/k/a Johnson’s Lemma, coding theory)
  – assumes $n = \Omega(B^2)$
• Conjecture
  – $r \ge \log B / \log a$?

Set/subset workloads
– Domain is $\mathcal{P}(\{1, \dots, n\})$, $n > rB^2$
– Query: find all sets contained in $s$
– Theorem: For each redundancy $r$, there exists a set workload with access overhead $B$.
– Proof: instance is the singleton sets $\{1\}, \dots, \{n\}$; a query is a subset of $B$ items from $\{1, \dots, n\}$.
  • Each element can be in the same block with at most $rB$ elements
  • So there are $n/(rB) \ge B$ elements such that no 2 of them are in the same block. A query with $B$ of these takes $B$ blocks.
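The proof is constructive: a bad query can be found greedily. Below is a minimal Python sketch of that argument, not taken from the talk. Singleton sets $\{i\}$ are represented by the integer $i$, and the function name `spread_out_query` and the example scheme are assumptions.

```python
def spread_out_query(blocks, elements, B):
    """Greedily pick B elements such that no two share a block.

    Each pick rules out at most r*B elements (the r blocks holding it,
    B elements each), so with n > r * B**2 elements the greedy loop
    always reaches B picks. Hypothetical helper for illustration.
    """
    blocks = [set(b) for b in blocks]
    remaining = set(elements)
    chosen = []
    for e in sorted(elements):
        if e not in remaining:
            continue
        chosen.append(e)
        # Discard everything that shares a block with e (including e).
        for b in blocks:
            if e in b:
                remaining -= b
        remaining.discard(e)
        if len(chosen) == B:
            return chosen
    raise ValueError("too few elements: need n > r * B**2")


# Example scheme with n = 40, B = 4, r = 2 (so n > r * B**2 = 32):
# one copy stored in consecutive runs, one copy stored with stride 10.
n, B = 40, 4
consecutive = [set(range(i, i + B)) for i in range(0, n, B)]
strided = [{i, i + 10, i + 20, i + 30} for i in range(10)]
blocks = consecutive + strided

query = spread_out_query(blocks, range(n), B)
# No block holds two query elements, so answering needs B separate blocks.
print(all(len(b & set(query)) <= 1 for b in blocks))
```

Since the query has exactly $B$ elements, the ideal cost is one block, and the scheme's access overhead on this query is $B$.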

Related work
– Main memory structures
– The Brown connection
  • upper bounds and structures for 2-d range queries [Kanellakis, Ramaswamy, Subramanian, Vengroff, Vitter, et al.]
  • special case of Thm 1 by Kanellakis, et al. in a recent version of PODS ’93 paper
  • “additive” lower bound redundancy result in SODA ’95 paper (2-d range queries)
– Empirical/statistical studies

Future work
• Improve results for range queries
  – prove conjecture
  – non-grid-point workloads
  – restrict aspect ratio of queries
  – upper bounds? Simple ones are easy, but…
    • 2-d range queries: $2B^{1/(2r)} + 2$
• Set inclusion workloads
  – shed light on “anomalous” workloads
  – isomorphisms between workloads

More future work
• Dynamic/online version of problem
  – insertion/deletion to index
  – growing query set (learning theory?)
• Complexity of indexing schemes
  – How hard is it to find a “good” indexing scheme for a workload?
  – How hard is it to find a covering set of blocks for a query?

More applied future work
• Indexability and constructive results for natural workloads
  – R.E. queries over strings
  – spatial layout queries over images
  – near-neighbor queries in n-d
  – harmonic-progression queries over MIDI
  – etc.
• Use performance measures in empirical analysis
  – spatial index benchmarking

Wanted: Theory!
• This is a genuine systems challenge
  – amenable to interesting theory
  – framework exists for implementing/testing results/conjectures in commercial systems!
• Opportunity for a nice theory/systems feedback loop
• More info?
  – http://gist.cs.berkeley.edu
