Slide 1

Slide 1 text

The Case for Learned Index Structures Ronak Kogta

Slide 2

Slide 2 text

Essential question we want to ask
Given a query Q and a dataset D, how can we do the following operations?
• Performing a range query on a given dataset
• Performing a point query on a given dataset
• Checking whether a point exists in a given dataset or not
• Sorting the dataset
• Computing the min/max element of the dataset
• Finding the minimum distance between two elements, if the dataset is modeled as a graph
• …

Slide 3

Slide 3 text

Essential question answered, historically
• It started with The Art of Computer Programming
• Find an information-theoretic proof of a bound
• Convert the proof into an algorithm
• Describe the complexity of each operation
  • Insert
  • Delete
  • Read
  • …

Slide 4

Slide 4 text

Essential question answered, historically
• We came up with many data structures and algorithms to solve these problems effectively
• Binary trees
• Graph algorithms like Dijkstra or Floyd-Warshall
• B+ trees
• Hash maps
• Mergesort, quicksort, selection sort, bubble sort
• Bloom filters
• Heaps
• Stacks
• …

Slide 5

Slide 5 text

Core assumption we made
No assumption is made about the data on which these structures operate, and as a result we also analyze
• Worst-case scenarios
• Best-case scenarios
• Average-case scenarios

Slide 6

Slide 6 text

Debunking the assumption
For most workloads, the data is specific/tainted.
Data structures == models
[Diagram: traditional path: workload -> write an algorithm -> characterize it; learned path: workload -> run queries -> model it]

Slide 7

Slide 7 text

Pattern learning at low cost
• It pays a lot to hand-tune an algorithm to a data distribution.
• What we need is
  • a way to understand data distribution patterns
  • a way to handle them at cheap cost, without much human effort
• ML started with recognizing patterns in data
  • Linear regression models
  • Polynomial regression models
  • PCA (Principal Component Analysis), to detect feature vectors
  • SVMs

Slide 8

Slide 8 text

ML opens up an opportunity to learn a model that reflects the patterns in the data, and thus helps us discover what we call learned indexes.

Slide 9

Slide 9 text

Paper proposes
• Many data structures can be decomposed into a learned index plus an auxiliary structure
• Data distributions can be modeled by their CDF
• Developments in CPUs/GPUs/TPUs might make “otherwise expensive” ML models as cheap, in time and storage, as traditional data structures
• NNs are especially able to learn a wide variety of data distributions and mixtures
• The challenge is to balance a model's accuracy against its complexity

Slide 10

Slide 10 text

Let’s talk about single range indexes
• Speed up searches for a subset of records, based on values in certain “search key” fields
• They are cache-friendly
• They support concurrency
• They allow key compression
• They have bounded cost for inserts and lookups
• Efficient for queries like 200 < price < 500 (see the sketch after this list)
• Data needs to be sorted
• Index selection problem
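A minimal sketch of such a range query over sorted keys in plain Python (the data here is illustrative), using two binary searches to find the slice boundaries:

import bisect

# Sorted keys are the precondition for any range index.
prices = [120, 180, 200, 250, 320, 450, 500, 640]

def range_query(keys, lo, hi):
    """Return all keys k with lo < k < hi via two binary searches."""
    left = bisect.bisect_right(keys, lo)   # first index with key > lo
    right = bisect.bisect_left(keys, hi)   # first index with key >= hi
    return keys[left:right]

print(range_query(prices, 200, 500))  # -> [250, 320, 450]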

Slide 11

Slide 11 text

B+ trees
Basic structure
• Leaves form a linked list
• Has degree d (the fan-out factor)
• Each node/leaf holds >= d and <= 2d keys, except for the root
Operations (a lookup sketch follows the list)
• Insert operation
  • Find the leaf where key k belongs
  • If there is no overflow, halt
  • If the node overflows, split it and insert the separator key into the parent
• Delete operation
  • After deleting a key, the tree may need to rebalance (rotate)
  • If rotation is not possible, merge nodes
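A lookup-only sketch of this structure, assuming a tiny hand-built tree (no split/merge logic shown): internal nodes route by separator keys, and leaves hold the sorted keys, chained together for range scans.

import bisect

class Node:
    def __init__(self, keys, children=None, next_leaf=None):
        self.keys = keys            # sorted keys (separators in internal nodes)
        self.children = children    # None for leaves
        self.next_leaf = next_leaf  # leaf-to-leaf link enabling range scans

def lookup(root, key):
    """Descend from the root to the leaf that may contain `key`."""
    node = root
    while node.children is not None:             # internal node: route down
        i = bisect.bisect_right(node.keys, key)  # child whose range covers `key`
        node = node.children[i]
    return key in node.keys                      # point query at the leaf

# Tiny example with separator 30: left leaf [10, 20], right leaf [30, 40].
leaf2 = Node([30, 40])
leaf1 = Node([10, 20], next_leaf=leaf2)
root = Node([30], children=[leaf1, leaf2])
print(lookup(root, 20), lookup(root, 35))  # -> True False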

Slide 12

Slide 12 text

Range index models are CDF models
p = F(Key) ∗ N
where p is the position estimate, F(Key) is the estimated cumulative distribution function of the data, which estimates the likelihood of observing a key smaller than or equal to the lookup key, P(X ≤ Key), and N is the total number of keys.
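A minimal sketch of p = F(Key) ∗ N with a single linear model as the CDF approximation (synthetic near-uniform keys, so a line fits well; real data would need the richer models discussed later):

import numpy as np

rng = np.random.default_rng(0)
keys = np.sort(rng.integers(0, 10_000, size=1000))
N = len(keys)

# Empirical CDF: the key at position i has F(key) ~= i / N.
cdf = np.arange(N) / N

# Approximate F with a straight line via least squares.
a, b = np.polyfit(keys, cdf, deg=1)

def predict_pos(key):
    """Position estimate p = F(key) * N."""
    return int((a * key + b) * N)

i = 500
print("predicted:", predict_pos(keys[i]), "actual:", i)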

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

Naïve implementation
• 2-layer fully connected, ReLU-activated network with 32 neurons per layer
• ~1250 predictions per second in TensorFlow
• ~80,000 ns to predict a position, not counting the actual search
• compared to ~300 ns for a B-tree, ~900 ns for binary search
• Reasons for this performance (see the sketch after this list)
  • TensorFlow is not optimized for small models, so invocation overhead dominates
  • B-trees are good at “overfitting” the data, as they rebalance/merge themselves from time to time
  • Other models may approximate the overall shape of the CDF, but still fail to be accurate at the individual-key level
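For intuition, a dependency-light NumPy sketch of that naïve model shape (two hidden layers of 32 ReLU units trained with plain gradient descent to map key -> normalized position); the hyperparameters are illustrative, not the paper's:

import numpy as np

rng = np.random.default_rng(0)
keys = np.sort(rng.uniform(0, 1, size=1000)).reshape(-1, 1)
pos = (np.arange(1000) / 1000).reshape(-1, 1)     # normalized positions = CDF

# Two hidden layers of 32 ReLU units, linear output.
W1 = rng.normal(0, 0.5, (1, 32));  b1 = np.zeros(32)
W2 = rng.normal(0, 0.5, (32, 32)); b2 = np.zeros(32)
W3 = rng.normal(0, 0.5, (32, 1));  b3 = np.zeros(1)

lr = 0.1
for step in range(2000):
    h1 = np.maximum(keys @ W1 + b1, 0)            # hidden layer 1 (ReLU)
    h2 = np.maximum(h1 @ W2 + b2, 0)              # hidden layer 2 (ReLU)
    out = h2 @ W3 + b3                            # predicted F(key)
    grad = 2 * (out - pos) / len(keys)            # d(MSE)/d(out)
    dW3 = h2.T @ grad;            db3 = grad.sum(0)
    dh2 = grad @ W3.T * (h2 > 0)                  # backprop through layer 2
    dW2 = h1.T @ dh2;             db2 = dh2.sum(0)
    dh1 = dh2 @ W2.T * (h1 > 0)                   # backprop through layer 1
    dW1 = keys.T @ dh1;           db1 = dh1.sum(0)
    for param, g in [(W1, dW1), (b1, db1), (W2, dW2),
                     (b2, db2), (W3, dW3), (b3, db3)]:
        param -= lr * g                           # plain gradient descent step

print("max position error:", int(np.abs(out - pos).max() * 1000))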

Slide 15

Slide 15 text

RM-Index
• Learned Index Framework (LIF)
  • Generates index configurations, optimizes and tests them with the help of TensorFlow
  • Does the precomputation work and generates efficient C++ code
  • Has to take ML models, page sizes, and search strategies into consideration
• Recursive Model Index
  • One model does not fit all; we can use a mixture of models
  • Kind of like building experts over subsets of the data
  • Iteratively build each stage, training every model on the keys routed to it, to form the complete model (see the sketch below)
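A minimal two-stage sketch of the recursive model idea (linear models at both stages; the 100-expert split is arbitrary): the stage-1 model routes a key to the stage-2 expert that refines the position estimate.

import numpy as np

rng = np.random.default_rng(1)
keys = np.sort(rng.lognormal(0, 1, size=10_000))   # skewed, non-uniform keys
N = len(keys)
pos = np.arange(N)

def fit_linear(x, y):
    """Least-squares line over (x, y); returns a predict function."""
    if len(x) < 2:                                 # degenerate bucket
        c = float(y[0]) if len(x) else 0.0
        return lambda k: c
    a, b = np.polyfit(x, y, 1)
    return lambda k: a * k + b

NUM_EXPERTS = 100
stage1 = fit_linear(keys, pos)

# Route every key to a stage-2 expert via the stage-1 prediction,
# then train each expert only on the keys routed to it.
buckets = np.clip(stage1(keys) * NUM_EXPERTS // N, 0, NUM_EXPERTS - 1).astype(int)
stage2 = [fit_linear(keys[buckets == m], pos[buckets == m])
          for m in range(NUM_EXPERTS)]

def predict(key):
    m = int(np.clip(stage1(key) * NUM_EXPERTS // N, 0, NUM_EXPERTS - 1))
    return int(stage2[m](key))

i = 7000
print("predicted:", predict(keys[i]), "actual:", i)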

Slide 16

Slide 16 text

RM-Index
• Hybrid index
  • At the top, use a model that can capture a broad range of patterns
  • At the lower levels, use models specialized to their part of the data, simpler regression models, or even B-trees
  • Hybrid indexes allow us to bound the worst-case performance of learned indexes to the performance of B-trees (see the sketch below)
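A sketch of that worst-case bound (hypothetical threshold, binary search standing in for a B-tree): measure each learned model's maximum position error after training, and replace any model whose error is too large with the classic structure.

import bisect
import numpy as np

keys = np.arange(0, 10_000, 2)      # sorted keys
pos = np.arange(len(keys))          # true positions
ERROR_THRESHOLD = 128               # illustrative error budget

def build(model):
    """Keep the learned model if its worst-case error is acceptable;
    otherwise fall back to binary search (a stand-in for a B-tree)."""
    err = int(np.abs(model(keys) - pos).max())
    if err <= ERROR_THRESHOLD:
        return lambda k: int(model(np.array([k]))[0])
    return lambda k: int(bisect.bisect_left(keys, k))

good = lambda x: x / 2              # happens to be exact for this data
bad = lambda x: np.zeros(len(x))    # useless model, gets replaced

print(build(good)(4242), build(bad)(4242))  # -> 2121 2121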

Slide 17

Slide 17 text

Search strategies
• Model-biased search: use the model’s position estimate to seed the search (sketch below)
• Biased quaternary search: initial middle points at pos−σ, pos, pos+σ
• Indexing strings
  • Not covered in this scope.
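One reading of model-biased search, as a sketch (σ is the model's precomputed worst-case error, assumed known here): the prediction pos seeds the search, and only the window [pos−σ, pos+σ] is examined.

import bisect

def model_biased_search(keys, key, pos, sigma):
    """Binary-search only the window the model's error bound guarantees.

    pos   -- the model's predicted position for `key`
    sigma -- the model's worst-case prediction error
    """
    lo = max(pos - sigma, 0)
    hi = min(pos + sigma + 1, len(keys))
    i = bisect.bisect_left(keys, key, lo, hi)   # search within [lo, hi)
    return i if i < len(keys) and keys[i] == key else -1

keys = list(range(0, 1000, 5))                          # 200 sorted keys
print(model_biased_search(keys, 435, pos=85, sigma=4))  # -> 87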

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Monotonic Constraints
• It is often the case in a modeling problem or project that the functional form of an acceptable model is constrained in some way. This may happen due to business considerations, or because of the type of scientific question being investigated. In some cases, where there is a very strong prior belief that the true relationship has some quality, constraints can be used to improve the predictive performance of the model.
• A common type of constraint in this situation is that certain features bear a monotonic relationship to the predicted response:
• f(x1, x2, …, xi, …, xn−1, xn) ≤ f(x1, x2, …, xi′, …, xn−1, xn) whenever xi ≤ xi′ is an increasing constraint;
• f(x1, x2, …, xi, …, xn−1, xn) ≥ f(x1, x2, …, xi′, …, xn−1, xn) whenever xi ≤ xi′ is a decreasing constraint.
• Gradient-boosted tree libraries provide the ability to enforce monotonicity constraints on any features used in a boosted model (example below).
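For example, XGBoost exposes this through its monotone_constraints training parameter (per-feature flags: 1 increasing, -1 decreasing, 0 unconstrained); a minimal sketch on synthetic data:

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 2))
# The response increases in feature 0 and decreases in feature 1, plus noise.
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.1, size=1000)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "reg:squarederror",
    "max_depth": 3,
    "monotone_constraints": "(1,-1)",  # feature 0 increasing, feature 1 decreasing
}
model = xgb.train(params, dtrain, num_boost_round=50)

# Raising feature 0 can never lower the prediction now;
# raising feature 1 can never raise it.
test = xgb.DMatrix(np.array([[0.2, 0.5], [0.8, 0.5]]))
print(model.predict(test))  # second value >= first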