Ronak Kogta
September 06, 2019

# Case of Learned Indexed Structures -Paper review

## Transcript

2. ### Essential question we want to ask

Given a query Q and a dataset D, how can we perform the following operations?
• Perform a range query on the dataset
• Perform a point query on the dataset
• Check whether a point exists in the dataset
• Sort the dataset
• Compute the min/max element of the dataset
• Find the minimum distance between two elements, if the dataset is modeled as a graph
• …
3. ### Essential question answered, historically

• It started with The Art of Computer Programming
• Find an information-theoretic proof of the upper bound
• Convert the proof to an algorithm
• Describe the complexity of each operation
  • Insert
  • Delete
  • Read
  • …
4. ### Essential question answered, historically

• We came up with many data structures and algorithms to solve these problems effectively
  • Binary trees
  • Graph algorithms like Dijkstra or Floyd-Warshall
  • B+ trees
  • Hashmaps
  • Mergesort, quicksort, selection sort, bubble sort
  • Bloom filters
  • Heap
  • Stack
  • …
5. ### Core assumption we made

No assumption is made about the data on which these data structures and algorithms operate, and as a result we characterize them by
• Worst-case scenarios
• Best-case scenarios
• Average-case scenarios
6. ### Debunking assumption

For most workloads, the data is specific/tainted.

Data Structure == Models

• Workload → Write an algorithm → Characterize it
• Workload → Run query → Model it
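
To make "Data Structure == Models" concrete, here is a minimal sketch under an assumed toy workload (consecutive integer keys, not taken from the slides). When the data has this shape, the position of a key is a simple function of the key itself, so a constant-time "model" can stand in for a general-purpose search. The function names are illustrative only.

```python
from bisect import bisect_left

# Assumed toy workload: a sorted array of consecutive integer keys.
keys = list(range(1000, 2000))

def generic_lookup(keys, key):
    """Data-agnostic lookup: binary search, O(log n)."""
    i = bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else -1

def modeled_lookup(keys, key):
    """Data-aware lookup: the 'model' is position = key - first key, O(1)."""
    i = key - keys[0]
    return i if 0 <= i < len(keys) and keys[i] == key else -1
```

Both functions return the same positions on this dataset; only the second one exploits what is known about the data.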
7. ### Pattern learning at low cost

• It pays a lot to hand-tune an algorithm to a data distribution.
• What we need is to
  • Understand data distribution patterns
  • Handle them at low cost, without much human effort
• ML started with recognizing patterns in data
  • Linear regression model
  • Polynomial regression model
  • PCA (Principal Component Analysis), to detect feature vectors
  • SVM
8. ### ML opens up an opportunity

It lets us learn a model that reflects the patterns in the data, which helps us discover what we call Learned Indexes.
9. ### Paper proposes

• Many data structures can be decomposed into a learned index and an auxiliary structure
• Data distributions can be modeled by a CDF
• Developments in CPU/GPU/TPU hardware might let “otherwise expensive” ML models compute in less time and storage than traditional data structures
• Neural networks in particular can learn a wide variety of data distributions and mixtures
• The challenge is to balance the accuracy of the model against its complexity
10. ### Let’s talk about single range indexes

• Speed up searches for a subset of records, based on values in certain “search key” fields
• They are cache-friendly
• They support concurrency
• Allow key compression
• Have bounded cost for inserts and lookups
• Efficient for queries like 200 < price < 500
• Data needs to be sorted
• Index selection problem
11. ### B+ trees

Basic structure
• Leaves form a linked list
• Has degree d (fan-out factor)
• Each node/leaf holds >= d and <= 2d keys, except for the root

Operations
• Insert operation
  • Find the leaf where key k belongs and insert it
  • If there is no overflow, halt
  • If the leaf overflows, split it and insert the separator key into the parent
• Delete operation
  • After deleting the key, the tree may need to rebalance (rotate)
  • If rotation is not possible, nodes need to be merged
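
The insert steps above can be sketched for a single leaf. This is a simplified illustration, not a full B+ tree: the leaf is a plain sorted list, the degree `D = 2` is an assumed value, and propagating the separator into the parent (and any further splits) is left out.

```python
from bisect import insort

D = 2  # assumed degree: a leaf holds at most 2 * D keys

def insert_into_leaf(leaf, key):
    """Insert key into a sorted leaf and split the leaf if it overflows.

    Returns (left, separator, right). separator and right are None
    when no split was needed.
    """
    insort(leaf, key)                    # keep the leaf sorted
    if len(leaf) <= 2 * D:
        return leaf, None, None          # no overflow: halt
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    # In a B+ tree the first key of the right leaf is copied up to the parent.
    return left, right[0], right
```

After a split, both halves still hold at least `D` keys, which is what keeps the tree balanced.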
12. ### Range Index Models are CDF models

p = F(Key) ∗ N

where p is the position estimate, F(Key) is the estimated cumulative distribution function of the data, i.e. the likelihood of observing a key smaller than or equal to the lookup key, P(X ≤ Key), and N is the total number of keys.
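
A minimal sketch of this idea, assuming a single linear model fitted by least squares stands in for F(Key) ∗ N. The class and function names are hypothetical. The model predicts a position directly, and the smallest and largest prediction errors seen on the stored keys bound the window that a final binary search has to scan.

```python
from bisect import bisect_left

def fit_linear(keys):
    """Least-squares fit of position ~ slope * key + intercept."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(keys))
    slope = cov / var if var else 0.0
    return slope, mean_p - slope * mean_k

class LearnedIndex:
    def __init__(self, keys):
        self.keys = keys  # must be sorted and non-empty
        self.slope, self.intercept = fit_linear(keys)
        # Signed errors (true position - predicted position) over stored keys.
        errors = [p - self._predict(k) for p, k in enumerate(keys)]
        self.min_err, self.max_err = min(errors), max(errors)

    def _predict(self, key):
        """Approximates F(key) * N, clamped to a valid position."""
        p = int(self.slope * key + self.intercept)
        return max(0, min(len(self.keys) - 1, p))

    def lookup(self, key):
        n = len(self.keys)
        p = self._predict(key)
        # Search only inside the error window around the prediction.
        lo = min(max(0, p + self.min_err), n)
        hi = max(lo, min(n, p + self.max_err + 1))
        i = bisect_left(self.keys, key, lo, hi)
        return i if i < n and self.keys[i] == key else -1
```

The better the model matches the true CDF, the narrower the error window and the cheaper the final search.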
13. ### Naïve implementation

• 2-layer fully connected network, ReLU-activated, 32 neurons per layer
• 1250 predictions per second in TensorFlow
• 80000 ns to execute the model for one input, not counting the actual search
• Compared with 300 ns for a B-tree and 900 ns for binary search
• Reasons for this performance
  • TensorFlow is not optimized for small models
  • B-trees are good at overfitting, as they rebalance/merge themselves from time to time
  • Other models may predict the general position well, but may not be accurate at the level of individual data items
14. ### RM-Index

• Learning Index Framework
  • Generates index configurations, then optimizes and tests them with the help of TensorFlow
  • A kind of precomputation step that generates the index in C++
  • Has to take ML models, page sizes, and search strategies into consideration
• Recursive Model Index
  • One model does not fit all; a mixture of models can be used
  • Kind of like building subject matter experts
  • Iteratively build each layer with its own loss to build the complete model
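
A two-stage recursive model index can be sketched as follows. This is an illustration under simplifying assumptions, not the paper's implementation: both stages use plain linear models, the number of second-stage models (`num_leaves`) is an arbitrary choice, and all names are hypothetical. The root model routes each key to one second-stage model, which is trained only on the keys routed to it.

```python
from bisect import bisect_left

def fit_linear(points):
    """Least-squares line through (key, position) pairs."""
    n = len(points)
    if n == 0:
        return 0.0, 0.0
    mk = sum(k for k, _ in points) / n
    mp = sum(p for _, p in points) / n
    var = sum((k - mk) ** 2 for k, _ in points)
    cov = sum((k - mk) * (p - mp) for k, p in points)
    slope = cov / var if var else 0.0
    return slope, mp - slope * mk

class TwoStageRMI:
    def __init__(self, keys, num_leaves=4):
        self.keys, self.m = keys, num_leaves  # keys must be sorted, non-empty
        pairs = [(k, p) for p, k in enumerate(keys)]
        self.root = fit_linear(pairs)          # stage 1: trained on all keys
        buckets = [[] for _ in range(num_leaves)]
        for k, p in pairs:                     # route keys with the root model
            buckets[self._route(k)].append((k, p))
        self.leaves = []
        for b in buckets:                      # stage 2: one model per bucket
            slope, icpt = fit_linear(b)
            errs = [p - int(slope * k + icpt) for k, p in b] or [0]
            self.leaves.append((slope, icpt, min(errs), max(errs)))

    def _route(self, key):
        """Scale the root's position estimate to a second-stage model id."""
        slope, icpt = self.root
        j = int((slope * key + icpt) * self.m / len(self.keys))
        return max(0, min(self.m - 1, j))

    def lookup(self, key):
        slope, icpt, lo_err, hi_err = self.leaves[self._route(key)]
        p = int(slope * key + icpt)
        n = len(self.keys)
        lo = min(max(0, p + lo_err), n)
        hi = max(lo, min(n, p + hi_err + 1))
        i = bisect_left(self.keys, key, lo, hi)
        return i if i < n and self.keys[i] == key else -1
```

Each second-stage model only has to be accurate on its own slice of the key space, which is the "one model does not fit all" point in code form.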
15. ### RM-Index

• Hybrid index
  • At the top, use a model that can capture broad patterns
  • At lower levels, use depth-based models, or even B-trees or simpler regression models
  • Hybrid indexes allow us to bound the worst-case performance of learned indexes by the performance of B-trees
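
The hybrid decision can be sketched for a single lower-level segment. The assumptions here are loud ones: the routing stage is omitted, the error `threshold` is an arbitrary parameter, and a plain binary search over the segment stands in for the B-tree that would replace a poorly fitting model. Function names are hypothetical.

```python
from bisect import bisect_left

def build_leaf(keys, start, end, threshold):
    """Fit a linear model to keys[start:end]; fall back if it is too inaccurate."""
    seg = keys[start:end]                 # assumed non-empty
    n = len(seg)
    mk = sum(seg) / n
    mp = start + (n - 1) / 2
    var = sum((k - mk) ** 2 for k in seg)
    cov = sum((k - mk) * (start + i - mp) for i, k in enumerate(seg))
    slope = cov / var if var else 0.0
    icpt = mp - slope * mk
    errs = [start + i - int(slope * k + icpt) for i, k in enumerate(seg)]
    if max(abs(e) for e in errs) > threshold:
        # Stand-in for replacing the model with a B-tree over this segment.
        return ("fallback", start, end)
    return ("model", slope, icpt, min(errs), max(errs), start, end)

def leaf_lookup(keys, leaf, key):
    if leaf[0] == "fallback":
        _, lo, hi = leaf                  # search the whole segment
    else:
        _, slope, icpt, lo_err, hi_err, start, end = leaf
        p = int(slope * key + icpt)
        lo = min(max(start, p + lo_err), end)
        hi = max(lo, min(end, p + hi_err + 1))
    i = bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else -1
```

A segment whose keys are close to linear keeps its model, while a skewed segment degrades to the conventional search, so its cost is no worse than the fallback structure's.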
16. ### Search Strategies

• Model-biased search
• Biased quaternary search
  • pos − σ, pos, pos + σ
• Indexing strings
  • Not covered in this scope
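
Both strategies can be sketched over a sorted list, assuming a predicted position `pos` (and, for the second, an error estimate `sigma`) is supplied by some model. The function names are hypothetical, and the quaternary variant is simplified: only the first round probes pos − σ, pos, pos + σ, after which it finishes with an ordinary binary search.

```python
from bisect import bisect_left

def model_biased_search(keys, key, pos):
    """Binary search whose first probe is the predicted position."""
    lo, hi = 0, len(keys)
    mid = max(0, min(len(keys) - 1, pos)) if keys else 0
    while lo < hi:
        if keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
        mid = (lo + hi) // 2              # later probes are ordinary midpoints
    return lo if lo < len(keys) and keys[lo] == key else -1

def biased_quaternary_search(keys, key, pos, sigma):
    """First probe pos - sigma, pos, pos + sigma, then binary search the rest."""
    n = len(keys)
    lo, hi = 0, n
    for probe in (pos - sigma, pos, pos + sigma):
        if lo <= probe < hi:
            if keys[probe] < key:
                lo = probe + 1            # key lies to the right of the probe
            else:
                hi = probe                # key lies at or left of the probe
    i = bisect_left(keys, key, lo, hi)
    return i if i < n and keys[i] == key else -1
```

A bad prediction does not affect correctness in either function; it only makes the search take more steps.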
17. ### Monotonic Constraints

• It is often the case in a modeling problem or project that the functional form of an acceptable model is constrained in some way. This may happen due to business considerations, or because of the type of scientific question being investigated. In some cases, where there is a very strong prior belief that the true relationship has some quality, constraints can be used to improve the predictive performance of the model.
• A common type of constraint in this situation is that certain features bear a monotonic relationship to the predicted response:
  • $f(x_1, x_2, \ldots, x, \ldots, x_{n-1}, x_n) \le f(x_1, x_2, \ldots, x', \ldots, x_{n-1}, x_n)$ whenever $x \le x'$ is an increasing constraint; or
  • $f(x_1, x_2, \ldots, x, \ldots, x_{n-1}, x_n) \ge f(x_1, x_2, \ldots, x', \ldots, x_{n-1}, x_n)$ whenever $x \le x'$ is a decreasing constraint.
• Gradient boosting libraries such as XGBoost have the ability to enforce monotonicity constraints on any features used in a boosted model.