The Case for Learned Index Structures - Paper Review

The Case for Learned Index Structures Ronak Kogta

Essential question we want to ask
Given a query Q and a dataset D, how can we perform the following operations?
• Run a range query on the dataset
• Run a point query on the dataset
• Test whether a point exists in the dataset
• Sort the dataset
• Compute the minimum and maximum elements of the dataset
• Find the minimum distance between two elements, if the dataset is modeled as a graph
• …

Essential question answered, historically
• It started with The Art of Computer Programming
• Find an information-theoretic proof of an upper bound
• Convert the proof into an algorithm
• Describe the complexity of each operation
  • Insert
  • Delete
  • Read
  • …

Essential question answered, historically
• We came up with many data structures and algorithms to solve these problems effectively
  • Binary trees
  • Graph algorithms like Dijkstra or Floyd-Warshall
  • B+ trees
  • Hash maps
  • Merge sort, quicksort, selection sort, bubble sort
  • Bloom filters
  • Heap
  • Stack
  • …

Core assumption we made
These structures make no assumption about the data on which they operate, and as a result we characterize them by their
• Worst-case scenarios
• Best-case scenarios
• Average-case scenarios

Debunking the assumption
For most workloads, the data is specific/tainted.
Data structure == model
[Diagram labels: Workload, Write an algorithm, Characterize it; Workload, Run query, Model it]

Pattern learning at low cost
• It pays a lot to hand-tune an algorithm to a data distribution.
• What we need is the desire to
  • Understand data distribution patterns
  • Find a way to handle them cheaply, without much human effort
• ML started with recognizing patterns in data
  • Linear regression model
  • Polynomial regression model
  • PCA (Principal Component Analysis), to detect feature vectors
  • SVM

ML opens up an opportunity to learn a model that reflects the patterns in the data, which helps us discover what we call learned indexes.

Paper proposes
• Many data structures can be decomposed into a learned index and an auxiliary structure
• Data distributions can be modelled by a CDF
• Developments in CPU/GPU/TPU hardware might let "otherwise expensive" ML models run in less time and storage than traditional data structures
• NNs are especially able to learn a wide variety of data distributions and mixtures
• The challenge is to balance the model against its complexity

Let's talk about single range indexes
• Speed up searches for a subset of records, based on values in certain "search key" fields
• They are cache-friendly
• They support concurrency
• They allow key compression
• They have bounded cost for inserts and lookups
• Efficient for queries like 200 < price < 500
• Data needs to be sorted
• Index selection problem
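
As a baseline for the range query above, here is a minimal sketch that assumes the "index" is just a sorted Python list and uses binary search from the standard `bisect` module. The function name `range_query` and the sample prices are made up for illustration.

```python
from bisect import bisect_left, bisect_right

def range_query(sorted_prices, low, high):
    """Return all prices p with low < p < high (both bounds exclusive)."""
    start = bisect_right(sorted_prices, low)   # first index with value > low
    end = bisect_left(sorted_prices, high)     # first index with value >= high
    return sorted_prices[start:end]

prices = sorted([120, 200, 250, 310, 499, 500, 740])
result = range_query(prices, 200, 500)
print(result)
```

Because the data is sorted, the matching records form one contiguous slice, which is why a range index only needs to locate the two endpoints.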

B+ trees
Basic structure
• Leaves form a linked list
• Has degree d (fan-out factor)
• Each node/leaf holds >= d and <= 2d keys, except for the root
Operations
• Insert operation
  • Find the leaf where k belongs
  • If there is no overflow, halt
  • If the leaf overflowed, split the node and insert into the parent
• Delete operation
  • After deleting the key, the tree needs to be rebalanced (rotate)
  • If rotation is not possible, then nodes need to be merged
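
The leaf-level part of the insert operation can be sketched as follows. This is a simplified illustration, not a full B+ tree: the leaf is assumed to be a sorted Python list, the parent update is left out, and `insert_into_leaf` is a hypothetical helper name.

```python
from bisect import insort

def insert_into_leaf(leaf, key, d):
    """Insert key into a sorted leaf; split if it then holds more than 2*d keys.

    Returns (left, right, separator); right and separator are None without a split.
    """
    insort(leaf, key)                  # keep the leaf sorted
    if len(leaf) <= 2 * d:             # no overflow: halt
        return leaf, None, None
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    # In a B+ tree the separator key is copied up to the parent, not moved.
    return left, right, right[0]

leaf = [10, 20, 30, 40]                # a full leaf for d = 2
left, right, sep = insert_into_leaf(leaf, 25, d=2)
print(left, right, sep)
```

After the split both halves hold at least d keys, which preserves the occupancy rule stated above.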

Range index models are CDF models
p = F(Key) * N
where p is the position estimate, F(Key) is the estimated cumulative distribution function of the data, i.e. the likelihood P(X ≤ Key) of observing a key smaller than or equal to the lookup key, and N is the total number of keys.
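
To make p = F(Key) * N concrete, here is a minimal sketch that assumes distinct sorted integer keys and uses a single least-squares line as the model. The line predicts the position directly, which is the same as estimating F and scaling by N. The recorded maximum error bounds the final search window. All function names are illustrative, not taken from the paper.

```python
from bisect import bisect_left

def predict(slope, intercept, key):
    """Position estimate p for a key (the model's stand-in for F(key) * N)."""
    return round(slope * key + intercept)

def build(keys):
    """Fit position ~ slope * key + intercept and record the worst-case error."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    slope = cov / var if var else 0.0
    intercept = mean_p - slope * mean_k
    max_err = max(abs(predict(slope, intercept, k) - i) for i, k in enumerate(keys))
    return slope, intercept, max_err

def lookup(keys, model, key):
    """Search only inside [p - max_err, p + max_err] around the estimate p."""
    slope, intercept, max_err = model
    p = predict(slope, intercept, key)
    lo = min(max(p - max_err, 0), len(keys))
    hi = max(lo, min(p + max_err + 1, len(keys)))
    i = bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else None

keys = [3, 8, 15, 16, 23, 42, 57, 91]
model = build(keys)
print(lookup(keys, model, 23), lookup(keys, model, 24))
```

Every stored key is guaranteed to be found because its true position lies within `max_err` of the estimate by construction. A key that is not stored returns `None`.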

Naïve implementation
• 2-layer fully connected network, ReLU-activated, 32 neurons per layer
• 1250 predictions per second in TensorFlow
• 80000 ns to execute the model for one input, not counting the actual search
• Compared to 300 ns for a B-Tree and 900 ns for binary search
• Reasons for this performance
  • TensorFlow is not optimized for small models
  • B-Trees are good at overfitting, as they rebalance/merge themselves from time to time
  • Other models may predict the overall shape of the data, but may still not be accurate at the level of individual data items

RM-Index
• Learning Index Framework
  • Generates index configurations, then optimizes and tests them with the help of TensorFlow
  • A kind of precomputation step that generates the index in C++
  • Has to take ML models, page sizes, and search strategies into consideration
• Recursive Model Index
  • One model does not fit all; a mixture of models can be used
  • Kind of like building subject-matter experts
  • Iteratively build each layer with its loss to build the complete model
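
The recursive idea can be sketched with two stages. This is a toy illustration under loud assumptions: both stages use plain linear regression (the paper's framework can use neural nets at the top), keys are distinct and sorted, and the class name `TwoStageRMI` and its parameters are invented for this example.

```python
from bisect import bisect_left

def fit(points):
    """Least-squares line through (key, position) pairs; returns (slope, intercept)."""
    n = len(points)
    if n == 0:
        return 0.0, 0.0
    mk = sum(k for k, _ in points) / n
    mp = sum(p for _, p in points) / n
    var = sum((k - mk) ** 2 for k, _ in points)
    cov = sum((k - mk) * (p - mp) for k, p in points)
    slope = cov / var if var else 0.0
    return slope, mp - slope * mk

class TwoStageRMI:
    def __init__(self, keys, num_leaves=4):
        self.keys, self.n, self.m = keys, len(keys), num_leaves
        # Stage 1: one model over all keys, used only to pick a stage-2 model.
        self.root = fit(list(zip(keys, range(self.n))))
        buckets = [[] for _ in range(num_leaves)]
        for pos, key in enumerate(keys):
            buckets[self._route(key)].append((key, pos))
        # Stage 2: one model per bucket, each with its own worst-case error.
        self.leaves = []
        for bucket in buckets:
            slope, icpt = fit(bucket)
            err = max((abs(round(slope * k + icpt) - p) for k, p in bucket), default=0)
            self.leaves.append((slope, icpt, err))

    def _route(self, key):
        slope, icpt = self.root
        leaf = int((slope * key + icpt) * self.m / self.n)
        return min(max(leaf, 0), self.m - 1)

    def lookup(self, key):
        slope, icpt, err = self.leaves[self._route(key)]
        guess = round(slope * key + icpt)
        lo = min(max(guess - err, 0), self.n)
        hi = max(lo, min(guess + err + 1, self.n))
        i = bisect_left(self.keys, key, lo, hi)
        return i if i < self.n and self.keys[i] == key else None

keys = [x * x for x in range(1, 201)]      # deliberately non-linear key distribution
rmi = TwoStageRMI(keys, num_leaves=8)
print(rmi.lookup(keys[57]), rmi.lookup(2))
```

Each stage-2 model only has to fit the keys routed to it, so it sees a narrower, easier slice of the distribution than a single global model would.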

RM-Index
• Hybrid index
  • At the top, use a model that can capture broad patterns
  • At lower levels, use depth-based models, or even B-Trees or simpler regression models
  • Hybrid indexes allow us to bound the worst-case performance of learned indexes to the performance of B-Trees
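
The hybrid decision at a lower level can be sketched like this. It is an assumption-laden toy: the leaf model is a least-squares line, the fallback is a plain binary search over the leaf's range standing in for a B-Tree, and `build_leaf`, `leaf_lookup` and `threshold` are hypothetical names.

```python
from bisect import bisect_left

def build_leaf(keys, start, end, threshold):
    """Fit a line to keys[start:end]; fall back to plain search if it is too inaccurate."""
    seg = keys[start:end]
    n = len(seg)
    mk = sum(seg) / n
    mp = start + (n - 1) / 2
    var = sum((k - mk) ** 2 for k in seg)
    cov = sum((k - mk) * (start + i - mp) for i, k in enumerate(seg))
    slope = cov / var if var else 0.0
    icpt = mp - slope * mk
    err = max(abs(round(slope * k + icpt) - (start + i)) for i, k in enumerate(seg))
    if err > threshold:
        return ("search", start, end)       # stand-in for a B-Tree over this range
    return ("model", slope, icpt, err)

def leaf_lookup(keys, leaf, key):
    """Look a key up through either kind of leaf."""
    if leaf[0] == "search":
        _, lo, hi = leaf
    else:
        _, slope, icpt, err = leaf
        guess = round(slope * key + icpt)
        lo = min(max(guess - err, 0), len(keys))
        hi = max(lo, min(guess + err + 1, len(keys)))
    i = bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else None

# First 50 keys are evenly spaced; the last 20 grow exponentially.
keys = list(range(0, 100, 2)) + [100 + 2 ** i for i in range(20)]
dense = build_leaf(keys, 0, 50, threshold=2)
skewed = build_leaf(keys, 50, 70, threshold=2)
print(dense[0], skewed[0])
```

The threshold is what bounds the worst case: any leaf whose model error exceeds it is served by the conventional search instead.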

Search Strategies
• Model Biased Search
• Biased Quaternary Search
  • pos − σ, pos, pos + σ
• Indexing strings
  • Not covered in this scope
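
Both strategies can be sketched over a sorted list, assuming the model has already produced a predicted position `pos` (and, for the second one, an error estimate `sigma`). The quaternary variant below is a simplification: it only uses the three probe points for its first round and then finishes with ordinary binary search. Function names are illustrative.

```python
from bisect import bisect_left

def model_biased_search(keys, key, pos):
    """Binary search whose first probe is the predicted position, not the midpoint."""
    lo, hi = 0, len(keys)
    mid = min(max(pos, 0), len(keys) - 1) if keys else 0
    while lo < hi:
        if keys[mid] == key:
            return mid
        if keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
        mid = (lo + hi) // 2
    return None

def biased_quaternary_search(keys, key, pos, sigma):
    """First round probes pos - sigma, pos, pos + sigma; then plain binary search."""
    n = len(keys)
    lo, hi = 0, n
    for probe in (pos - sigma, pos, pos + sigma):
        if 0 <= probe < n:
            if keys[probe] < key:
                lo = max(lo, probe + 1)    # answer lies to the right of the probe
            else:
                hi = min(hi, probe)        # answer is at or to the left of the probe
    i = bisect_left(keys, key, lo, hi)
    return i if i < n and keys[i] == key else None

keys = list(range(0, 1000, 3))
print(model_biased_search(keys, 300, pos=95), biased_quaternary_search(keys, 300, pos=95, sigma=8))
```

When the prediction is close, the first round already shrinks the window to roughly `sigma` entries. When it is far off, both functions still degrade to a normal binary search.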

Monotonic Constraints
• It is often the case in a modeling problem or project that the functional form of an acceptable model is constrained in some way. This may happen due to business considerations, or because of the type of scientific question being investigated. In some cases, where there is a very strong prior belief that the true relationship has some quality, constraints can be used to improve the predictive performance of the model.
• A common type of constraint in this situation is that certain features bear a monotonic relationship to the predicted response:
  • $f(x_1, x_2, \ldots, x, \ldots, x_{n-1}, x_n) \le f(x_1, x_2, \ldots, x', \ldots, x_{n-1}, x_n)$ whenever $x \le x'$ is an increasing constraint; or
  • $f(x_1, x_2, \ldots, x, \ldots, x_{n-1}, x_n) \ge f(x_1, x_2, \ldots, x', \ldots, x_{n-1}, x_n)$ whenever $x \le x'$ is a decreasing constraint.
• Some boosting libraries (e.g. XGBoost) have the ability to enforce monotonicity constraints on any features used in a boosted model.
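
For a learned index the increasing constraint is the natural one, since a CDF never decreases as the key grows. The sketch below covers only the single-feature case: it checks a predictor for violations and repairs a list of predicted positions after the fact with a running maximum. This post-hoc repair is an assumption made for illustration; it is not how a constrained boosted model enforces monotonicity during training.

```python
from itertools import accumulate

def is_increasing(f, xs):
    """True if f(x) <= f(x') for every consecutive pair x <= x' in sorted xs."""
    ys = [f(x) for x in xs]
    return all(a <= b for a, b in zip(ys, ys[1:]))

def make_increasing(predictions):
    """Replace each prediction by the running maximum, which is non-decreasing."""
    return list(accumulate(predictions, max))

raw = [0, 2, 1, 3, 3, 2, 5]        # predicted positions for keys in sorted order
fixed = make_increasing(raw)
print(fixed)
```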
