Given a dataset D, how can we perform the following operations? (See the sketch after this list.)
• Performing a range query on the dataset
• Performing a point query on the dataset
• Checking whether a point exists in the dataset or not
• Sorting the dataset
• Computing the min/max element of the dataset
• Finding the minimum distance between two elements, if the dataset is modeled as a graph
• …
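A minimal sketch of the point/range/min-max operations over a sorted array, using Python's bisect module (the dataset here is made up for illustration; the shortest-path query would additionally need a graph structure):

```python
import bisect

D = sorted([17, 3, 42, 8, 23, 15])        # the dataset, kept sorted

# Point query / existence check: binary search.
i = bisect.bisect_left(D, 15)
exists = i < len(D) and D[i] == 15        # True

# Range query: all x with 8 <= x <= 23.
lo, hi = bisect.bisect_left(D, 8), bisect.bisect_right(D, 23)
in_range = D[lo:hi]                       # [8, 15, 17, 23]

# Min/max of a sorted dataset are its endpoints.
mn, mx = D[0], D[-1]                      # 3, 42
```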
It is expensive to hand-tune an algorithm according to a data distribution.
• What we need:
  • To understand data-distribution patterns
  • A way to handle them at low cost, without much human effort
• ML started with recognizing patterns in data:
  • Linear regression models
  • Polynomial regression models
  • PCA (Principal Component Analysis) – to detect feature vectors
  • SVMs
Learned indexes and auxiliary structures
• Data distributions can be modeled especially well by their CDF
• Developments in CPUs/GPUs/TPUs might make "otherwise expensive" ML models cheaper to compute (in time and storage) than traditional data structures
• NNs are especially able to learn a wide variety of data distributions and mixtures
• The challenge is to balance a model's accuracy against its complexity
Indexes provide fast access to a subset of records, based on values in certain "search key" fields.
• They are cache-friendly
• They support concurrency
• They allow key compression
• They have bounded cost for inserts and lookups
• They are efficient for range queries like 200 < price < 500
• The data needs to be sorted
• Index selection problem: choosing which indexes to build
A B-Tree has degree d (the fan-out factor).
• Each node/leaf holds >= d and <= 2d keys, except for the root.
Operations
• Insert (sketched below)
  • Find the leaf where key k belongs
  • If there is no overflow, halt
  • If the node overflows, split it and insert the middle key into the parent
• Delete
  • After deleting a key, the tree may need to be rebalanced (rotation)
  • If rotation is not possible, merge nodes
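A minimal sketch of B-Tree insertion with split-on-overflow, matching the bullets above (class names and details of the split policy are illustrative, not from any particular library):

```python
import bisect

DEGREE = 2   # the slide's d: every node except the root keeps d..2d keys

class Node:
    def __init__(self, leaf=True):
        self.keys, self.children, self.leaf = [], [], leaf

class BTree:
    def __init__(self):
        self.root = Node()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split:                              # root overflowed: grow a level
            mid, right = split
            new_root = Node(leaf=False)
            new_root.keys = [mid]
            new_root.children = [self.root, right]
            self.root = new_root

    def _insert(self, node, key):
        if node.leaf:
            bisect.insort(node.keys, key)      # find the slot for k in the leaf
        else:
            i = bisect.bisect_right(node.keys, key)
            split = self._insert(node.children[i], key)
            if split:                          # child overflowed: absorb its middle key
                mid, right = split
                node.keys.insert(i, mid)
                node.children.insert(i + 1, right)
        if len(node.keys) <= 2 * DEGREE:       # no overflow: halt
            return None
        # Overflow: split around the middle key and hand it up to the parent.
        right = Node(leaf=node.leaf)
        mid = node.keys[DEGREE]
        right.keys, node.keys = node.keys[DEGREE + 1:], node.keys[:DEGREE]
        if not node.leaf:
            right.children = node.children[DEGREE + 1:]
            node.children = node.children[:DEGREE + 1]
        return mid, right
```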
p = F(Key) × N, where p is the position estimate, F(Key) is the estimated cumulative distribution function for the data, i.e. the likelihood of observing a key smaller than or equal to the lookup key, P(X ≤ Key), and N is the total number of keys.
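A tiny worked example of this estimate on assumed uniform data (the hand-written linear F below stands in for a learned model):

```python
import numpy as np

keys = np.sort(np.random.uniform(0, 1000, size=10_000))
N = len(keys)

def F(key):
    # For uniform data on [0, 1000] the CDF is linear; a learned model
    # would approximate this function instead.
    return key / 1000.0

key = keys[7_500]
p = int(F(key) * N)                    # predicted position: p = F(Key) * N
true_pos = int(np.searchsorted(keys, key))
print(p, true_pos)                     # close; the gap is the model's error
```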
neuron functions per layer
• ~1250 predictions per second in TensorFlow
• 80,000 ns to predict the position of an input, not counting the actual search
• compared to 300 ns for a B-Tree and 900 ns for binary search
• Reasons for this performance:
  • TensorFlow is not optimized for small models
  • B-Trees are good at "overfitting" the data, as they rebalance/merge themselves from time to time
  • Other models may predict the general trend, but still may not be able to make guesses at the individual-record level
Builds models and tests them with the help of TensorFlow
• Does a kind of pre-computation and generates index code in C++
• Has to take ML models, page sizes, and search strategies into consideration
Recursive Model Index
• One model does not fit all; a mixture of models can be used
• Kind of like building "subject-matter experts" for regions of the data
• Iteratively build each layer against its loss to construct the complete model (see the sketch after the next set of bullets)
The top-level model can capture broad types of patterns
• At the lower levels, use depth-based models, or even B-Trees or simpler regression models
• Hybrid indexes allow us to bound the worst-case performance of a learned index by the performance of a B-Tree
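A minimal sketch of a two-stage recursive model index over synthetic data (the linear models, the routing rule, and all names are illustrative assumptions, not the paper's LIF implementation):

```python
import numpy as np

keys = np.sort(np.random.lognormal(size=100_000))
positions = np.arange(len(keys), dtype=float)
M = 100                                    # number of second-stage models

def fit_linear(x, y):
    # Least-squares line mapping keys to positions; constant for tiny slices.
    if len(x) < 2:
        c = float(y.mean()) if len(y) else 0.0
        return lambda k: c * np.ones_like(np.asarray(k, dtype=float))
    a, b = np.polyfit(x, y, 1)
    return lambda k: a * np.asarray(k, dtype=float) + b

# Stage 1: a root model over all keys routes each key to one expert.
root = fit_linear(keys, positions)
assigned = np.clip(root(keys) * M / len(keys), 0, M - 1).astype(int)

# Stage 2: one small model per slice of the key space.
experts = [fit_linear(keys[assigned == m], positions[assigned == m])
           for m in range(M)]

def predict(key):
    m = int(np.clip(float(root(key)) * M / len(keys), 0, M - 1))
    return int(experts[m](key))            # starting point for a local search
```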
It is often the case in a modeling problem or project that the functional form of an acceptable model is constrained in some way. This may happen due to business considerations, or because of the type of scientific question being investigated. In some cases, where there is a very strong prior belief that the true relationship has some quality, constraints can be used to improve the predictive performance of the model.
• A common type of constraint in this situation is that certain features bear a monotonic relationship to the predicted response:
  • f(x₁, x₂, …, x, …, xₙ₋₁, xₙ) ≤ f(x₁, x₂, …, x′, …, xₙ₋₁, xₙ) whenever x ≤ x′ is an increasing constraint
  • f(x₁, x₂, …, x, …, xₙ₋₁, xₙ) ≥ f(x₁, x₂, …, x′, …, xₙ₋₁, xₙ) whenever x ≤ x′ is a decreasing constraint
• Gradient-boosted tree libraries provide the ability to enforce monotonicity constraints on any features used in a boosted model (see the sketch below).
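A hedged sketch of such a constraint using XGBoost's monotone_constraints parameter, with one entry per feature: 1 = increasing, -1 = decreasing, 0 = unconstrained (the library choice and the synthetic data are assumptions; the slide does not name a specific implementation):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(5_000, 2))
# True response: increasing in feature 0, decreasing in feature 1, plus noise.
y = 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.1, size=5_000)

model = xgb.XGBRegressor(
    n_estimators=200,
    monotone_constraints=(1, -1),   # feature 0 increasing, feature 1 decreasing
)
model.fit(X, y)

# With feature 1 held fixed, predictions never decrease as feature 0 grows.
grid = np.column_stack([np.linspace(0, 1, 100), np.full(100, 0.5)])
preds = model.predict(grid)
print(bool(np.all(np.diff(preds) >= 0)))   # True
```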