The relational model is based on first-order predicate logic and on set theory. All data are represented as tuples, grouped into relations. The model is strictly bound to relational-algebra concepts, identifying relations with tables, attributes with fields, and tuples with records. SQL is the most widely used DML standard.
column-oriented ➤ values of the same column (same type, best compression) are stored in segments: contiguous memory areas ➤ variable-width fields are accessed via pointers ➤ columns are processed like arrays (parallelization/pipelining)
row-oriented ➤ records are serialized ➤ groups of records are written in pages, in order to reduce disk accesses ➤ variable-width fields are accessed via pointers ➤ pages are not completely filled, to leave room for later expensive operations (e.g. updates)
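For intuition, a minimal Python sketch (hypothetical data, not a real storage engine) contrasting the two layouts:

# row-oriented: whole records serialized one after another
records = [(1, "anna", 17.0), (2, "bob", 23.5), (3, "carl", 9.9)]

# column-oriented: each attribute stored contiguously as an array
import numpy
ids    = numpy.array([r[0] for r in records])   # fixed-width segment
prices = numpy.array([r[2] for r in records])   # fixed-width segment
names  = [r[1] for r in records]                # variable-width, reached via pointers

# columnar processing operates on whole arrays (vectorization/pipelining)
total = prices.sum()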
➤ querying a column store is a cheap operation: it does not require index creation and management ➤ fast processing in CRM and DWH environments, where tables usually grow "in width" ➤ efficient compression algorithms, working on data of the same type (the CPU cost of decompressing data is lower than the disk-access cost)
Column stores are well suited to newer computer architectures: the gap between CPU cycle time and memory access time keeps widening. Better compression algorithms mean: t(transfer compressed + decompress in cache) < t(direct transfer of uncompressed data). The big-data world is shifting towards columnar databases. Bcolz: columnar, compressed, chunked data containers built on the Blosc library.
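A hedged illustration of the Bcolz API (container and parameter names as in its public docs; the dataset is made up):

import numpy
import bcolz

a = numpy.arange(10000000)
# carray: a chunked container, compressed with Blosc
ca = bcolz.carray(a, cparams=bcolz.cparams(clevel=5))
print(ca.nbytes, ca.cbytes)  # uncompressed vs compressed size
print(ca.sum())              # computed chunk by chunk, decompressing in cache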
Two groups of column stores differ in how they map the relational data model (http://dbmsmusings.blogspot.it/2010/03/distinguishing-two-major-types-of_29.html):

                      Group A                             Group B
column independence   groups of columns stored together   every column stored individually
language              NoSQL                               SQL
workload              few reads, more upserts             more reads, few upserts
storage               sparse column-store                 dense column-store (positional)

Group A: Google Bigtable, Apache HBase, Hypertable, Apache Cassandra
Group B: SAP IQ, HP Vertica, Actian Vector, MonetDB, Infobright
MonetDB is developed at the Centrum Wiskunde & Informatica (CWI) in Amsterdam. The MonetDB team has been pioneering column-store technologies since 1993; the first official open-source release of MonetDB was in 2004. MonetDB's purpose is to deliver high-performance throughput in different applications: data mining, BI, OLAP, scientific databases, text/BLOB retrieval, GIS, etc.
➤ every column is stored in a binary, two-column table called a BAT (Binary Association Table), made of a head and a tail ➤ every table is internally represented as a collection of BATs ➤ attributes of the same tuple share the same key across BATs: values are stored as C arrays for fixed-width data and via dictionaries for variable-width data.
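A conceptual Python sketch of the decomposition (illustrative only, not MonetDB's actual internals):

# a relational table...
table = {"id": [10, 20, 30], "name": ["anna", "bob", "carl"]}

# ...becomes one BAT per column: head = object id (oid), tail = value;
# attributes of the same tuple share the same oid across BATs
bat_id   = [(oid, v) for oid, v in enumerate(table["id"])]    # fixed-width: C array
bat_name = [(oid, v) for oid, v in enumerate(table["name"])]  # variable-width: dictionary

# reconstructing the tuple with oid = 1 is a positional lookup in each BAT
tuple_1 = (bat_id[1][1], bat_name[1][1])  # (20, 'bob')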
Queries are translated into MAL (MonetDB Assembly Language) ➤ complex operations are decomposed into sequences of simpler MAL relational-algebra operators, each of which can be executed on a whole column.
Like many other column-oriented DBs, MonetDB is integrated with different programming languages:
➤ Python client API (MAPI)
➤ Embedded Python/NumPy
➤ MonetDB.R library / R integration
➤ MonetDB's Ruby API
➤ MonetDB driver for Node.js
➤ MonetDB Perl library
➤ MonetDB PHP library
https://www.monetdb.org/Documentation/Manuals/SQLreference/Programming
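For example, connecting from Python through the client API (pymonetdb; connection parameters are illustrative):

import pymonetdb

conn = pymonetdb.connect(username="monetdb", password="monetdb",
                         hostname="localhost", database="demo")
cur = conn.cursor()
cur.execute("SELECT i FROM integers")
print(cur.fetchall())  # every result row travels over the socket
conn.close()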
➤ Client APIs exchange data with the server over sockets, making the analysis of high-dimensional databases inefficient. ➤ Embedded Python/NumPy in MonetDB was introduced to incorporate Python UDFs directly, as if they were plain SQL functions ➤ efficient: it uses NumPy arrays, which are essentially Python wrappers around C arrays ➤ fast: data transfer from MonetDB to Python requires no data copy ➤ parallel execution: mappable Python functions can be executed in parallel directly from SQL queries ➤ flexible: any Python module can be used.
MonetDB speeds up execution by splitting columns and running queries in parallel on different threads. ➤ Not all queries can be executed in parallel: some require access to the full column. These are called blocking operations, and they are the default for embedded Python/NumPy UDFs.
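A sketch of a blocking UDF (function and table names are illustrative): a median cannot be computed on column fragments, so it stays plain LANGUAGE PYTHON:

CREATE FUNCTION python_median(i INTEGER) RETURNS DOUBLE
LANGUAGE PYTHON { return numpy.median(i) };

SELECT python_median(i) FROM integers;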
If a UDF is mappable, we can define it in MonetDB as PYTHON_MAP and obtain parallel execution. Performance is markedly better (~3x) than calling the same function from a Python client.
INSERT INTO integers VALUES (1), (2), (3), (4), (5);

DROP FUNCTION python_times_two;
CREATE FUNCTION python_times_two(i INTEGER) RETURNS INTEGER
LANGUAGE PYTHON_MAP { return i * 2 };

SELECT i, python_times_two(i) AS two_times_i FROM integers;

 i | two_times_i
---+------------
 1 |  2
 2 |  4
 3 |  6
 4 |  8
 5 | 10
Embedded Python UDFs can be used to perform complex operations on datasets/tables. NumPy and scikit-learn allow machine-learning techniques to be applied to the data. The best performance is obtained using embedded Python/NumPy UDFs in MonetDB.
The company, operating exclusively in B2B through the normal-trade channel, plans promotions in favor of final customers, involving retailers. There are two main goals related to the promotions: ➤ the company is interested in estimating the quantity of gadgets to buy in order to satisfy demand ➤ the agents want to know whether a retailer is interested in being involved in the promotion. We are facing a binary classification problem: the retailer participates / does not participate in a promotion.
➤ The source sales data counts about 120M rows over a 2-year period. ➤ We created and enriched a CRM table with 43,000 records/retailers and about 150 features, on which we observe the target variable: taking part in the promotion. ➤ The features are heterogeneous and relate to: retailer category, economic behaviour (with respect to brand and product line), historical sales volumes, credit exposure, localization, attendance at previous promotions, RFM, etc.
We used the NumPy and scikit-learn libraries embedded in MonetDB to:
A. preprocess the original data
B. split the data into training/validation sets
C. train different ML models
D. obtain predicted values from the validation-set features
E. evaluate the quality of the obtained classifications,
in order to predict whether customers will take part in future promotions.
CREATE FUNCTION data_preprocess(target INT, feature1 DOUBLE, feature2 STRING, feature3 STRING, …, featureN DOUBLE)
RETURNS TABLE(target INT, feature1 DOUBLE, feature2 INT, feature3 INT, …, featureN DOUBLE)
LANGUAGE PYTHON {
    from sklearn import preprocessing
    result_columns = dict()
    # loop over all the columns
    for key in _columns.keys():
        if _column_types[key] == 'STRING':
            # if the column is a string, we transform it
            le = preprocessing.LabelEncoder()
            # fit the label encoder on the data
            le.fit(_columns[key])
            # apply the label encoder and store the result
            result_columns[key] = le.transform(_columns[key])
        else:
            # if the column is not a string, we don't need to do anything
            result_columns[key] = _columns[key]
    return result_columns
};
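The slides do not show how this UDF is invoked; presumably something like the following, where both the function name data_preprocess (the slide is truncated) and the source table name promodata are assumptions:

CREATE TABLE data_preproc AS
SELECT * FROM data_preprocess(
  (SELECT target, feature1, feature2, …, featureN FROM promodata)
) WITH DATA;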
CREATE FUNCTION data_split(id INT) RETURNS TABLE(id INT, train BOOLEAN)
LANGUAGE PYTHON {
    count = len(id)
    # generate the indices
    indices = numpy.arange(count)
    # shuffle the indices
    numpy.random.shuffle(indices)
    # assign 70% of the values to the train set
    train_indices = indices[:int(count * 0.70)]
    # create a boolean array (True = train set)
    train_set = numpy.zeros(count, dtype=numpy.bool)
    train_set[train_indices] = True
    return [id, train_set]
};

CREATE TABLE train_set AS
SELECT * FROM data_split((SELECT id FROM data_preproc)) WITH DATA;
CREATE FUNCTION my_model_train(target INT, feature1 DOUBLE, feature2 INT, feature3 INT, …, featureN DOUBLE)
RETURNS TABLE(cl_name STRING, cl_obj STRING)
LANGUAGE PYTHON {
    import cPickle
    count = len(target)
    tgtvar = target
    # import classifier library
    from sklearn.ensemble import RandomForestClassifier
    clf = RandomForestClassifier(n_estimators=40)
    # exclude the response variable from the feature set
    del _columns['target']
    # construct a 2D array from the features: one column per feature,
    # then transpose so each row holds one observation
    data_array = numpy.array([])
    for x in _columns.values():
        data_array = numpy.concatenate((data_array, x))
    data_array.shape = (len(_columns.keys()), count)
    data_array = data_array.transpose()
    # train the classifier
    clf.fit(data_array, tgtvar)
    # export the pickled classifier to the database
    return dict(cl_name="Random Forest Classifier", cl_obj=cPickle.dumps(clf))
};
CREATE TABLE Classifiers AS
SELECT * FROM my_model_train(
  (SELECT target, feature1, …, featureN
   FROM promodata_preproc a
   INNER JOIN train_set b ON a.id = b.id
   WHERE b.train = true)
) WITH DATA;
CREATE FUNCTION my_model_predict(id INT, target INT, feature1 DOUBLE, feature2 INT, feature3 INT, …, featureN DOUBLE)
RETURNS TABLE(id INT, prediction STRING)
LANGUAGE PYTHON {
    import cPickle
    # don't use id (or the known target) for prediction
    del _columns['id']
    del _columns['target']
    # load the classifier using a loopback query:
    # first load the pickled object from the database
    res = _conn.execute("""SELECT cl_obj FROM Classifiers
                           WHERE cl_name='Random Forest Classifier'""")
    # unpickle the string to recreate the classifier
    classifier = cPickle.loads(res['cl_obj'][0])
    # create a 2D array of the features: one column per feature,
    # then transpose so each row holds one observation
    data_array = numpy.array([])
    for x in _columns.values():
        data_array = numpy.concatenate((data_array, x))
    data_array.shape = (len(_columns.keys()), len(id))
    data_array = data_array.transpose()
    # perform the actual classification
    result = dict()
    result['prediction'] = classifier.predict(data_array)
    result['id'] = id
    return result
};
CREATE TABLE predicted AS
SELECT * FROM my_model_predict(
  (SELECT target, feature1, …, featureN, a.id
   FROM data_preproc a
   INNER JOIN train_set b ON a.id = b.id
   WHERE b.train = false)
) WITH DATA;
SELECT * FROM scikit_conf_matrix(
  (SELECT a.target*1.00 AS y_true, b.prediction*1.00 AS y_pred
   FROM promodata_preproc a
   INNER JOIN predicted b ON a.id = b.id)
);
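The definition of scikit_conf_matrix is not shown in the slides; a minimal sketch, assuming it wraps sklearn.metrics.confusion_matrix for the binary target:

CREATE FUNCTION scikit_conf_matrix(y_true DOUBLE, y_pred DOUBLE)
RETURNS TABLE(tn INT, fp INT, fn INT, tp INT)
LANGUAGE PYTHON {
    from sklearn.metrics import confusion_matrix
    # for a binary problem, ravel() yields (tn, fp, fn, tp)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return dict(tn=tn, fp=fp, fn=fn, tp=tp)
};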