In-Database analytics with Python and MonetDB

Embedded Python/NumPy/Scikit-learn functions in MonetDB to perform machine-learning tasks on a real business case.

Gianluca Emireni

April 17, 2016
Transcript

  1. IN-DATABASE ANALYTICS WITH PYTHON AND MONETDB
    PyCon Sette, Firenze, 17th April 2016
    Gianluca Emireni - [email protected]
    Advanced & Predictive Analytics senior consultant @Horsa
  2. RDBMS
    The relational model for database management (E. F. Codd) is based on first-order predicate logic and on set theory. All data are represented in terms of tuples, grouped into relations. The model is strictly bound to relational algebra concepts, identifying relations with tables, attributes with fields and tuples with records. SQL is the most widely used DML standard.
  3. ROW AND COLUMN-ORIENTED RDBMS
    column-oriented
    ➤ serializes columns/groups of columns
    ➤ column values (of the same type, hence the best compression) are stored in segments - contiguous memory areas
    ➤ variable-width fields are accessed by pointers
    ➤ columns are processed like arrays (parallelization/pipelining) - see the sketch after this list
    row-oriented
    ➤ serializes records
    ➤ groups of records are written in pages, in order to reduce disk accesses
    ➤ variable-width fields are accessed by pointers
    ➤ pages are not fully filled, leaving room for otherwise expensive later operations (e.g. updates)
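    A toy illustration of the two layouts (a sketch in plain Python, not MonetDB's actual storage code; the table contents are made up): the same table stored as a list of records versus a dict of typed arrays. Aggregating one attribute in the columnar layout only touches contiguous, same-typed memory.

    import numpy

    # row-oriented: one tuple per record, whole records serialized together
    rows = [(1, "alice", 9.5), (2, "bob", 7.2), (3, "carol", 8.1)]

    # column-oriented: one contiguous, same-typed array per attribute
    columns = {
        "id":    numpy.array([1, 2, 3], dtype=numpy.int32),
        "name":  numpy.array(["alice", "bob", "carol"], dtype=object),
        "score": numpy.array([9.5, 7.2, 8.1], dtype=numpy.float64),
    }

    # aggregating one attribute: the row layout walks every record,
    # the column layout scans a single contiguous array (vectorized)
    total_rows = sum(r[2] for r in rows)
    total_cols = columns["score"].sum()
    assert total_rows == total_cols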
  4. AN OPEN CHALLENGE
    column-oriented (OLAP) wins: aggregations over lots of rows touching a reduced column subset; adding a column to a table
    row-oriented (OLTP) wins: access to all/many columns of a single row; updates/inserts
  5. ROW-ORIENTED WINS
    ➤ OLTP-like scenarios: interactive transactions with lots of upserts/deletes
    ➤ full-record/multi-column fetching
  6. COLUMN-ORIENTED WINS
    ➤ OLAP-like scenarios: fetching segments' metadata is a cheap operation
    ➤ does not require index creation and management
    ➤ fast processing in CRM and DWH environments, where tables usually grow "in width"
    ➤ efficient compression algorithms, working on data of the same type (the CPU cost of uncompressing data is lower than the disk access cost)
  7. BIG DATA AND COLUMNAR DATABASES
    https://speakerdeck.com/francescalted/francesc-alted-1
    The columnar approach adapts better to newer computer architectures: the gap between CPU cycle time and memory access time is getting wider. Better compression algorithms mean: t(transmission and decompression in cache) < t(direct transfer). The big data world is shifting towards columnar databases. Bcolz: columnar, compressed chunked data containers using the Blosc library.
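    A quick way to see the compression argument in practice (a sketch assuming the python-blosc package is installed; the array contents are made up): same-typed columnar data with repeated patterns compresses very well, so shipping compressed blocks and decompressing in cache can beat moving raw bytes.

    import blosc
    import numpy

    # a made-up column: a million same-typed values with lots of repetition,
    # the typical shape of columnar data
    col = numpy.arange(1000000, dtype=numpy.int64) % 1000

    # compress with Blosc; shuffle works well precisely because all values
    # in the column share one type and byte width
    raw = col.tobytes()
    packed = blosc.compress(raw, typesize=8)
    print(len(raw) / float(len(packed)))  # compression ratio

    # round-trip back to a NumPy array
    restored = numpy.frombuffer(blosc.decompress(packed), dtype=numpy.int64)
    assert (restored == col).all()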
  8. TWO GROUPS OF COLUMNAR DB
    http://dbmsmusings.blogspot.it/2010/03/distinguishing-two-major-types-of_29.html

                         Group A                             Group B
    data model           multi-dimensional mapping           relational data model
    column independence  groups of columns stored together   every column stored individually
    language             NoSQL                               SQL
    workload             few reads, more upserts             more reads, few upserts
    storage              sparse column-store                 dense column-store (positional)

    Group A: Google Bigtable, Apache HBase, Hypertable, Apache Cassandra
    Group B: SAP IQ, HP Vertica, Actian Vector, MonetDB, Infobright
  9. MONETDB
    http://www.monetdb.org/
    MonetDB is a column-based RDBMS developed at Centrum Wiskunde & Informatica (CWI) in Amsterdam. The MonetDB team has been pioneering column-store technologies since 1993; the first official open-source release of MonetDB was in 2004. MonetDB's purpose is to deliver high-performance throughput in different applications: data mining, BI, OLAP, scientific databases, text/BLOB retrieval, GIS, etc.
  10. ARCHITECTURE
    http://www.nesc.ac.uk/talks/1110/08Kersten.pdf
    MonetDB adopts full vertical fragmentation (sketched below):
    ➤ every field is stored in a binary/2-column table called BAT (Binary Association Table), made of a head and a tail
    ➤ every table is internally represented as a collection of BATs
    ➤ every BAT stores attributes of the same tuple under the same key: a C array for fixed-width data, dictionaries for variable-width data.
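    A conceptual sketch of vertical fragmentation (plain Python, not MonetDB internals; the table and values are made up): each attribute becomes its own head/tail pairing, where the head holds the tuple identifier and the tail the attribute value.

    import numpy

    # hypothetical table customers(name, balance) decomposed into BATs;
    # the head holds the tuple identifiers (oids) shared by all BATs
    oids = numpy.arange(3)

    bat_name = {"head": oids, "tail": numpy.array(["anna", "marco", "sara"], dtype=object)}
    bat_balance = {"head": oids, "tail": numpy.array([120.0, 75.5, 310.2])}

    # reconstructing one tuple means probing every BAT at the same oid
    row = (bat_name["tail"][1], bat_balance["tail"][1])
    print(row)  # ('marco', 75.5)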
  11. EXECUTION MODEL
    http://www.slideshare.net/nikhilpatteri/monetdb-columnstore-approach-in-database
    ➤ the MonetDB kernel executes MAL (MonetDB Assembly Language) programs
    ➤ complex operations are decomposed into simpler sequences of MAL relational algebra operators that can be executed on whole columns.
  12. SYSTEM LAYERS
    ➤ Front end:
      ➤ user-based data model mapped onto BATs
      ➤ SQL queries translated into MAL
    ➤ Back end:
      ➤ MAL optimizer framework: binary relational algebra
    ➤ Kernel:
      ➤ BATs and data structures
  13. INTEGRATION
    https://www.monetdb.org/Documentation/Manuals/SQLreference/Programming
    Like many other column-oriented DBs, MonetDB is integrated with different programming languages; a client-side sketch follows.
    MonetDB clients in other languages: MonetDB Python client API (MAPI), MonetDB.R library, MonetDB's Ruby API, MonetDB driver for Node.js, MonetDB Perl library, MonetDB PHP library
    Languages embedded in MonetDB: Embedded Python/NumPy, R-Integration
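    A minimal client-side sketch for the MAPI route (assuming the pymonetdb package, the Python MAPI client; the connection parameters are made up, and the integers table from slide 20 is reused):

    import pymonetdb

    # hypothetical connection parameters for a local demo database
    conn = pymonetdb.connect(username="monetdb", password="monetdb",
                             hostname="localhost", database="demo")
    cur = conn.cursor()

    # every result row crosses the client/server socket, which is exactly
    # what the embedded approach on the next slide avoids
    cur.execute("SELECT i FROM integers")
    print(cur.fetchall())

    cur.close()
    conn.close()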
  14. EMBEDDED PYTHON/NUMPY VS MAPI
    ➤ The MAPI client is based on sockets, making the analysis of high-dimensional databases inefficient.
    ➤ Embedded Python/NumPy in MonetDB has been introduced to directly incorporate Python UDFs as if they were plain SQL.
    ➤ efficient: uses NumPy arrays, which are basically Python wrappers around C arrays
    ➤ fast: data transfer from MonetDB to Python does not require a data copy
    ➤ parallel execution: mappable Python functions can be executed in parallel, directly from SQL queries
    ➤ flexible: any Python module can be used.
  15. DATA TYPES
    MonetDB data types are directly mapped onto NumPy data types (a quick check follows the table):

    MonetDB data type    NumPy data type
    BOOLEAN              numpy.int8
    TINYINT              numpy.int8
    SMALLINT             numpy.int16
    INTEGER              numpy.int32
    BIGINT               numpy.int64
    REAL                 numpy.float32
    FLOAT                numpy.float64
    HUGEINT              numpy.float64
    STRING               numpy.object
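    One way to check the mapping from SQL (a sketch in the style of the later examples; py_dtype is a made-up function name, and the integers table from slide 20 is reused):

    CREATE FUNCTION py_dtype(i INTEGER) RETURNS STRING LANGUAGE PYTHON {
        # numpy is preloaded in MonetDB/Python; the whole column arrives
        # as one NumPy array, so report its dtype once per input row
        return numpy.array([str(i.dtype)] * len(i), dtype=numpy.object)
    };
    SELECT py_dtype(i) FROM integers LIMIT 1; -- expected: int32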
  16. PARALLEL EXECUTION
    ➤ mitosis: MonetDB can speed up query execution by splitting columns and executing queries in parallel, on different threads.
    ➤ Not all queries can be executed in parallel; some require access to the full column. These are called blocking operations, the default for embedded Python/NumPy UDFs.
  17. MAPPABLE UDF
    If the UDF can be mapped, you can define it in MonetDB as PYTHON_MAP and get parallel execution. Performance is considerably better (~3x) than the same function defined as a blocking PYTHON UDF; see the sketch below.
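    Side by side, the only difference is the language keyword (a sketch reusing the integers table from slide 20; double_blocking/double_mapped are made-up names):

    CREATE FUNCTION double_blocking(i INTEGER) RETURNS INTEGER
    LANGUAGE PYTHON { return i * 2 };      -- sees the whole column at once

    CREATE FUNCTION double_mapped(i INTEGER) RETURNS INTEGER
    LANGUAGE PYTHON_MAP { return i * 2 };  -- may run on column chunks in parallel

    SELECT double_mapped(i) FROM integers;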
  18. UDFS CAN…
    ➤ return a single value
    ➤ return TABLE type objects
    ➤ define aggregations
  19. +

  20. 1. MY FIRST UDF
    CREATE TABLE integers(i INTEGER);
    INSERT INTO integers VALUES (1), (2), (3), (4), (5);
    DROP FUNCTION python_times_two;
    CREATE FUNCTION python_times_two(i INTEGER) RETURNS INTEGER
    LANGUAGE PYTHON_MAP { return i * 2 };
    SELECT i, python_times_two(i) AS two_times_i FROM integers;

    i   two_times_i
    1   2
    2   4
    3   6
    4   8
    5   10
  21. 2. CREATING TABLES FROM A UDF
    CREATE FUNCTION random_integers(low INTEGER, high INTEGER, amount INTEGER)
    RETURNS TABLE(value1 INTEGER, value2 INTEGER)
    LANGUAGE PYTHON {
        res = dict()
        res['value1'] = numpy.random.randint(low, high, size=(amount,))
        res['value2'] = numpy.random.randint(low, high, size=(amount,))
        return res
    };
    DROP TABLE randint;
    CREATE TABLE randint AS
    SELECT * FROM random_integers(0, 10, 100000000) WITH DATA;
    SELECT * FROM randint LIMIT 10;

    value1   value2
    3        7
    4        8
    6        2
    0        8
    3        0
    …        …
  22. 2. BUILT-IN MONETDB FUNCTION VS EMBEDDED UDF
    CREATE FUNCTION python_median(i INTEGER) RETURNS INTEGER
    LANGUAGE PYTHON { return numpy.median(i) };

    -- built-in
    SELECT MEDIAN(value1) FROM randint;        -- SQL query time: 4.099 s

    -- embedded NumPy
    SELECT python_median(value1) FROM randint; -- SQL query time: 1.185 s
  23. 3. CREATING AGGREGATIONS FROM A UDF
    DROP TABLE sales;
    CREATE TABLE sales(employee_name STRING, division STRING, category STRING, amount DECIMAL);
    INSERT INTO sales VALUES
      ('Loredana Ferrara', 'SAP', 'cross-selling', 994.93),
      ('Vitalba Ferrari', 'Analytics', 'presales', 2467.74),
      ('Mariano Bruno', 'e-commerce', 'sales', 1293.78),
      …;
    DROP AGGREGATE python_median_aggr;
    CREATE AGGREGATE python_median_aggr(val DECIMAL) RETURNS DECIMAL
    LANGUAGE PYTHON {
        # aggr_group holds the group id of every input row;
        # compute the median of val within each group
        unique = numpy.unique(aggr_group)
        x = numpy.zeros(shape=(unique.size))
        for i in range(0, unique.size):
            x[i] = numpy.median(val[aggr_group == unique[i]])
        return x
    };
  24. 3. RESULTS
    SELECT division, COUNT(*) AS salesmen, python_median_aggr(amount) AS median_amount
    FROM sales GROUP BY division;

    division     salesmen   median_amount
    SAP          14         5.408
    Analytics    12         4.573
    e-commerce   15         5.077
    Microsoft    9          8.012
    ERP          9          6.158
    SAS          10         4.989
    BI           19         5.018
    …            …          …

    SELECT division, category, COUNT(*) AS salesmen, python_median_aggr(amount) AS median_amount
    FROM sales GROUP BY division, category;

    division     category            salesmen   median_amount
    SAP          cross-selling       2          4.366
    Analytics    presales            3          2.467
    e-commerce   sales               4          1.526
    SAP          sales               3          6.161
    SAP          conditional sales   2          7.327
    e-commerce   up-selling          3          3.895
    Microsoft    negotiation         2          8.125
    …            …                   …          …
  25. ML ON MONETDB
    Defining UDFs in Python makes it possible to perform complex operations on datasets/tables. NumPy and Scikit-learn allow applying machine learning techniques to the data. The best performance is obtained using embedded Python/NumPy UDFs in MonetDB.
  26. THE CONTEXT
    Our customer is a company in the food sector operating exclusively in the B2B, normal-trade channel. It plans promotions aimed at final customers, involving retailers. There are two main goals related to the promotions:
    ➤ the company is interested in estimating the quantity of gadgets to buy in order to satisfy the demand
    ➤ the agents want to know whether a retailer is interested in being involved in the promotion.
    We are facing a binary classification problem: the retailer participates/doesn't participate in a promotion.
  27. THE DATA
    ➤ Our data come from the transactions table, which holds about 120M rows over a 2-year period.
    ➤ We created and enriched a CRM table with 43,000 records/retailers and about 150 features, on which we observe the target variable: taking part in the promotion.
    ➤ The features are heterogeneous and relate to: retailer category, economic behaviour (with respect to brand and product line), historical sales volumes, credit exposures, localization, attendance at previous promotions, RFM, etc.
  28. THE APPROACH
    We are going to use the NumPy, SciPy and Scikit-learn libraries embedded in MonetDB to:
    A. preprocess the original data
    B. split the data into training/validation sets
    C. train different ML models
    D. obtain predicted values based on the validation set features
    E. evaluate the quality of the obtained classifications,
    in order to predict whether customers will take part in future promotions.
  29. A. PREPROCESSING (LABEL ENCODING)
    CREATE FUNCTION data_preprocess(target STRING, feature1 DOUBLE, feature2 STRING, feature3 STRING, …, featureN DOUBLE)
    RETURNS TABLE(target INT, feature1 DOUBLE, feature2 INT, feature3 INT, …, featureN DOUBLE)
    LANGUAGE PYTHON {
        from sklearn import preprocessing
        result_columns = dict()
        # loop over all the columns
        for key in _columns.keys():
            if _column_types[key] == 'STRING':
                # if the column is a string, we transform it
                le = preprocessing.LabelEncoder()
                # fit the labelencoder on the data
                le.fit(_columns[key])
                # apply the labelencoder and store the result
                result_columns[key] = le.transform(_columns[key])
            else:
                # if the column is not a string, we don't need to do anything
                result_columns[key] = _columns[key]
        return result_columns
    };
  30. WRITING THE RESULT IN A TABLE
    CREATE TABLE data_preproc AS
    SELECT * FROM data_preprocess((SELECT * FROM mydata)) WITH DATA;
    ALTER TABLE data_preproc ADD id INTEGER NOT NULL AUTO_INCREMENT;
  31. B. SPLITTING TRAIN/VALIDATION SET
    CREATE FUNCTION data_split(id INT)
    RETURNS TABLE(id INT, train BOOLEAN)
    LANGUAGE PYTHON {
        count = len(id)
        # generate the indices
        indices = numpy.arange(count)
        # shuffle the indices
        numpy.random.shuffle(indices)
        # assign 70% of the values to the train set
        train_indices = indices[:int(count * 0.70)]
        # create a boolean array (true = train set)
        train_set = numpy.zeros(count, dtype=numpy.bool)
        train_set[train_indices] = True
        return [id, train_set]
    };
    CREATE TABLE train_set AS
    SELECT * FROM data_split((SELECT id FROM data_preproc)) WITH DATA;
  32. C. MODEL TRAINING
    CREATE FUNCTION my_model_train(target INT, feature1 DOUBLE, feature2 INT, feature3 INT, …, featureN DOUBLE)
    RETURNS TABLE(cl_name STRING, cl_obj STRING)
    LANGUAGE PYTHON {
        import cPickle
        tgtvar = target
        # import the classifier library
        from sklearn.ensemble import RandomForestClassifier
        clf = RandomForestClassifier(n_estimators=40)
        # exclude the response variable from the feature set
        del _columns['target']
        # construct a 2D (samples x features) array from the feature columns
        data_array = numpy.column_stack(list(_columns.values()))
        # train the classifier
        clf.fit(data_array, tgtvar)
        # export the pickled classifier to the database
        return dict(cl_name="Random Forest Classifier", cl_obj=cPickle.dumps(clf))
    };
  33. WRITING CLASSIFIERS TO TABLE
    CREATE TABLE Classifiers AS
    SELECT * FROM my_model_train(
      (SELECT target, feature1, …, featureN
       FROM data_preproc a INNER JOIN train_set b ON a.id = b.id
       WHERE b.train = true)
    ) WITH DATA;
  34. D. PREDICTING
    CREATE FUNCTION my_model_predict(id INT, feature1 DOUBLE, feature2 INT, feature3 INT, …, featureN DOUBLE)
    RETURNS TABLE(id INT, prediction STRING)
    LANGUAGE PYTHON {
        import cPickle
        # don't use id for prediction
        del _columns['id']
        # load the pickled classifier from the database using a loopback query
        res = _conn.execute("""SELECT cl_obj FROM Classifiers
                               WHERE cl_name = 'Random Forest Classifier'""")
        # unpickle the string to recreate the classifier
        classifier = cPickle.loads(res['cl_obj'][0])
        # construct a 2D (samples x features) array from the feature columns
        data_array = numpy.column_stack(list(_columns.values()))
        # perform the actual classification
        result = dict()
        result['prediction'] = classifier.predict(data_array)
        result['id'] = id
        return result
    };
  35. WRITING PREDICTED VALUES TO TABLE
    CREATE TABLE predicted AS
    SELECT * FROM my_model_predict(
      (SELECT a.id, feature1, …, featureN
       FROM data_preproc a INNER JOIN train_set b ON a.id = b.id
       WHERE b.train = false)
    ) WITH DATA;
  36. E. GETTING PREDICTIVE PERFORMANCE
    SELECT * FROM conf_matrix_stats(
      (SELECT * FROM scikit_conf_matrix(
        (SELECT a.target * 1.00 AS y_true, b.prediction * 1.00 AS y_pred
         FROM data_preproc a INNER JOIN predicted b ON a.id = b.id)))
    );
  37. CONFUSION MATRIX EVALUATION UDF
    CREATE FUNCTION scikit_conf_matrix(y_true INT, y_pred INT)
    RETURNS TABLE(col1 INT, col2 INT)
    LANGUAGE PYTHON {
        from sklearn.metrics import confusion_matrix
        cfm = confusion_matrix(y_true, y_pred)
        return cfm
    };

    CREATE FUNCTION conf_matrix_stats(c1 INT, c2 INT)
    RETURNS TABLE(accuracy FLOAT, precision FLOAT, sensitivity FLOAT, specificity FLOAT, f1 FLOAT)
    LANGUAGE PYTHON {
        # the two input columns carry the 2x2 confusion matrix,
        # here assumed as c1 = (TN, FP) and c2 = (FN, TP)
        TP = c2[1] * 1.00
        TN = c1[0] * 1.00
        FN = c2[0] * 1.00
        FP = c1[1] * 1.00
        N = TP + TN + FP + FN
        result = dict()
        result['accuracy'] = (TP + TN) / N
        result['precision'] = TP / (TP + FP)
        result['sensitivity'] = TP / (TP + FN)
        result['specificity'] = TN / (TN + FP)
        result['f1'] = 2*TP / (2*TP + FP + FN)
        return result
    };
  38. CLASSIFICATION MODELS
    http://scikit-learn.org/stable/supervised_learning.html
    ➤ Random forest
    ➤ AdaBoost
    ➤ Decision tree
    ➤ Extra tree
    ➤ Gradient boosting
    ➤ Stochastic gradient descent
    ➤ KNN (Nearest centroid)
    ➤ Mixed approach (Majority vote)
    Any of these can be swapped into the training UDF, as sketched after this list.
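    A minimal sketch of how the alternatives plug into the training UDF from slide 32: every scikit-learn estimator exposes the same fit/predict interface, so only the classifier construction changes (the classifiers dict and its labels are made up for illustration; the exact hyperparameters used in the benchmark are not given in the deck).

    # hypothetical registry of the tested models; any entry can replace
    # RandomForestClassifier in my_model_train, since all scikit-learn
    # classifiers share the fit()/predict() interface
    from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                                  GradientBoostingClassifier, VotingClassifier)
    from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
    from sklearn.linear_model import SGDClassifier
    from sklearn.neighbors import NearestCentroid

    classifiers = {
        "Random Forest Classifier": RandomForestClassifier(n_estimators=40),
        "ADA Boost": AdaBoostClassifier(),
        "Decision Tree": DecisionTreeClassifier(),
        "Extra Tree": ExtraTreeClassifier(),
        "Gradient Boosting": GradientBoostingClassifier(),
        "Stochastic Gradient Descent": SGDClassifier(),
        "Nearest Centroid": NearestCentroid(),
    }
    # mixed approach: hard majority vote over three of the models
    classifiers["Majority Vote (gd,tree,rf)"] = VotingClassifier(
        estimators=[("gd", classifiers["Gradient Boosting"]),
                    ("tree", classifiers["Decision Tree"]),
                    ("rf", classifiers["Random Forest Classifier"])],
        voting="hard")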
  39. EVALUATION OF BINARY CLASSIFICATION
    https://en.wikipedia.org/wiki/Sensitivity_and_specificity
    Confusion matrix (rows = real values, columns = estimated values):

    real \ estimated   0    1
    0                  TN   FP
    1                  FN   TP

    N = TP + TN + FP + FN
    accuracy = (TP + TN) / N
    precision = TP / (TP + FP)
    sensitivity = TP / (TP + FN)
    specificity = TN / (TN + FP)
    F1 = 2*TP / (2*TP + FP + FN)
  40. PERFORMANCE

    Model                         Accuracy   Precision   TPR (Sensitivity)   TNR (Specificity)   F1 score
    Random forest                 66%        33%         10%                 91%                 15%
    ADA Boost                     69%        40%         1%                  99%                 2%
    Decision tree                 58%        31%         32%                 70%                 31%
    Extra tree                    67%        31%         4%                  95%                 8%
    Gradient boosting             69%        31%         1%                  99%                 1%
    Stochastic gradient descent   56%        30%         34%                 66%                 32%
    KNN (Nearest centroid)        67%        31%         3%                  98%                 6%
    Majority Vote (gd,tree,rf)    61%        30%         19%                 80%                 24%