Slide 1

Slide 1 text

Staying MAD: in-database analytics with MADlib
Ike Okonkwo (@ikeondata)
Zipfian Academy 3.3.14
Thursday, March 6, 14

Slide 2

Slide 2 text

What is MADlib?
• Started out as an academic project
  • UC Berkeley
  • University of Wisconsin
  • University of Florida
• Industry backing from Pivotal / EMC
• Aims to become the ‘CRAN’ for databases
• Currently runs on Greenplum DB / PostgreSQL
• Provides scalable analytics in SQL DBMS

Slide 3

Slide 3 text

In-Database Analytics?
• Performing advanced analytics and machine learning directly in the database
• MADlib implements machine learning algorithms in SQL
• Python drivers for complex tasks that require multiple iterations on data - MCMC, Gradient Descent
• Calls optimized C/C++ Linear Algebra libraries for matrix math
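A minimal sketch of the driver pattern described above: a Python loop controls the iterations, while each pass over the data runs as a single SQL aggregate inside the database. It uses Python's built-in sqlite3 module as a stand-in for Greenplum DB / PostgreSQL. The `points` table, the `fit_slope` function, and the learning rate are hypothetical names and values chosen for illustration; this is not MADlib's own code.

```python
import sqlite3


def fit_slope(conn, learning_rate=0.02, iterations=200):
    """Fit y ~ w * x by gradient descent; each iteration is one SQL aggregate."""
    w = 0.0
    for _ in range(iterations):
        # The database scans the table and returns a single number:
        # the gradient of the mean squared error with respect to w.
        (grad,) = conn.execute(
            "SELECT AVG(2.0 * (? * x - y) * x) FROM points", (w,)
        ).fetchone()
        w -= learning_rate * grad
    return w


# Hypothetical toy table holding points on the line y = 2x.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(float(x), 2.0 * x) for x in range(1, 6)],
)
slope = fit_slope(conn)
```

Only the current weight goes into the database and only the aggregated gradient comes back, so the rows themselves never leave it.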

Slide 4

Slide 4 text

Being MAD

Slide 5

Slide 5 text

Being MAD
• Magnetic - consume all types of data

Slide 6

Slide 6 text

Being MAD
• Magnetic - consume all types of data
• Agile - modeling, ETL, iterating

Slide 7

Slide 7 text

Being MAD
• Magnetic - consume all types of data
• Agile - modeling, ETL, iterating
• Deep - statistical and machine learning algorithms

Slide 8

Slide 8 text

Being MAD
• Magnetic - consume all types of data
• Agile - modeling, ETL, iterating
• Deep - statistical and machine learning algorithms
• library of parallel and scalable tools

Slide 9

Slide 9 text

Traditional BI / Analytics
[Diagram labels: Database / Data Warehouse, Analytics, Reports, Scripts]

Slide 10

Slide 10 text

Traditional BI / Analytics
[Diagram labels: Database / Data Warehouse, Analytics, Reports, Scripts]

Slide 11

Slide 11 text

MADlib
[Diagram labels: Database / Data Warehouse, Analytics]

Slide 12

Slide 12 text

MADlib Tools
• Supervised Learning
  • Linear Regression, Logistic Regression, Naive Bayes, Decision Trees, SVM
• Unsupervised Learning
  • K-means, SVD, LDA, Association Rules
• Support Modules
  • Sparse Matrices, Array Operations, Conjugate Gradient Optimization

Slide 13

Slide 13 text

Under the hood

Slide 14

Slide 14 text

Under the hood
Loss Functions: Lasso (L1), Ridge (L2), Linear Regression, SVM
MADlib uses Stochastic Gradient Descent under the hood to implement these models
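To make the idea on this slide concrete, here is a small pure-Python sketch of stochastic gradient descent in which the loss function and the penalty are interchangeable pieces. The function names, step sizes, and toy data are all assumptions made for illustration; this is not MADlib's implementation, which runs inside the database.

```python
import random


def squared_loss_grad(w, x, y):
    """Gradient of (w.x - y)^2 / 2 with respect to w (linear regression)."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def hinge_loss_grad(w, x, y):
    """Subgradient of max(0, 1 - y * w.x) for labels y in {-1, +1} (linear SVM)."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return [-y * xi for xi in x] if margin < 1 else [0.0] * len(x)


def l2_grad(w, lam):
    """Gradient of the ridge penalty (lam / 2) * ||w||^2."""
    return [lam * wi for wi in w]


def l1_grad(w, lam):
    """Subgradient of the lasso penalty lam * ||w||_1."""
    return [lam * ((wi > 0) - (wi < 0)) for wi in w]


def sgd(rows, loss_grad, penalty_grad=None, lam=0.0, step=0.1, epochs=500, seed=0):
    """Update the weights one row at a time, in a freshly shuffled order per epoch."""
    rng = random.Random(seed)
    rows = list(rows)
    w = [0.0] * len(rows[0][0])
    for _ in range(epochs):
        rng.shuffle(rows)
        for x, y in rows:
            g = loss_grad(w, x, y)
            if penalty_grad is not None:
                g = [gi + pi for gi, pi in zip(g, penalty_grad(w, lam))]
            w = [wi - step * gi for wi, gi in zip(w, g)]
    return w


# Linear regression on noise-free toy data y = 1 + 3x (first feature is a bias term).
reg_rows = [((1.0, x / 4.0), 1.0 + 3.0 * x / 4.0) for x in range(5)]
w_reg = sgd(reg_rows, squared_loss_grad)

# Linear SVM with a small ridge penalty on linearly separable toy data.
svm_rows = [((1.0, -2.0), -1), ((1.0, -1.0), -1), ((1.0, 1.0), 1), ((1.0, 2.0), 1)]
w_svm = sgd(svm_rows, hinge_loss_grad, penalty_grad=l2_grad, lam=0.01, epochs=100)
```

Swapping `squared_loss_grad` for `hinge_loss_grad`, or `l2_grad` for `l1_grad`, changes the model without touching the update loop, which is why one optimizer can serve all four models named on the slide.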

Slide 15

Slide 15 text

Open Source Interfaces
• PyMADlib
• PivotalR

Slide 16

Slide 16 text

Other Approaches...
• Hadoop Stack
  • Mahout, Cascading, Cascalog, MapReduce
• BDA Stack
  • Spark
• Others
  • GraphLab, Vowpal Wabbit
• SAS HPA
  • $$$$

Slide 17

Slide 17 text

Want to learn more
• Pivotal Open Source Hub meetup group
• Topic - Big Data Analytics : Scalable machine learning using open-source tools
• Where : Pivotal Labs, SF
• When : 3.4.14 - 6.30pm

Slide 18

Slide 18 text

References
• http://www.slideshare.net/jdegoes/indatabase-predictive-analytics
• http://www.slideshare.net/SrivatsanRamanujam/py-ma-dlibdatadaytexas
• http://www.slideshare.net/SrivatsanRamanujam/python-powered-data-science-at-pivotal-pydata-2013?v=qf1&b=&from_search=9
• http://www.slideshare.net/SarahAerni/data-science-as-a-commodity-use-madlib-r-other-oss-tools-for-data-science-from-pivotal-open-source-hub-meetup?v=qf1&b=&from_search=6
• http://www.slideshare.net/christangrant/mad-skills-9012267?v=qf1&b=&from_search=11
• http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-38.pdf
• http://sacan.biomed.drexel.edu/vldb2012/program/?volno=vol5no12&pid=935&downloadslides=1
• Introduction to Statistical Learning, Hastie, Tibshirani, et al.

Slide 19

Slide 19 text


Slide 20

Slide 20 text

Backup Slides
• array operations implemented as UDFs
• UDA and window aggregates
• Common use case : Legacy Data Warehouse with no flexibility
• supports statistical text analysis - feature extraction, string matching (n-grams), Viterbi and MCMC inference
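As an illustration of the UDA (user-defined aggregate) bullet above, the sketch below mimics the usual shape of such an aggregate in plain Python: a transition function folds each row into a small state, a merge function combines states from different segments, and a final function turns the state into a result. The example fits a one-variable linear regression. The function names and toy segments are assumptions for illustration and are not taken from MADlib's source.

```python
EMPTY = (0, 0.0, 0.0, 0.0, 0.0)  # (n, sum_x, sum_y, sum_xx, sum_xy)


def transition(state, x, y):
    """Fold one row into the running state."""
    n, sx, sy, sxx, sxy = state
    return (n + 1, sx + x, sy + y, sxx + x * x, sxy + x * y)


def merge(a, b):
    """Combine the partial states computed on two segments."""
    return tuple(ai + bi for ai, bi in zip(a, b))


def final(state):
    """Turn the merged state into (slope, intercept) by ordinary least squares."""
    n, sx, sy, sxx, sxy = state
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


def aggregate(rows):
    """Run the transition function over every row of one segment."""
    state = EMPTY
    for x, y in rows:
        state = transition(state, x, y)
    return state


# Two hypothetical segments holding rows of the line y = 2x + 1.
segment_a = [(0.0, 1.0), (1.0, 3.0)]
segment_b = [(2.0, 5.0), (3.0, 7.0)]
slope, intercept = final(merge(aggregate(segment_a), aggregate(segment_b)))
```

Because the state is a fixed-size tuple of sums, each segment can be scanned independently and only five numbers per segment need to be combined at the end.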