Staying MAD : in-database analytics with MADlib

Ike Okonkwo

March 03, 2014

Transcript

  1. Staying MAD : in-database analytics with MADlib

    Ike Okonkwo (@ikeondata)
    Zipfian Academy 3.3.14
    Thursday, March 6, 14
  2. What is MADlib ?

    • Started out as academic project
      • UC Berkeley
      • University of Wisconsin
      • University of Florida
    • Industry backing from Pivotal / EMC
    • Aims to become ‘CRAN’ for databases
    • Currently runs on Greenplum DB / PostgreSQL
    • Provides scalable analytics in SQL DBMS
  3. In-Database Analytics ?

    • Performing advanced analytics and machine learning directly in the database
    • MADlib implements machine learning algorithms in SQL
    • Python drivers for complex tasks that require multiple iterations on data - MCMC, Gradient Descent
    • Calls optimized C/C++ Linear Algebra libraries for matrix math
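The driver pattern on slide 3 can be sketched as follows: a Python loop owns the iteration, while each pass over the data is a single SQL aggregate, so only a few numbers leave the database per step. This is a minimal illustration, not MADlib's actual driver code. It assumes SQLite (via Python's standard `sqlite3` module) as a stand-in for Greenplum DB / PostgreSQL, and the table `points`, its columns, and the function `fit_line` are hypothetical names.

```python
import sqlite3

# Toy data lying exactly on y = 2*x + 1 (hypothetical table and columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(float(x), 2.0 * x + 1.0) for x in range(-5, 6)],
)

def fit_line(conn, steps=500, lr=0.05):
    """Batch gradient descent where each iteration is one SQL aggregate."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        # The database scans the rows and returns just two numbers:
        # the mean gradient of 0.5 * (w*x + b - y)^2 w.r.t. w and b.
        gw, gb = conn.execute(
            "SELECT AVG((? * x + ? - y) * x), AVG(? * x + ? - y) FROM points",
            (w, b, w, b),
        ).fetchone()
        w -= lr * gw
        b -= lr * gb
    return w, b

w, b = fit_line(conn)
print(round(w, 3), round(b, 3))
```

The design point is that the rows never leave the database; the driver only ships the current parameters in and reads the aggregated gradient out.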
  4. Being MAD

    • Magnetic - consume all types of data
    • Agile - modeling, ETL, iterating
  5. Being MAD

    • Magnetic - consume all types of data
    • Agile - modeling, ETL, iterating
    • Deep - statistical and machine learning algorithms
  6. Being MAD

    • Magnetic - consume all types of data
    • Agile - modeling, ETL, iterating
    • Deep - statistical and machine learning algorithms
    • library of parallel and scalable tools
  7. MADlib Tools

    • Supervised Learning
      • Linear Regression, Logistic Regression, Naive Bayes, Decision Trees, SVM
    • Unsupervised Learning
      • K-means, SVD, LDA, Association Rules
    • Support Modules
      • Sparse Matrices, Array Operations, Conjugate Gradient Optimization
  8. Under the hood

    Lasso (L1) / Ridge (L2) / Linear Regression / SVM
    Loss Functions
    MADlib uses Stochastic Gradient Descent under the hood to implement these models
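To make slide 8 concrete, here is a generic sketch of stochastic gradient descent in which the loss (squared error or hinge) and the penalty (L1 or L2) are interchangeable pieces. It is plain Python written for illustration under stated assumptions: all function names, the toy data, and the hyperparameters are hypothetical, and nothing here is taken from MADlib's implementation.

```python
import random

def squared_grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w (linear regression)."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def hinge_grad(w, x, y):
    """Subgradient of max(0, 1 - y * w.x) for labels y in {-1, +1} (SVM)."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return [-y * xi for xi in x] if margin < 1 else [0.0] * len(x)

def l2_grad(w, lam):
    """Gradient of the ridge penalty 0.5 * lam * ||w||^2."""
    return [lam * wi for wi in w]

def l1_grad(w, lam):
    """Subgradient of the lasso penalty lam * ||w||_1."""
    return [lam * ((wi > 0) - (wi < 0)) for wi in w]

def sgd(data, loss_grad, penalty_grad=None, lam=0.0, lr=0.01, epochs=200, seed=0):
    """Visit the rows in shuffled order, one (sub)gradient step per row."""
    rng = random.Random(seed)
    rows = list(data)
    w = [0.0] * len(rows[0][0])
    for _ in range(epochs):
        rng.shuffle(rows)
        for x, y in rows:
            g = loss_grad(w, x, y)
            if penalty_grad is not None:
                g = [gi + pi for gi, pi in zip(g, penalty_grad(w, lam))]
            w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Toy data; the first feature is a constant 1.0 acting as the intercept.
reg_data = [((1.0, t), 1.0 + 3.0 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
clf_data = [((1.0, t), 1.0 if t > 0 else -1.0) for t in (-2.0, -1.0, 1.0, 2.0)]

ols_w = sgd(reg_data, squared_grad, lr=0.1)                      # linear regression
ridge_w = sgd(reg_data, squared_grad, l2_grad, lam=1.0, lr=0.1)  # ridge
svm_w = sgd(clf_data, hinge_grad, lr=0.1)                        # linear SVM
print(ols_w, ridge_w, svm_w)
```

Swapping `l2_grad` for `l1_grad` gives a simple subgradient version of the lasso; the update loop itself does not change, which is the appeal of treating all four models as loss functions for one optimizer.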
  9. Other Approaches...

    • Hadoop Stack
      • Mahout, Cascading, Cascalog, MapReduce
    • BDA Stack
      • Spark
    • Others
      • GraphLab, Vowpal Wabbit
    • SAS HPA
      • $$$$
  10. Want to learn more

    • Pivotal Open Source Hub meetup group
    • Topic - Big Data Analytics : Scalable machine learning using open-source tools
    • Where : Pivotal Labs, SF
    • When : 3.4.14 - 6.30pm
  11. References

    • http://www.slideshare.net/jdegoes/indatabase-predictive-analytics
    • http://www.slideshare.net/SrivatsanRamanujam/py-ma-dlibdatadaytexas
    • http://www.slideshare.net/SrivatsanRamanujam/python-powered-data-science-at-pivotal-pydata-2013?v=qf1&b=&from_search=9
    • http://www.slideshare.net/SarahAerni/data-science-as-a-commodity-use-madlib-r-other-oss-tools-for-data-science-from-pivotal-open-source-hub-meetup?v=qf1&b=&from_search=6
    • http://www.slideshare.net/christangrant/mad-skills-9012267?v=qf1&b=&from_search=11
    • http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-38.pdf
    • http://sacan.biomed.drexel.edu/vldb2012/program/?volno=vol5no12&pid=935&downloadslides=1
    • Introduction to Statistical Learning, Hastie, Tibshirani, et al
  12. Backup Slides

    • array operations implemented as UDFs
    • UDA and window aggregates
    • Common use case : Legacy Data Warehouse with no flexibility
    • supports statistical text analysis - feature extraction, string matching (n-grams), Viterbi and MCMC inference
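As an illustration of the n-gram string matching mentioned on the backup slide, the sketch below compares two strings by the overlap of their character trigrams. The slide does not say which similarity measure MADlib uses, so the Jaccard measure, the padding scheme, and the function names `char_ngrams` and `ngram_similarity` are all assumptions made for this example.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of two strings."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

for other in ("madlib", "matlab", "postgres"):
    print(other, ngram_similarity("MADlib", other))
```

Padding with spaces lets the first and last characters contribute their own n-grams, so strings that share only a prefix or suffix still get a nonzero score.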