Apache MADlib AI/ML

This presentation gives an overview of the Apache MADlib AI/ML project. It explains Apache MADlib AI/ML in terms of its functionality, architecture and dependencies, and also gives an SQL example.

Links for further information and connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

https://open-source-systems.blogspot.com/

Mike Frampton

June 14, 2020
Transcript

  1. What Is Apache MADlib?
    • For scalable in-database analytics
    • Open source Apache 2.0 license
    • For machine learning in SQL
    • At big data scale
    • Offers graph, statistics, analytics, deep learning
    • Provides data-parallel implementations
    • For structured and unstructured data
  2. MADlib Prerequisites
    • Currently supports databases
        – PostgreSQL
            • Needs Python extension specified
        – Greenplum (distributed db)
        – Apache Hawq (v1.12+) (distributed db)
    • Requires the GNU M4 Unix macro processor
    • Works with Python 2.6 and 2.7
  3. MADlib Architecture
    • MADlib has three main layers
    • Python driver functions
        – Main entry point from user input
        – Largely responsible for algorithm flow control
        – Validating input parameters
        – Executing SQL statements
        – Evaluating the results
        – Potentially looping to execute more SQL statements
            • Until a convergence criterion has been met
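The driver-function flow on this slide (validate, execute SQL, evaluate, loop until convergence) can be sketched roughly as follows. This is a minimal illustration and not MADlib code: the function `fit_slope`, the table `points` and its columns are hypothetical, and Python's built-in `sqlite3` stands in for PostgreSQL or Greenplum so that the sketch runs on its own. The point is only that the database computes the aggregate while Python controls the iteration.

```python
import sqlite3


def fit_slope(conn, table, x_col, y_col, lr=0.01, tol=1e-9, max_iter=1000):
    """Hypothetical driver: fit y ~ w * x by gradient descent.

    The gradient is computed by a SQL aggregate; Python only controls the flow.
    """
    # Validate input parameters before building any SQL.
    for name in (table, x_col, y_col):
        if not name.isidentifier():
            raise ValueError(f"invalid identifier: {name!r}")
    if lr <= 0 or tol <= 0 or max_iter < 1:
        raise ValueError("lr and tol must be positive, max_iter at least 1")

    w = 0.0
    for iteration in range(1, max_iter + 1):
        # Execute a SQL statement: the database aggregates over all rows.
        row = conn.execute(
            f"SELECT AVG(2.0 * (? * {x_col} - {y_col}) * {x_col}) FROM {table}",
            (w,),
        ).fetchone()
        if row[0] is None:
            raise ValueError("no rows to fit")
        # Evaluate the result and decide whether to loop again.
        step = lr * row[0]
        w -= step
        if abs(step) < tol:  # convergence criterion met
            break
    return w, iteration


# Stand-in database with three points lying on y = 2x (assumed sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany("INSERT INTO points VALUES (?, ?)", [(1, 2), (2, 4), (3, 6)])
w, n_iter = fit_slope(conn, "points", "x", "y")
```

In this sketch every iteration issues one aggregate query, so only the current parameter and one number cross the boundary between Python and the database.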
  4. MADlib Architecture
    • MADlib has three main layers
    • C++ implementation functions
        – C++ definitions of the core functions/aggregates
            • Needed for particular algorithms
        – Implemented in C++ rather than Python
            • For performance reasons
  5. MADlib Architecture
    • MADlib has three main layers
    • C++ database abstraction layer
        – Provides a programming interface
        – Abstracts all the Postgres internal details
        – Provides support for different back end platforms
        – Focuses on the internal functionality
            • Rather than the platform integration logic
  6. MADlib Data Types and Transformations
    • Arrays and Matrices
    • Encoding Categorical Variables
    • Path
    • Pivot
    • Sessionize
    • Stemming
  7. MADlib Graph Functionality
    • All Pairs Shortest Path
    • Breadth-First Search
    • HITS
    • Measures
    • PageRank
    • Single Source Shortest Path
    • Weakly Connected Components
  8. MADlib Model Selection / Sampling
    • Model Selection
        – Cross Validation
        – Prediction Metrics
        – Train-Test Split
    • Sampling
        – Balanced Sampling
        – Stratified Sampling
  9. MADlib Statistics / Supervised Learning
    • Statistics
        – Descriptive Statistics
        – Inferential Statistics
        – Probability Functions
    • Supervised Learning
        – Conditional Random Field
        – k-Nearest Neighbors
        – Neural Network
        – Regression Models
        – Support Vector Machines
        – Tree Methods
  10. MADlib Time Series / Unsupervised Learning
    • Time Series Analysis
        – ARIMA
    • Unsupervised Learning
        – Association Rules
        – Clustering
        – Dimensionality Reduction
        – Topic Modelling
  11. MADlib Utilities
    • Columns to Vector
    • Database Functions
    • Linear Solvers
    • Mini-Batch Preprocessor
    • PMML Export
    • Term Frequency
    • Vector to Columns
  12. MADlib Deep Learning Example SQL
    • First define the model configurations to train
    • Meaning either model architectures or hyperparameters
    • Load them into a model selection table
    • The combination of model architectures and hyperparameters
    • Constitutes the model configurations to train
    • In the picture there are three model configurations
    • Represented by the three different purple shapes
  13. MADlib Deep Learning Example SQL
    • Once we have model combinations
    • In the model selection table
    • Call the fit function to train the models
        – In parallel
    • In the picture the three orange shapes
    • Represent the three models that have been trained
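The two steps on slides 12 and 13 can be illustrated with a small stand-alone sketch. It does not call MADlib: the table name `model_selection`, its columns, the sample learning rates and the placeholder `train` function are all assumptions made for illustration, `sqlite3` stands in for the real database, and a thread pool stands in for MADlib's in-database parallel training.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from itertools import product

architectures = [1]                  # hypothetical model architecture ids
learning_rates = [0.1, 0.01, 0.001]  # hypothetical hyperparameter values

# Step 1: load every architecture/hyperparameter combination
# into a model selection table (stand-in database).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE model_selection ("
    "config_id INTEGER PRIMARY KEY, model_id INTEGER, learning_rate REAL)"
)
conn.executemany(
    "INSERT INTO model_selection (model_id, learning_rate) VALUES (?, ?)",
    list(product(architectures, learning_rates)),
)
configs = conn.execute(
    "SELECT config_id, model_id, learning_rate "
    "FROM model_selection ORDER BY config_id"
).fetchall()


def train(config):
    """Placeholder for a real fit call; it only records what would be trained."""
    config_id, model_id, learning_rate = config
    return {
        "config_id": config_id,
        "model_id": model_id,
        "learning_rate": learning_rate,
        "trained": True,
    }


# Step 2: "fit" all configurations in parallel, one task per configuration.
with ThreadPoolExecutor(max_workers=len(configs)) as pool:
    results = list(pool.map(train, configs))
```

With one architecture and three learning rates the table holds three configurations, matching the three shapes described on the slides.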
  14. Available Books
    • See “Big Data Made Easy” – Apress Jan 2015
    • See “Mastering Apache Spark” – Packt Oct 2015
    • See “Complete Guide to Open Source Big Data Stack” – Apress Jan 2018
    • Find the author on Amazon
        – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
    • Connect on LinkedIn
        – www.linkedin.com/in/mike-frampton-38563020
  15. Connect
    • Feel free to connect on LinkedIn
        – www.linkedin.com/in/mike-frampton-38563020
    • See my open source blog at
        – open-source-systems.blogspot.com/
    • I am always interested in
        – New technology
        – Opportunities
        – Technology based issues
        – Big data integration