Slide 1

Slide 1 text

What Is Apache SystemML?
● A machine learning system
● Designed to scale to Spark / Hadoop clusters
● Open source / Apache 2 license
● Developed in Java
● Supports R-like and Python-like languages, which are designed to scale into the big data range
● Automatic optimization based on data and cluster characteristics

Slide 2

Slide 2 text

SystemML Execution Modes
● SystemML supports multiple execution modes, including
  – Standalone
  – Spark Batch
  – Spark MLContext
  – Hadoop Batch
  – Java Machine Learning Connector (JMLC)

Slide 3

Slide 3 text

SystemML Dependencies
● SystemDS was forked from SystemML 1.2
● Current dependencies
  – Java 8+
  – Scala 2.11+
  – Python 2.7/3.5+
  – Hadoop 2.6+
  – Spark 2.1+

Slide 4

Slide 4 text

What Is Apache SystemDS?
● Forked from Apache SystemML 1.2 in September 2018
● Supports linear algebra programs over matrices
● Replaces the underlying data model and compiler
● Substantially extends the supported functionalities
● Supports the whole data science lifecycle
  – Data integration, cleaning
  – Feature engineering
  – Efficient local and distributed ML model training
  – Deployment, serving

Slide 5

Slide 5 text

What Is Apache SystemDS?
● R-like languages for
  – The data science lifecycle stages
  – Differing expertise levels
● High-level scripts are compiled into hybrid execution plans
  – For local, in-memory CPU / GPU operations
  – For distributed operations on Apache Spark
● The underlying data model is the DataTensor
  – A tensor (multi-dimensional array) whose first dimension may have a heterogeneous and nested schema
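The idea of a hybrid execution plan can be illustrated with a toy sketch: each operation is assigned to a local or a distributed backend depending on whether its estimated memory footprint fits a local budget. This is only an illustration under stated assumptions, not the SystemDS compiler. The names `MatrixOp`, `estimate_dense_bytes`, `choose_backend`, and `plan`, the "LOCAL" / "SPARK" labels, and the dense 8-bytes-per-cell estimate are all hypothetical.

```python
from dataclasses import dataclass

BYTES_PER_DOUBLE = 8  # assumption: dense matrices of 64-bit floats


@dataclass(frozen=True)
class MatrixOp:
    """Hypothetical description of one operation's output matrix."""
    name: str
    rows: int
    cols: int


def estimate_dense_bytes(op):
    """Rough dense-memory estimate for the operation's output."""
    return op.rows * op.cols * BYTES_PER_DOUBLE


def choose_backend(op, local_budget_bytes):
    """Pick the local backend if the estimate fits the budget, else Spark."""
    if estimate_dense_bytes(op) <= local_budget_bytes:
        return "LOCAL"
    return "SPARK"


def plan(ops, local_budget_bytes):
    """Return a list of (operation name, backend) pairs, one per operation."""
    return [(op.name, choose_backend(op, local_budget_bytes)) for op in ops]


if __name__ == "__main__":
    one_gib = 1024 ** 3
    ops = [
        MatrixOp("small_multiply", 1_000, 100),
        MatrixOp("large_multiply", 10_000_000, 1_000),
    ]
    for name, backend in plan(ops, one_gib):
        print(name, "->", backend)
```

A real compiler would also account for sparsity, intermediate results, and GPU availability; the sketch only shows the per-operation decision that makes a plan "hybrid".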

Slide 6

Slide 6 text

SystemDS Algorithms
● Descriptive Statistics
  – Univariate Statistics
  – Bivariate Statistics
  – Stratified Bivariate Statistics
● Classification
  – Multinomial Logistic Regression
  – Support Vector Machines
    ● Binary-Class Support Vector Machines
    ● Multi-Class Support Vector Machines
  – Naive Bayes
  – Decision Trees
  – Random Forests
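To make "univariate statistics" concrete, here is a minimal sketch of the kind of per-column summary such an algorithm produces. It uses only the Python standard library and is not the SystemDS implementation; the function name `univariate_summary` and the choice of statistics are assumptions made for illustration.

```python
import statistics


def univariate_summary(values):
    """Return basic univariate statistics for a list of at least two numbers."""
    if len(values) < 2:
        raise ValueError("need at least two values to compute a sample variance")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),  # sample variance (divides by n - 1)
        "std_dev": statistics.stdev(values),      # square root of the sample variance
        "min": min(values),
        "max": max(values),
    }


if __name__ == "__main__":
    column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    for key, value in univariate_summary(column).items():
        print(f"{key}: {value}")
```

On a cluster the same quantities would be computed per column over a distributed matrix; the sketch shows only the single-column, in-memory case.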

Slide 7

Slide 7 text

SystemDS Algorithms
● Clustering
  – K-Means Clustering
● Regression
  – Linear Regression
  – Stepwise Linear Regression
  – Generalized Linear Models
  – Stepwise Generalized Linear Regression
  – Regression Scoring and Prediction
● Matrix Factorization
  – Principal Component Analysis
  – Matrix Completion via Alternating Minimizations
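As a small illustration of the linear regression and scoring entries above, the following sketch fits a one-feature ordinary least squares model in closed form and then predicts with it. It is a plain-Python teaching example, not the SystemDS algorithm, which works on matrices with many features. The function names `fit_simple_linear_regression` and `predict` are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (one feature)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Score a single input with the fitted model."""
    return slope * x + intercept


if __name__ == "__main__":
    slope, intercept = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
    print("slope:", slope, "intercept:", intercept)
    print("prediction for x=5:", predict(5, slope, intercept))
```

The split into a fitting step and a separate `predict` step mirrors the distinction the slide draws between regression and "Regression Scoring and Prediction".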

Slide 8

Slide 8 text

SystemDS Algorithms
● Survival Analysis
  – Kaplan-Meier Survival Analysis
  – Cox Proportional Hazard Regression Model
● Factorization Machines
  – Factorization Machine
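For readers unfamiliar with Kaplan-Meier survival analysis, here is a minimal sketch of the estimator: at each distinct time with at least one observed event, the running survival probability is multiplied by (1 - events / subjects at risk). This is a plain-Python illustration, not the SystemDS script; the function name `kaplan_meier` and its input layout (parallel lists of durations and event flags) are assumptions.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: time until the event or until censoring, one per subject.
    observed:  True if the event was observed, False if the subject was censored.
    """
    if len(durations) != len(observed):
        raise ValueError("durations and observed must have the same length")
    records = sorted(zip(durations, observed))
    at_risk = len(records)
    survival = 1.0
    curve = []
    i = 0
    while i < len(records):
        time = records[i][0]
        events = 0
        leaving = 0
        # Group all subjects that share this time.
        while i < len(records) and records[i][0] == time:
            if records[i][1]:
                events += 1
            leaving += 1
            i += 1
        if events > 0:
            survival *= 1.0 - events / at_risk
            curve.append((time, survival))
        # Both event and censored subjects leave the risk set after this time.
        at_risk -= leaving
    return curve


if __name__ == "__main__":
    times = [1, 2, 2, 3, 4]
    flags = [True, True, False, True, False]
    for time, prob in kaplan_meier(times, flags):
        print(f"t={time}: S(t)={prob:.3f}")
```

Subjects censored at the same time as an event are counted as still at risk for that event, which is the usual convention.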

Slide 9

Slide 9 text

SystemDS Deep Neural Nets
● Use SystemDS to implement deep neural networks
  – Specify the network in Keras format and invoke it with the Keras2DML API
  – Specify the network in Caffe format and invoke it with the Caffe2DML API
  – Use the DML-bodied SystemDS-NN library
● Ease compute resource constraints during training with
  – Native BLAS (Basic Linear Algebra Subprograms)
  – SystemDS GPU backend

Slide 10

Slide 10 text

Available Books
● See “Big Data Made Easy” – Apress Jan 2015
● See “Mastering Apache Spark” – Packt Oct 2015
● See “Complete Guide to Open Source Big Data Stack” – Apress Jan 2018
● Find the author on Amazon
  – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020

Slide 11

Slide 11 text

Connect
● Feel free to connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
  – open-source-systems.blogspot.com/
● I am always interested in
  – New technology
  – Opportunities
  – Technology-based issues
  – Big data integration