Slide 1

UpSkill Workshop 9: Introduction to Spark/Databricks | McLean, VA | March 30, 2017

Slide 2


Slide 3


Slide 4


Slide 5


Slide 6


Slide 7


Slide 8


Slide 9


Slide 10

RDD: Low-level API

Slide 11


Slide 12


Slide 13

val words = sc.textFile("poem.txt")
  .flatMap(line => line.split("\\s+"))
  .map(word => (word.length, word))

// Number of words by length
val wordLen: Map[Int, Long] = words.countByKey()

Slide 14

DataFrame: API for structured data

Slide 15


Slide 16


Slide 17

val trainDF = spark.read
  .option("header", "true")
  .csv("titanic_train.csv")
  .withColumn("Age", $"Age".cast("double"))
  .withColumn("Pclass", $"Pclass".cast("int"))

trainDF.show(3)
trainDF.printSchema()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
only showing top 3 rows

root
 |-- PassengerId: string (nullable = true)
 |-- Survived: string (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- ...

Slide 18

Spark MLlib / ML: Machine Learning

Slide 19

• Machine Learning library
• Algorithms of different types:
  • Classification
  • Regression
  • Clustering
  • Collaborative filtering
  • …
• Implemented using Spark-specific features:
  • Partitioning of the data
  • Iterative processing performed in memory
• Optimized for large volumes of data

Slide 20

• Machine Learning library built on top of RDDs
• Specific data structures:
  • Vector: a list of features; can be dense or sparse
  • LabeledPoint: a vector + a label
• In supervised learning:
  • Training data: RDD[LabeledPoint]
  • Test data: RDD[Vector]
• No longer recommended → use Spark ML
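The MLlib data structures above can be sketched as follows. This is a minimal illustration of the RDD-based API; the feature values are made up for the example.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A dense vector: every feature value stored explicitly
val dense = Vectors.dense(22.0, 1.0, 7.25)

// A sparse vector: size 3, with non-zero values only at indices 0 and 2
val sparse = Vectors.sparse(3, Array(0, 2), Array(22.0, 7.25))

// A labeled example for supervised learning: a label plus a feature vector
val example = LabeledPoint(0.0, dense)
```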

Slide 21

● Machine Learning library built on top of DataFrames
  ➔ New in Spark 1.2
● API with high-level components:
  ➔ Estimators: they generate a Model from training data
  ➔ Transformers: they transform the data
  ➔ A Model is itself a Transformer
  ➔ A Machine Learning algorithm is an Estimator
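The Estimator/Transformer distinction can be sketched with a StringIndexer, assuming a DataFrame trainDF like the one loaded from titanic_train.csv earlier (the column names are illustrative):

```scala
import org.apache.spark.ml.feature.StringIndexer

// StringIndexer is an Estimator: fit() learns the string-to-index mapping
val indexer = new StringIndexer()
  .setInputCol("Sex")
  .setOutputCol("SexIndex")

val indexerModel = indexer.fit(trainDF)       // Estimator -> Model
val indexed = indexerModel.transform(trainDF) // the Model is a Transformer
```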

Slide 22

● Steps:
  ➔ On training data, for each stage: if the stage is an Estimator, call its fit() method and then the transform() method of the resulting Model; if the stage is a Transformer, call its transform() method
  ➔ To make predictions on test or new data, call the transform() method of each Transformer (including the Models produced during training)
● Tedious → use Pipelines

Slide 23

● Spark ML provides an API to build a Pipeline composed of Estimators and Transformers
● Steps:
  ➔ Assemble the Pipeline
  ➔ Call Pipeline.fit() on training data to get a PipelineModel: fit() is called on each Estimator and transform() on each Transformer (including Models)
  ➔ Call PipelineModel.transform() on data to make predictions: transform() is called on each component of the Pipeline
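The steps above might look like this in code. This is a sketch: the stage configuration is illustrative, and it assumes Titanic-style trainDF/testDF DataFrames with a numeric "label" column already prepared.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Assemble the Pipeline: two feature stages plus one ML algorithm (an Estimator)
val indexer = new StringIndexer().setInputCol("Sex").setOutputCol("SexIndex")
val assembler = new VectorAssembler()
  .setInputCols(Array("SexIndex", "Age", "Fare"))
  .setOutputCol("features")
val rf = new RandomForestClassifier() // reads "features" and "label" by default

val pipeline = new Pipeline().setStages(Array(indexer, assembler, rf))

// fit() on training data returns a PipelineModel in which every stage is fitted
val model = pipeline.fit(trainDF)

// transform() on new data runs every fitted stage and adds a "prediction" column
val predictions = model.transform(testDF)
```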

Slide 24

● Conventions for column names:
  ➔ The vector of features: "features"
  ➔ The label in training data: "label"
  ➔ The predicted label: "prediction"
● These column names are used by default unless they are explicitly specified
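For instance, the defaults can be overridden on any stage through the corresponding setters (a sketch; the column names are illustrative):

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setFeaturesCol("features")     // the default, shown explicitly
  .setLabelCol("Survived")        // overrides the default "label"
  .setPredictionCol("prediction") // the default output column
```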

Slide 25

Feature engineering (org.apache.spark.ml.feature):
• StringIndexer
• OneHotEncoder
• VectorAssembler
• ...

ML algorithms:
• RandomForestClassifier
• LinearRegression
• KMeans
• ...
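Chained together, these feature-engineering transformers typically turn raw columns into the "features" vector. A sketch assuming the Titanic columns (note that this uses the single-column OneHotEncoder API current when these slides were written; it was deprecated in Spark 2.3):

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// String category -> numeric index
val indexer = new StringIndexer()
  .setInputCol("Embarked")
  .setOutputCol("EmbarkedIndex")

// Numeric index -> one-hot encoded vector
val encoder = new OneHotEncoder()
  .setInputCol("EmbarkedIndex")
  .setOutputCol("EmbarkedVec")

// Combine the selected columns into a single "features" vector
val assembler = new VectorAssembler()
  .setInputCols(Array("Age", "Fare", "EmbarkedVec"))
  .setOutputCol("features")
```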

Slide 26

Spark ML supports Grid Search optimization through k-fold Cross Validation:
• With k = 10, the model is first trained on 9/10 of the data and evaluated on the remaining 1/10.
• The model is then trained and evaluated on another split, and so on, 10 times.
• The final accuracy is the average of the accuracies measured on each fold.

Slide 27

● Parameters are defined through a ParamGridBuilder
● Performance is measured using an Evaluator:
  ➔ RegressionEvaluator: for numerical predictions
  ➔ BinaryClassificationEvaluator: for binary predictions
  ➔ MulticlassClassificationEvaluator: for multi-class predictions
● Instantiate a CrossValidator:
  ➔ Define the number of folds, the Estimator (e.g. a Pipeline), the parameter grid and the Evaluator
  ➔ Then call fit() to search for the most accurate model
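Put together, a grid search with cross validation might be sketched as follows, assuming the pipeline and trainDF built earlier and rf as its RandomForestClassifier stage (the grid values are illustrative):

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Candidate hyper-parameter values (illustrative)
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(10, 50))
  .addGrid(rf.maxDepth, Array(3, 5))
  .build()

// Default metric for binary classification: area under the ROC curve
val evaluator = new BinaryClassificationEvaluator()

val cv = new CrossValidator()
  .setEstimator(pipeline)           // the whole Pipeline is the Estimator
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(evaluator)
  .setNumFolds(10)                  // k = 10

// fit() trains one model per fold per grid point and keeps the best one
val cvModel = cv.fit(trainDF)
```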

Slide 28


Slide 29

Demo: Titanic on Databricks
• Scala: http://bit.ly/2nJQc1m
• Python: http://bit.ly/2mV0iIF

Slide 30
