Introduction to Spark ML / Databricks

Alexis Seigneurin

April 05, 2017

Transcript

  1. UpSkill Workshop 9: Introduction to Spark/Databricks McLean, VA | March 30th 2017

     val words = sc.textFile("poem.txt")
       .flatMap(line => line.split("\\s+"))
       .map(word => (word.length, word))

     // Number of words by length
     // (countByKey() returns a scala.collection.Map, not the default immutable Map)
     val wordLen: scala.collection.Map[Int, Long] = words.countByKey()
  2. import spark.implicits._  // enables the $"..." column syntax (already imported in notebooks and spark-shell)

     val trainDF = spark.read
       .option("header", "true")
       .csv("titanic_train.csv")
       .withColumn("Age", $"Age".cast("double"))
       .withColumn("Pclass", $"Pclass".cast("int"))

     trainDF.show(3)
     trainDF.printSchema()

     +-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
     |PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
     +-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
     |          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
     |          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
     |          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
     +-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
     only showing top 3 rows

     root
      |-- PassengerId: string (nullable = true)
      |-- Survived: string (nullable = true)
      |-- Pclass: integer (nullable = true)
      |-- Name: string (nullable = true)
      |-- Sex: string (nullable = true)
      |-- Age: double (nullable = true)
      |-- ...
  3. • Machine Learning library
     • Algorithms of different types:
       • Classification
       • Regression
       • Clustering
       • Collaborative filtering
       • …
     • Implemented using Spark-specific features:
       • Partitioning of the data
       • Iterative processing performed in memory
     • Optimized for large volumes of data
  4. • Machine Learning library built on top of RDDs
     • Specific data structures:
       • Vector: a list of features, can be dense or sparse
       • LabeledPoint: a vector + a label
     • In supervised learning:
       • Training data: RDD[LabeledPoint]
       • Test data: RDD[Vector]
     • No longer recommended → use Spark ML
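The data structures listed above can be sketched as follows. This is a minimal illustration assuming the RDD-based `org.apache.spark.mllib` package is on the classpath; the feature values and the label are made up.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// The same 4-feature vector in both representations (values are illustrative).
val dense: Vector = Vectors.dense(1.0, 0.0, 0.0, 3.0)

// Sparse form: total size, indices of the non-zero entries, and their values.
val sparse: Vector = Vectors.sparse(4, Array(0, 3), Array(1.0, 3.0))

// A LabeledPoint pairs a label (here 1.0) with a feature vector.
val point: LabeledPoint = LabeledPoint(1.0, dense)
```

A sparse vector is usually the better choice when most features are zero, since only the non-zero entries are stored.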
  5. • Machine Learning library built on top of DataFrames
       ➔ New in Spark 1.2
     • API with high-level components:
       ➔ Estimators: they generate a Model from training data
       ➔ Transformers: they transform the data
       ➔ A Model is itself a Transformer
       ➔ A Machine Learning algorithm is an Estimator
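The Estimator/Transformer relationship can be illustrated with `StringIndexer`. This is a sketch, not part of the original slides: the local `SparkSession`, the toy rows, and the `SexIndex` column name are assumptions chosen for the example.

```scala
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in a notebook, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("estimator-sketch").getOrCreate()
import spark.implicits._

// Toy rows standing in for a "Sex" column (made-up data).
val df = Seq("male", "female", "female").toDF("Sex")

// StringIndexer is an Estimator: fit() learns the string-to-index mapping.
val indexer = new StringIndexer().setInputCol("Sex").setOutputCol("SexIndex")
val model: StringIndexerModel = indexer.fit(df)

// The resulting Model is a Transformer: transform() appends the new column.
val indexed = model.transform(df)
indexed.show()
```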
  6. • Steps:
       ➔ On training data, for each stage in order: if the stage is an Estimator, call its fit() method and then the transform() method of the resulting Model; if the stage is a Transformer, call its transform() method
       ➔ To make predictions on test or new data, call the transform() method of each Transformer in the same order (including the Models from the training step)
     • Tedious → use Pipelines
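The manual steps above can be sketched as follows. The toy data, the column names, and the choice of `LogisticRegression` with `regParam = 0.1` are assumptions made for illustration only.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in a notebook, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("manual-stages").getOrCreate()
import spark.implicits._

// Made-up training and test rows.
val train = Seq(("male", 22.0, 0.0), ("female", 38.0, 1.0), ("female", 26.0, 1.0), ("male", 35.0, 0.0))
  .toDF("Sex", "Age", "label")
val test = Seq(("female", 30.0), ("male", 40.0)).toDF("Sex", "Age")

// Stage 1 (Estimator): fit on the training data, then transform with the Model.
val indexerModel = new StringIndexer().setInputCol("Sex").setOutputCol("SexIndex").fit(train)
val trainIndexed = indexerModel.transform(train)

// Stage 2 (Transformer): transform only.
val assembler = new VectorAssembler().setInputCols(Array("SexIndex", "Age")).setOutputCol("features")
val trainFeatures = assembler.transform(trainIndexed)

// Stage 3 (Estimator): the learning algorithm; fit() returns a Model.
val lrModel = new LogisticRegression().setRegParam(0.1).fit(trainFeatures)

// Prediction: replay every Transformer, in the same order, on the new data.
val predictions = lrModel.transform(assembler.transform(indexerModel.transform(test)))
predictions.select("Sex", "Age", "prediction").show()
```

Every intermediate Model has to be kept and replayed by hand, which is the bookkeeping a Pipeline removes.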
  7. • Spark ML provides an API to build a Pipeline composed of Estimators and Transformers
     • Steps:
       ➔ Assemble the Pipeline
       ➔ Call Pipeline.fit() on training data to get a PipelineModel: the fit() method is called on each Estimator and the transform() method is called on each Transformer (including Models)
       ➔ Call PipelineModel.transform() on data to make predictions: the transform() method is called on each component of the Pipeline
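A minimal Pipeline sketch for the steps above. The stages, toy data, and column names are illustrative assumptions rather than the workshop's actual demo code.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in a notebook, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("pipeline-sketch").getOrCreate()
import spark.implicits._

// Made-up training and test rows.
val train = Seq(("male", 22.0, 0.0), ("female", 38.0, 1.0), ("female", 26.0, 1.0), ("male", 35.0, 0.0))
  .toDF("Sex", "Age", "label")
val test = Seq(("female", 30.0), ("male", 40.0)).toDF("Sex", "Age")

// Declare the stages without fitting anything yet.
val indexer = new StringIndexer().setInputCol("Sex").setOutputCol("SexIndex")
val assembler = new VectorAssembler().setInputCols(Array("SexIndex", "Age")).setOutputCol("features")
val lr = new LogisticRegression().setRegParam(0.1)

// Assemble the Pipeline, fit it once, and reuse the fitted PipelineModel.
val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))
val model: PipelineModel = pipeline.fit(train)
val predictions = model.transform(test)
predictions.select("Sex", "Age", "prediction").show()
```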
  8. • Conventions for column names:
       ➔ The vector of features: “features”
       ➔ The label in training data: “label”
       ➔ The predicted label: “prediction”
     • These default names are used unless other column names are explicitly specified
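The defaults and their overrides can be checked directly on an algorithm instance. In this sketch the custom column names (`Survived`, `passengerFeatures`, `predictedSurvived`) are hypothetical; the label column would also have to be numeric before fitting.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// With no configuration, the conventional column names apply.
val defaults = new RandomForestClassifier()
val defaultNames = (defaults.getFeaturesCol, defaults.getLabelCol, defaults.getPredictionCol)

// Hypothetical custom names, set explicitly through the setters.
val custom = new RandomForestClassifier()
  .setLabelCol("Survived")
  .setFeaturesCol("passengerFeatures")
  .setPredictionCol("predictedSurvived")
```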
  9. Feature engineering (org.apache.spark.ml.feature):
       • StringIndexer
       • OneHotEncoder
       • VectorAssembler
       • ...
     ML algorithms:
       • RandomForestClassifier
       • LinearRegression
       • KMeans
       • ...
  10. Spark ML provides support for Grid Search optimization through k-fold Cross Validation
      • With k=10, the data is split into 10 folds: the model is first trained on 9/10 of the data and evaluated on the remaining 1/10.
      • The model is then trained and evaluated on another split, and so on, 10 times in total. The final accuracy is the average of the accuracies measured for each fold
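The averaging logic of k-fold cross validation can be sketched in plain Scala, independent of Spark. The helper `kFoldAverage` and its `score` argument are hypothetical names introduced here; folds are assigned round-robin for simplicity, whereas Spark assigns rows to folds randomly.

```scala
// Assumption: `score` stands in for "train on `training`, measure the metric on `validation`".
def kFoldAverage[A](data: Seq[A], k: Int)(score: (Seq[A], Seq[A]) => Double): Double = {
  // Assign each element to a fold by its position (round-robin).
  val indexed = data.zipWithIndex
  val scores = (0 until k).map { fold =>
    // One fold is held out for validation, the other k - 1 are used for training.
    val (validation, training) = indexed.partition { case (_, i) => i % k == fold }
    score(training.map(_._1), validation.map(_._1))
  }
  // The final metric is the average over the k folds.
  scores.sum / k
}

// Usage with a dummy scorer that only reports the fraction of rows used for training.
val data = 1 to 100
val avg = kFoldAverage(data, 10)((train, valid) => train.size.toDouble / (train.size + valid.size))
```

With k = 10 and 100 rows, each round trains on 90 rows and validates on 10, so the dummy scorer averages to roughly 0.9.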
  11. • Parameters are defined through a ParamGridBuilder
      • Performance is measured using an Evaluator:
        ➔ RegressionEvaluator: for numerical predictions
        ➔ BinaryClassificationEvaluator: for binary predictions
        ➔ MulticlassClassificationEvaluator: for multi-category predictions
      • Instantiate a CrossValidator:
        ➔ Define the number of folds, the Estimator (usually the whole Pipeline), the param grid and the Evaluator
        ➔ Then call fit() to search for the most accurate model
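A hedged sketch of a grid search with `CrossValidator`. The synthetic data, the two tuned parameters, their candidate values, the seed, and the 3 folds are all assumptions picked to keep the example small.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in a notebook, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("cv-sketch").getOrCreate()
import spark.implicits._

// Synthetic data (made up): the label is 1.0 when x > 50.
val data = (1 to 100).map(i => (i.toDouble, if (i > 50) 1.0 else 0.0)).toDF("x", "label")

val assembler = new VectorAssembler().setInputCols(Array("x")).setOutputCol("features")
val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(assembler, lr))

// 2 x 2 = 4 parameter combinations to evaluate.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .addGrid(lr.maxIter, Array(10, 50))
  .build()

// The whole Pipeline is passed as the Estimator to tune.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setNumFolds(3)
  .setSeed(42L)

// fit() evaluates every combination on every fold and keeps the best model.
val cvModel = cv.fit(data)
val predictions = cvModel.transform(data)
```

`cvModel.avgMetrics` holds one averaged metric per parameter combination (area under ROC by default for `BinaryClassificationEvaluator`), in the same order as `paramGrid`.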
  12. Demo: Titanic on Databricks
      Scala: http://bit.ly/2nJQc1m
      Python: http://bit.ly/2mV0iIF