Introduction to Spark ML / Databricks

Alexis Seigneurin

April 05, 2017

Transcript

  1. UpSkill Workshop 9: Introduction to Spark/Databricks McLean, VA | March 30th 2017
    WORKSHOP :

  10.
    RDD
    Low-level API

  13.
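    // Pair each word of the poem with its length: RDD[(Int, String)]
    val words = sc.textFile("poem.txt")
      .flatMap(line => line.split("\\s+"))
      .map(word => (word.length, word))
    // Number of words by length (countByKey() returns a scala.collection.Map)
    val wordLen: scala.collection.Map[Int, Long] = words.countByKey()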

  14.
    DataFrame
    API for structured data

  17.
    // Enables the $"col" syntax (already imported in spark-shell and Databricks notebooks)
    import spark.implicits._

    val trainDF = spark.read
      .option("header", "true")
      .csv("titanic_train.csv")
      .withColumn("Age", $"Age".cast("double"))
      .withColumn("Pclass", $"Pclass".cast("int"))
    trainDF.show(3)
    trainDF.printSchema()

    Output:
    +-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
    |PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
    +-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
    |          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
    |          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
    |          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
    +-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
    only showing top 3 rows
    root
     |-- PassengerId: string (nullable = true)
     |-- Survived: string (nullable = true)
     |-- Pclass: integer (nullable = true)
     |-- Name: string (nullable = true)
     |-- Sex: string (nullable = true)
     |-- Age: double (nullable = true)
     |-- ...

  18.
    Spark MLlib / ML
    Machine Learning

  19.
    • Machine Learning library
    • Algorithms of different types:
      • Classification
      • Regression
      • Clustering
      • Collaborative filtering
      • …
    • Implemented using Spark-specific features:
      • Partitioning of the data
      • Iterative processing performed in memory
    • Optimized for large volumes of data

  20.
    • MLlib (RDD-based API): Machine Learning library built on top of RDDs
    • Specific data structures
      • Vector: a list of features, can be dense or sparse
      • LabeledPoint: a vector + a label
    • In supervised learning:
      • Training data: RDD[LabeledPoint]
      • Test data: RDD[Vector]
    • No longer recommended → use Spark ML (DataFrame-based) instead
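
The data structures listed above can be sketched as follows. This is a minimal illustration that assumes the spark-mllib library is on the classpath; the feature values and the label are made up.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Dense vector: every value is stored
val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)

// Sparse vector of size 3: only the non-zero values at indices 0 and 2 are stored
val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// LabeledPoint: a label (always a Double) plus a feature vector
val point = LabeledPoint(1.0, dense)

println(s"dense=$dense sparse=$sparse point=$point")
```

Both vectors hold the same three values; the sparse form only saves memory when most entries are zero.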

  21.
    ● Machine Learning library built on top of DataFrames
    ➔ New in Spark 1.2
    ● API with high-level components:
    ➔ Estimators: they generate a Model from training data
    ➔ Transformers: they transform the data
    ➔ A Model is itself a Transformer
    ➔ A Machine Learning algorithm is an Estimator
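
A minimal sketch of the Estimator / Model / Transformer relationship described above, assuming a local SparkSession and a tiny made-up dataset. LogisticRegression is used only as an example algorithm; it is not necessarily the one used in the workshop.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration (a notebook already provides `spark`)
val spark = SparkSession.builder().master("local[*]").appName("estimator-sketch").getOrCreate()

// Made-up training data using the conventional "label" and "features" column names
val training = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 1.0)),
  (1.0, Vectors.dense(1.0, 0.0)),
  (0.0, Vectors.dense(0.1, 0.9)),
  (1.0, Vectors.dense(0.9, 0.2))
)).toDF("label", "features")

// The algorithm is an Estimator
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

// fit() generates a Model from the training data
val model: LogisticRegressionModel = lr.fit(training)

// The Model is itself a Transformer: transform() appends the prediction columns
val predictions = model.transform(training)
predictions.select("label", "prediction").show()
```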

  22.
    ● Steps:
    ➔ On training data, for each stage: if the stage is an Estimator, call its fit()
    method and the transform() method of the resulting Model; if the stage is a
    Transformer, call its transform() method
    ➔ To make predictions on test or new data, call the transform() method of
    each Transformer (including Models from the training step)
    ● Tedious → use Pipelines
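
The manual steps above might look like the sketch below. The rows are made up and only loosely shaped like the Titanic data; the column names `SexIndex` and the choice of LogisticRegression are assumptions made for illustration.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration
val spark = SparkSession.builder().master("local[*]").appName("manual-stages").getOrCreate()

// Made-up training and test rows
val train = spark.createDataFrame(Seq(
  ("male", 22.0, 0.0), ("female", 38.0, 1.0),
  ("female", 26.0, 1.0), ("male", 35.0, 0.0)
)).toDF("Sex", "Age", "label")
val test = spark.createDataFrame(Seq(("female", 30.0), ("male", 40.0))).toDF("Sex", "Age")

// Stage 1 is an Estimator: fit() on the training data, then transform() with the resulting Model
val sexIndexerModel = new StringIndexer().setInputCol("Sex").setOutputCol("SexIndex").fit(train)
val trainIndexed = sexIndexerModel.transform(train)

// Stage 2 is a Transformer: transform() only
val assembler = new VectorAssembler().setInputCols(Array("SexIndex", "Age")).setOutputCol("features")
val trainFeatures = assembler.transform(trainIndexed)

// Stage 3 is the algorithm, an Estimator: fit() yields the predictive Model
val lrModel = new LogisticRegression().setMaxIter(10).setRegParam(0.01).fit(trainFeatures)

// Predictions: call transform() on every Transformer, in the same order as in training
val predictions = lrModel.transform(assembler.transform(sexIndexerModel.transform(test)))
predictions.select("Sex", "Age", "prediction").show()
```

Every intermediate DataFrame and fitted Model has to be tracked by hand, which is the tedium that Pipelines remove.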

  23.
    ● Spark ML provides an API to build a Pipeline composed of
    Estimators and Transformers
    ● Steps:
    ➔ Assemble the Pipeline
    ➔ Call Pipeline.fit() on training data to get a PipelineModel: the fit() method is
    called on each Estimator and the transform() method is called on each
    Transformer (including Models)
    ➔ Call PipelineModel.transform() on data to make predictions: the transform()
    method is called on each component of the Pipeline
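
The Pipeline steps above can be sketched as follows, again on made-up rows. The stages (StringIndexer, VectorAssembler, LogisticRegression) and the column names are assumptions chosen for illustration.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration
val spark = SparkSession.builder().master("local[*]").appName("pipeline-sketch").getOrCreate()

// Made-up training and test rows
val train = spark.createDataFrame(Seq(
  ("male", 22.0, 0.0), ("female", 38.0, 1.0),
  ("female", 26.0, 1.0), ("male", 35.0, 0.0)
)).toDF("Sex", "Age", "label")
val test = spark.createDataFrame(Seq(("female", 30.0), ("male", 40.0))).toDF("Sex", "Age")

// Assemble the Pipeline from Estimators and Transformers
val sexIndexer = new StringIndexer().setInputCol("Sex").setOutputCol("SexIndex")
val assembler = new VectorAssembler().setInputCols(Array("SexIndex", "Age")).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val pipeline = new Pipeline().setStages(Array[PipelineStage](sexIndexer, assembler, lr))

// fit() runs fit()/transform() on each stage in order and returns a PipelineModel
val pipelineModel: PipelineModel = pipeline.fit(train)

// transform() applies every fitted stage to new data
val predictions = pipelineModel.transform(test)
predictions.select("Sex", "Age", "prediction").show()
```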

  24.
    ● Conventions for column names:
    ➔ The vector of features: “features”
    ➔ The label in training data: “label”
    ➔ The predicted label: “prediction”
    ● These default names are used unless other column names are
    specified explicitly
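
A short sketch of the default column names and how to override them, using RandomForestClassifier as an example. The custom names `Survived`, `passengerFeatures` and `predictedSurvival` are hypothetical.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// With no configuration, the conventional column names apply
val defaults = new RandomForestClassifier()
println(s"${defaults.getFeaturesCol}, ${defaults.getLabelCol}, ${defaults.getPredictionCol}")

// Hypothetical custom names, specified explicitly
val custom = new RandomForestClassifier()
  .setFeaturesCol("passengerFeatures")
  .setLabelCol("Survived")
  .setPredictionCol("predictedSurvival")
```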

  25.
    Feature engineering (org.apache.spark.ml.feature):
    • StringIndexer
    • OneHotEncoder
    • VectorAssembler
    • ...
    ML algorithms:
    • RandomForestClassifier
    • LinearRegression
    • KMeans
    • ...

  26.
    Spark ML supports hyperparameter tuning by Grid Search, evaluated with
    k-fold Cross Validation
    • With k=10, the model is first trained on 9/10 of the data and validated on the
    remaining 1/10.
    • It is then trained and validated on another split, and so on, 10 times in total.
    The final metric (e.g. accuracy) for a parameter combination is the average of the
    values measured on each fold
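
The fold arithmetic can be illustrated in plain Scala. This is only a sketch: Spark assigns rows to folds randomly, whereas the modulo assignment here is a deterministic stand-in, and the per-fold accuracies are made-up numbers.

```scala
// 20 made-up row ids standing in for a dataset
val rows = (0 until 20).toVector
val k = 10

// Assign each row to one of k folds (Spark does this randomly; modulo is a stand-in)
val folds: Map[Int, Vector[Int]] = rows.groupBy(_ % k)

// For each fold: validate on that fold, train on the other k - 1 folds
val splits = (0 until k).map { i =>
  val validation = folds(i)
  val training = rows.filterNot(r => validation.contains(r))
  (training, validation)
}

// Hypothetical accuracies, one per fold, as an evaluator might report them
val foldAccuracies = Vector(0.80, 0.82, 0.78, 0.81, 0.79, 0.83, 0.80, 0.77, 0.82, 0.81)

// The final accuracy is the average over the k folds
val finalAccuracy = foldAccuracies.sum / foldAccuracies.size
println(f"average accuracy over $k folds: $finalAccuracy%.3f")
```

With 20 rows and k=10, each split trains on 18 rows (9/10) and validates on 2 rows (1/10).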

  27.
    ● Parameters are defined through a ParamGridBuilder
    ● Performance is measured using an Evaluator:
    ➔ RegressionEvaluator: for numerical predictions
    ➔ BinaryClassificationEvaluator: for binary predictions
    ➔ MulticlassClassificationEvaluator: for multi-category predictions
    ● Instantiate a CrossValidator:
    ➔ Define the number of folds, the Estimator (typically the Pipeline), the param grid and
    the Evaluator
    ➔ Then call fit() to search for the most accurate model
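
A minimal sketch of the CrossValidator setup described above. The dataset is generated and made up, the grid over `regParam` and the use of 3 folds are arbitrary choices to keep the example small, and the Pipeline stages are assumptions rather than the workshop's actual ones.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration
val spark = SparkSession.builder().master("local[*]").appName("cv-sketch").getOrCreate()

// 40 made-up rows: (Sex, Age, label)
val rows = (0 until 40).map { i =>
  val female = i % 2 == 0
  (if (female) "female" else "male", 20.0 + i, if (female) 1.0 else 0.0)
}
val data = spark.createDataFrame(rows).toDF("Sex", "Age", "label")

// The Pipeline to tune
val sexIndexer = new StringIndexer().setInputCol("Sex").setOutputCol("SexIndex")
val assembler = new VectorAssembler().setInputCols(Array("SexIndex", "Age")).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array[PipelineStage](sexIndexer, assembler, lr))

// Parameters to search are defined through a ParamGridBuilder
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

// The Pipeline is passed as the Estimator, together with the grid, the Evaluator and the fold count
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setNumFolds(3)
  .setSeed(42L)

// fit() trains one model per fold and per parameter combination, then refits the best one
val cvModel = cv.fit(data)
println(cvModel.avgMetrics.mkString(", ")) // one averaged metric per parameter combination
```

`cvModel.bestModel` holds the Pipeline refitted on the full dataset with the best parameter combination, and `cvModel.transform()` uses it to make predictions.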

  29.
    Titanic - on Databricks
    Scala: http://bit.ly/2nJQc1m
    Python: http://bit.ly/2mV0iIF
    Demo
