Scalable Machine Learning

Quick overview of MLlib and H2O

Sam Bessalah
November 20, 2014

Transcript

  1. Scalable Machine Learning

  2. me:
    Software Engineer, Freelance
    Big Data, Distributed Computing, Machine Learning
    Paris Data Geek Co-organizer
    @DataParis
    @samklr
    Sam Bessalah


  3. Machine Learning Land
     [logo collage of ML tools, including Vowpal Wabbit]

  4. Some Observations in Big Data Land
     ● New use cases push towards faster execution platforms and real-time
       prediction engines.
     ● Traditional MapReduce on Hadoop is fading away, especially for
       Machine Learning.
     ● Apache Spark has become the darling of the Big Data world, thanks to
       its high-level API and performance.
     ● Rise of public Machine Learning APIs that make it easy to integrate
       models into applications and other data processing workflows.


  5. Apache Mahout
     ● Used to be the only Machine Learning framework on Hadoop MapReduce.
     ● Moved from MapReduce towards modern and faster backends, namely
       Spark and H2O.
     ● Now provides a fluent DSL that integrates with Scala and Spark
       (see the example on slide 7).


  6. [image slide]

  7. Mahout Example
     Simple co-occurrence analysis in Mahout:

     // Distributed row matrix of user/item interactions
     val A = drmFromHDFS("hdfs://nivdul/babygirl.txt")

     // Co-occurrence matrix: A' * A
     val cooccurrenceMatrix = A.t %*% A

     // Broadcast per-column interaction counts to all workers
     val numInteractions = drmBroadcast(A.colSums)

     // Turn raw co-occurrence counts into indicator scores (sketch:
     // computeLLR stands in for a user-defined log-likelihood ratio test)
     val indicators = cooccurrenceMatrix.mapBlock() {
       case (keys, block) =>
         val indicatorBlock = sparse(block.nrow, block.ncol)
         for (r <- 0 until block.nrow)
           indicatorBlock(r, ::) = computeLLR(block(r, ::), numInteractions)
         (keys, indicatorBlock)
     }


  8. Apache Spark
     A dataflow system materialized by immutable, lazy, in-memory distributed
     collections, well suited to the iterative and complex transformations found
     in most Machine Learning algorithms.
     These in-memory collections are called Resilient Distributed Datasets (RDDs).
     They provide:
     ● Partitioned data
     ● High-level operations (map, filter, collect, reduce, zip, join, sample, etc.)
     ● No side effects
     ● Fault recovery via lineage


  9. Some operations on RDDs
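     The original slide illustrates these with a figure; below is a minimal
     sketch in Scala (assuming a SparkContext sc and a hypothetical log file)
     of a few common RDD transformations and actions:

     // Transformations are lazy: nothing executes until an action is called
     val lines   = sc.textFile("hdfs://nivdul/events.log")  // partitioned data
     val errors  = lines.filter(_.contains("ERROR"))        // transformation
     val lengths = errors.map(_.length)                     // transformation
     val sampled = errors.sample(false, 0.1)                // 10% random sample

     // Actions trigger the actual computation
     val total     = lengths.reduce(_ + _)                  // sum of line lengths
     val firstFive = errors.take(5)                         // first five matches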


  10. Spark Ecosystem


  11. MLlib
      The Machine Learning library within Spark:
      ● Provides an integrated predictive and data analysis workflow
      ● A broad collection of algorithms and applications
      ● Integrates with the whole Spark ecosystem
      APIs in three languages: Scala, Java, and Python.


  12. Algorithms in MLlib


  13. Example: Clustering via K-means

      import org.apache.spark.mllib.clustering.KMeans
      import org.apache.spark.mllib.linalg.Vectors

      // Load and parse the data
      val data = sc.textFile("hdfs://bbgrl/dataset.txt")
      val parsedData = data
        .map { x => Vectors.dense(x.split(" ").map(_.toDouble)) }
        .cache()

      // Cluster the data into 5 classes using K-means
      val clusters = KMeans.train(parsedData, k = 5, maxIterations = 20)

      // Evaluate the model by its within-set sum of squared errors
      val cost = clusters.computeCost(parsedData)


  14. [image slide]

  15. Coming in Spark 1.2
      ● Ensembles of decision trees: Random Forests
      ● Boosting
      ● Topic modeling
      ● Streaming K-means (see the sketch below)
      ● A pipeline interface for machine learning workflows
      A lot of contributions from the community.
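      A minimal sketch of the streaming K-means API (the stream names
      trainingStream and testStream are assumptions; both carry text-encoded
      vectors):

      import org.apache.spark.mllib.clustering.StreamingKMeans
      import org.apache.spark.mllib.linalg.Vectors

      // The model is updated incrementally as each new batch arrives
      val model = new StreamingKMeans()
        .setK(5)                    // number of clusters
        .setDecayFactor(1.0)        // 1.0 = all past data is kept in the model
        .setRandomCenters(3, 0.0)   // dimension 3, initial cluster weight 0.0

      model.trainOn(trainingStream.map(Vectors.parse))        // update per batch
      model.predictOn(testStream.map(Vectors.parse)).print()  // emit cluster ids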


  16. Machine Learning Pipeline
      Typical machine learning workflows are complex!
      A pipeline interface is coming in the next iterations of MLlib, sketched below.
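      A sketch of that pipeline interface, using names from the spark.ml API as
      it later stabilized (the exact interface was still in flux at the time);
      trainingData and testData are assumed tables with "text" and "label"
      columns:

      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.classification.LogisticRegression
      import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

      // Chain feature extraction and model training into a single workflow
      val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
      val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
      val lr = new LogisticRegression().setMaxIter(10)

      val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

      // Fit all stages at once, then reuse the fitted model on new data
      val model = pipeline.fit(trainingData)
      val predictions = model.transform(testData)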


  17. ● H2O is a fast (really fast) statistics, Machine Learning and math
        engine on the JVM.
      ● Developed by 0xdata (a commercial entity), with a focus on bringing
        robust, highly performant machine learning algorithms to popular
        Big Data workloads.
      ● Has APIs in R, Java, Scala and Python, and integrates with
        third-party tools like Tableau and Excel.


  18. [image slide]

  19. Example in R

      library(h2o)
      # Connect to (or start) a local H2O instance
      localH2O = h2o.init(ip = "localhost", port = 54321)
      irisPath = system.file("extdata", "iris.csv", package = "h2o")
      iris.hex = h2o.importFile(localH2O, path = irisPath, key = "iris.hex")
      iris.data.frame <- as.data.frame(iris.hex)
      > colnames(iris.hex)
      [1] "C1" "C2" "C3" "C4" "C5"


  20. Simple logistic regression to predict prostate cancer outcomes:

      > prostate.hex = h2o.importFile(localH2O,
          path = "https://raw.github.com/0xdata/h2o/../prostate.csv",
          key = "prostate.hex")
      > prostate.glm = h2o.glm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "DCAPS"),
          data = prostate.hex, family = "binomial", nfolds = 10, alpha = 0.5)
      > prostate.fit = h2o.predict(object = prostate.glm, newdata = prostate.hex)


  21. > (prostate.fit)
      IP Address: 127.0.0.1
      Port: 54321
      Parsed Data Key: GLM2Predict_8b6890653fa743be9eb3ab1668c5a6e9
        predict        X0        X1
      1       0 0.7452267 0.2547732
      2       1 0.3969807 0.6030193
      3       1 0.4120950 0.5879050
      4       1 0.3726134 0.6273866
      5       1 0.6465137 0.3534863
      6       1 0.4331880 0.5668120


  22. Sparkling Water
      Transparent use of H2O data and algorithms with the Spark API.
      Provides a custom RDD: H2ORDD.


  23. [image slide]

  24. [image slide]

  25. val sqlContext = new SQLContext(sc)
      import sqlContext._

      // Register the H2O-backed table for SQL queries
      airlinesTable.registerTempTable("airlinesTable")

      val query = """SELECT * FROM airlinesTable
                     WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'"""
      val result = sql(query)
      result.count


  26. Same but with the Spark API

      // H2OContext provides useful implicits for conversions
      val h2oContext = new H2OContext(sc)
      import h2oContext._

      // Create an RDD wrapper around the H2O DataFrame
      val airlinesTable: RDD[Airlines] = toRDD[Airlines](airlinesData)
      airlinesTable.count

      // And use the Spark RDD API directly
      val flightsOnlyToSF = airlinesTable.filter { f =>
        f.Dest == Some("SFO") || f.Dest == Some("SJC") || f.Dest == Some("OAK")
      }
      flightsOnlyToSF.count


  27. Build a model

      import hex.deeplearning._
      import hex.deeplearning.DeepLearningModel.DeepLearningParameters

      val dlParams = new DeepLearningParameters()
      dlParams._training_frame = result('Year, 'Month, 'DayofMonth, 'DayOfWeek,
                                        'CRSDepTime, 'CRSArrTime, 'UniqueCarrier,
                                        'FlightNum, 'TailNum, 'CRSElapsedTime,
                                        'Origin, 'Dest, 'Distance, 'IsDepDelayed)
      dlParams.response_column = 'IsDepDelayed.name

      // Create a new model builder and train the deep learning model
      val dl = new DeepLearning(dlParams)
      val dlModel = dl.train.get


  28. Predict

      // Use the model to score the data
      val prediction = dlModel.score(result)('predict)

      // Collect predicted values via the RDD API
      val predictionValues = toRDD[DoubleHolder](prediction)
        .collect
        .map(_.result.getOrElse("NaN"))


  29. [image slide]

  30. Slides: http://speakerdeck.com/samklr/
