Some Observations in Big Data Land
● New use cases push towards faster execution platforms and real-time prediction engines.
● Traditional MapReduce on Hadoop is fading away, especially for Machine Learning.
● Apache Spark has become the darling of the Big Data world, thanks to its high-level API and performance.
● Rise of public Machine Learning APIs that make it easy to integrate models into applications and other data processing workflows.
Apache Mahout
● Used to be the de facto Machine Learning framework on Hadoop MapReduce.
● Moved from MapReduce towards modern and faster backends, namely Apache Spark.
● Now provides a fluent DSL that integrates with Scala and Spark.
Mahout Example
Simple co-occurrence analysis with the Mahout DSL (a sketch; computeLLR stands in for the actual log-likelihood-ratio scoring function):

// Load a distributed row matrix of interactions
val A = drmFromHDFS("hdfs://nivdul/babygirl.txt")

// Compute the co-occurrence matrix
val cooccurrenceMatrix = A.t %*% A

// Broadcast the number of interactions per item
val numInteractions = drmBroadcast(A.colSums)

// Downsample raw co-occurrence counts into LLR-based indicator scores
val I = cooccurrenceMatrix.mapBlock() { case (keys, block) =>
  val indicatorBlock = block.like() // empty matrix with the same shape
  for (r <- 0 until block.nrow)
    indicatorBlock(r, ::) := computeLLR(block(r, ::), numInteractions)
  keys -> indicatorBlock
}
Apache Spark is a dataflow system built around immutable, lazy, in-memory distributed collections, well suited for iterative and complex transformations like those found in most Machine Learning algorithms. These in-memory collections are called Resilient Distributed Datasets (RDDs). They provide:
● Partitioned data
● High-level operations (map, filter, collect, reduce, zip, join, sample, etc.)
● No side effects
● Fault recovery via lineage
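A minimal sketch of those properties, assuming an existing SparkContext named sc (the numbers and variable names are illustrative):

import org.apache.spark.rdd.RDD

// Transformations are lazy: nothing executes until an action is called
val numbers: RDD[Int] = sc.parallelize(1 to 1000, numSlices = 4) // partitioned data
val evens   = numbers.filter(_ % 2 == 0)                         // lazy, no side effects
val squares = evens.map(n => n * n)                              // lineage recorded, not executed

// collect() is an action: it triggers the computation; a lost partition
// is rebuilt by replaying the lineage above
val result = squares.collect()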
MLlib
Machine Learning library within Spark:
● Provides an integrated predictive and data analysis workflow
● Broad collection of algorithms and applications
● Integrates with the whole Spark ecosystem
APIs in three languages: Scala, Java and Python
Example: Clustering via K-means

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("hdfs://bbgrl/dataset.txt")
val parsedData = data
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .cache()

// Cluster the data into 5 classes using K-means
val clusters = KMeans.train(parsedData, k = 5, maxIterations = 20)

// Evaluate the model: within-cluster sum of squared errors (WSSSE)
val cost = clusters.computeCost(parsedData)
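The trained KMeansModel can then assign new points to clusters; a short usage sketch (the sample vector is made up, and its dimension must match the training data):

// Score a single point
val point = Vectors.dense(0.3, 1.2, 5.7)
val clusterId = clusters.predict(point)

// Or score a whole RDD of points at once
val assignments = clusters.predict(parsedData)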
Coming in Spark 1.2
● Ensembles of decision trees: Random Forests (sketched below)
● Boosting
● Topic modeling
● Streaming K-means
● A pipeline interface for machine learning workflows
A lot of contributions from the community.
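As a taste of the tree ensembles, a hedged Random Forest sketch against the MLlib decision tree API (the dataset path and parameter values are assumptions):

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

// Load a dataset in LIBSVM format (illustrative path)
val labeledData = MLUtils.loadLibSVMFile(sc, "hdfs://bbgrl/sample_libsvm_data.txt")

// Train a binary classifier with 10 trees
val model = RandomForest.trainClassifier(labeledData,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](), // all features treated as continuous
  numTrees = 10,
  featureSubsetStrategy = "auto",            // let MLlib choose features per node
  impurity = "gini",
  maxDepth = 4,
  maxBins = 32)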
H2O
● H2O is a fast (really fast) statistics, Machine Learning and maths engine on the JVM.
● Developed by 0xdata (a commercial entity), with a focus on bringing robust, highly performant Machine Learning algorithms to popular Big Data workloads.
● Has APIs in R, Java, Scala and Python, and integrates with third-party tools like Tableau and Excel.
val sqlContext = new SQLContext(sc)
import sqlContext._

// The airlines data was loaded through H2O, then registered as a Spark SQL table
airlinesTable.registerTempTable("airlinesTable")

val query = "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'"
val result = sql(query)
result.count
Same but with the Spark RDD API

// H2OContext provides useful implicits for conversions
val h2oContext = new H2OContext(sc)
import h2oContext._

// Create an RDD wrapper around the H2O DataFrame
val airlinesTable: RDD[Airlines] = toRDD[Airlines](airlinesData)
airlinesTable.count

// And use the Spark RDD API directly
val flightsOnlyToSF = airlinesTable.filter(f =>
  f.Dest == Some("SFO") || f.Dest == Some("SJC") || f.Dest == Some("OAK"))
flightsOnlyToSF.count
Build a model

import hex.deeplearning._
import hex.deeplearning.DeepLearningModel.DeepLearningParameters

val dlParams = new DeepLearningParameters()

// Train on the columns selected from the SQL query result
dlParams._training_frame = result('Year, 'Month, 'DayofMonth, 'DayOfWeek,
  'CRSDepTime, 'CRSArrTime, 'UniqueCarrier, 'FlightNum, 'TailNum,
  'CRSElapsedTime, 'Origin, 'Dest, 'Distance, 'IsDepDelayed)
dlParams._response_column = 'IsDepDelayed.name

// Create a new model builder and train it
val dl = new DeepLearning(dlParams)
val dlModel = dl.train.get
Predict

// Use the model to score the data
val prediction = dlModel.score(result)('predict)

// Collect the predicted values via the RDD API
val predictionValues = toRDD[DoubleHolder](prediction)
  .collect
  .map(_.result.getOrElse("NaN"))