Rights Reserved

Enriching truck events for analysis with Pig

[Diagram]
- Inputs on HDFS, with metadata in HCatalog: Raw Truck Events; Raw Weather Data (from weather data sets); Payroll Data (from HR & payroll DBs)
- Pig pipeline: Load raw truck events -> Clean & Filter (cleaned events) -> Transform (transformed events) -> Join with HR & weather data (enriched events) -> Store enriched events
- Tableau

CDO's vision: Build a Predictive Business, not a Reactive one

CDO's Requirements
- Offline predictions
  - Identify investments that will increase safety and reduce the company's liabilities
- Real-time predictions
  - Anticipate driver violations before they happen and take precautionary actions

Data Scientist's Response
- ♬ I've been waiting for this moment all my life ♬
- Verify BI tool trends against TBs of events data via machine learning
- Generate predictive models with Spark MLlib on HDP
- Plug Spark models into Storm to predict driver violations in real time

Integrate Predictive Analytics in Stream Processing

[Diagram]
- Components: Truck Sensors; Inbound Messaging (Kafka); Stream Processing (Storm); Real-time Serving (HBase); Interactive Query (Hive on Tez); Machine Learning (Spark); HDFS; YARN
- Data: millions of enriched truck events
- Machine Learning (Spark): train Spark ML model with millions of truck events
- Prediction Bolt: plug Spark model into Storm bolt

Building the Predictive Model on HDP

1. Explore a small subset of events (Tableau) to identify predictive features and form a hypothesis, e.g. "foggy weather causes driver violations".
2. Identify suitable ML algorithms to train a model. We will use classification algorithms, since the events data is labeled.
3. Transform the enriched events data into a format that is friendly to Spark MLlib; many ML libraries expect training data in a specific format.
4. Train a logistic regression classification model with Spark on YARN, using the transformed events as training input, and iterate to fine-tune the generated model.
5. Integrate the Spark MLlib model in a Storm bolt to predict violations in real time.

Transforming training data for Spark MLlib

Enriched Events Data

Event Type  Is Driver Certified?  Wage Plan  Hours Driven  Miles Driven  Longitude  Latitude  Weather Foggy  Weather Rainy  Weather Windy
Normal      Yes                   Hourly     45            2721          -91.3      38.14     No             No             No
Overspeed   No                    Miles      72            4152          -94.23     37.09     Yes            Yes            No
…           …                     …          …             …             …          …         …              …              …

Spark MLlib Training Data

Label  Is Driver Certified?  Wage Plan  Hours Driven  Miles Driven  Weather Foggy  Weather Rainy  Weather Windy
0      1                     1          0.45          0.2721        0              0              0
1      0                     0          0.72          0.4152        1              1              0
…      …                     …          …             …             …              …              …

- Normal events are labeled 0 and violation events 1
- Feature scaling is applied to hours and miles to improve algorithm performance
- Features with binary values are denoted as 0 and 1

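The mapping from an enriched event to a training row can be sketched in plain Java. This is only an illustration, not the deck's actual transformation code: the class and method names are hypothetical, and the scale divisors (100 for hours, 10000 for miles) and the Hourly = 1 / Miles = 0 encoding are read off the two example rows above. Longitude and latitude are dropped, as in the training table.

```java
import java.util.Arrays;

/** Hypothetical helper: turns one enriched event into a label plus a 7-feature row. */
final class TrainingRowSketch {
    // Assumed divisors, inferred from the example rows (45 -> 0.45, 2721 -> 0.2721).
    static final double HOURS_SCALE = 100.0;
    static final double MILES_SCALE = 10000.0;

    /** Normal events are labeled 0; any other event type is treated as a violation (1). */
    static double label(String eventType) {
        return "Normal".equalsIgnoreCase(eventType) ? 0.0 : 1.0;
    }

    /** Encodes a Yes/No column as 1/0. */
    static double yesNo(String value) {
        return "Yes".equalsIgnoreCase(value) ? 1.0 : 0.0;
    }

    /** Feature order follows the training table: certified, wage plan, hours, miles, foggy, rainy, windy. */
    static double[] features(String certified, String wagePlan, double hours, double miles,
                             String foggy, String rainy, String windy) {
        return new double[] {
            yesNo(certified),
            "Hourly".equalsIgnoreCase(wagePlan) ? 1.0 : 0.0,
            hours / HOURS_SCALE,
            miles / MILES_SCALE,
            yesNo(foggy),
            yesNo(rainy),
            yesNo(windy)
        };
    }

    public static void main(String[] args) {
        // Second example row of the enriched events table.
        double y = label("Overspeed");
        double[] x = features("No", "Miles", 72, 4152, "Yes", "Yes", "No");
        System.out.println(y + " " + Arrays.toString(x));
    }
}
```

The on-disk format that the training job expects is not shown on the slide, so the sketch stops at an in-memory row.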
Running Spark ML on YARN

1. Run the spark-submit script to launch a Spark job on YARN:

    spark-submit --class org.apache.spark.examples.mllib.BinaryClassification \
      --master yarn-cluster \
      --num-executors 3 \
      --driver-memory 512m \
      --executor-memory 512m \
      --executor-cores 1 \
      truckml.jar \
      --algorithm LR --regType L2 --regParam 1.0 \
      /user/root/truck_training \
      --numIterations 100

   Training data location on HDFS: /user/root/truck_training

2. Monitor progress of the Spark job in the YARN Resource Manager UI.

Integrating Spark model in Storm

[Diagram] Kafka Spout -> Storm Prediction Bolt -> ActiveMQ -> Ops Center, LOB Dashboards; the bolt reads from Real-time Serving (HBase)

Prediction Bolt
- Initialize Spark model
- Parse truck event
- Enrich event with HBase data
- Predict violation with model
- Send alert if violation predicted

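The bolt's per-event steps can be sketched without any Storm, HBase, or Spark dependency. Everything here is an assumption made for illustration: the class name, the comma-separated event layout ("driverId,foggy,rainy,windy" with 0/1 flags), the in-memory map standing in for the HBase lookup, and the function standing in for the Spark model. A real bolt would implement Storm's bolt interface and emit the alert to a stream rather than return it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.ToDoubleFunction;

/** Hypothetical, framework-free outline of the prediction bolt's per-event logic. */
final class PredictionBoltSketch {
    /** Driver attributes that the real bolt would look up in HBase (already encoded and scaled). */
    static final class DriverInfo {
        final double certified, hourlyWagePlan, hoursScaled, milesScaled;
        DriverInfo(double certified, double hourlyWagePlan, double hoursScaled, double milesScaled) {
            this.certified = certified;
            this.hourlyWagePlan = hourlyWagePlan;
            this.hoursScaled = hoursScaled;
            this.milesScaled = milesScaled;
        }
    }

    private final Map<String, DriverInfo> driverStore;  // stand-in for HBase
    private final ToDoubleFunction<double[]> model;     // stand-in for the initialized Spark model

    PredictionBoltSketch(Map<String, DriverInfo> driverStore, ToDoubleFunction<double[]> model) {
        this.driverStore = driverStore;
        this.model = model;
    }

    /** Parses, enriches, and scores one event; returns an alert message only if a violation is predicted. */
    Optional<String> process(String rawEvent) {
        String[] parts = rawEvent.split(",");            // parse (assumed layout)
        if (parts.length != 4) return Optional.empty();  // skip malformed events
        String driverId = parts[0].trim();
        DriverInfo info = driverStore.get(driverId);     // enrich
        if (info == null) return Optional.empty();       // unknown driver: nothing to score
        double[] features = {
            info.certified, info.hourlyWagePlan, info.hoursScaled, info.milesScaled,
            Double.parseDouble(parts[1].trim()),
            Double.parseDouble(parts[2].trim()),
            Double.parseDouble(parts[3].trim())
        };
        double prediction = model.applyAsDouble(features);  // predict
        return prediction >= 0.5
            ? Optional.of("ALERT driver " + driverId + ": violation predicted")
            : Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, DriverInfo> store = new HashMap<>();
        store.put("driver-1", new DriverInfo(0, 0, 0.72, 0.4152));
        // Toy model for the demo only: predicts a violation whenever the foggy flag is set.
        PredictionBoltSketch bolt = new PredictionBoltSketch(store, f -> f[4] >= 1.0 ? 1.0 : 0.0);
        System.out.println(bolt.process("driver-1,1,1,0").orElse("no alert"));
    }
}
```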
Recommendations to CDO

- Investment recommendations, in order of priority
  1. Invest in visibility sensors and auto braking systems to deal with foggy conditions
  2. Invest in slip-resistant tires to fight rainy conditions
  3. Invest in certifying drivers to reduce violation probability
- Power of real-time predictions
  - 40% reduction in violation rates by predicting high-risk situations in real time and sending immediate alerts to drivers

Value of large scale ML on HDP

- Accelerate time to market/value
  - Test out multiple ML algorithms against TBs of training data in reasonable time frames
- Confirm hypotheses against TBs of training data with confidence
  - We confirmed that fog does impact safety and wage plans do not, whereas BI tools indicated otherwise
- Easily integrate predictive models in data-driven apps
  - Run predictive models in Storm or any other app in your enterprise
- Run all of the above in a multi-tenant YARN cluster
  - Large scale ML on YARN respects other tenants in an HDP cluster

Calling Spark from a Storm Bolt

- The outputs of a logistic regression model are weights and an intercept value:

    val algorithm = new LogisticRegressionWithSGD()
    val model = algorithm.run(training).clearThreshold()
    println(model.weights)
    println(model.intercept)

  Output:

    Weights [-0.40819922025591465,0.06392530395655666,-0.1346227352186122,-0.07188217286407801,0.7277326276521062,0.508779221680863,-0.024689093098281954]
    Intercept 0.0

- The model can then be reconstructed in a Storm bolt from the above weights to make predictions:

    import org.apache.spark.mllib.classification.LogisticRegressionModel;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;
    ...
    Vector weights = Vectors.dense(new double[] { <array of weights like above> });
    LogisticRegressionModel model = new LogisticRegressionModel(weights, 0.0);
    double prediction = model.predict(<input features>);
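
What a logistic regression prediction amounts to for a single event can be sketched without Spark: the logistic function applied to the dot product of weights and features plus the intercept. The class below is a hypothetical stand-in, not MLlib code. It hard-codes the weights and intercept printed above, and the 0.5 cut-off for turning the probability into a 0/1 prediction is an assumption.

```java
/** Hypothetical, Spark-free stand-in for scoring one event with the trained weights. */
final class LogisticPredictSketch {
    // Weights and intercept as printed by the training run above.
    static final double[] WEIGHTS = {
        -0.40819922025591465, 0.06392530395655666, -0.1346227352186122,
        -0.07188217286407801, 0.7277326276521062, 0.508779221680863,
        -0.024689093098281954
    };
    static final double INTERCEPT = 0.0;

    /** Probability of a violation: sigmoid(w . x + b). */
    static double probability(double[] features) {
        if (features.length != WEIGHTS.length) {
            throw new IllegalArgumentException("expected " + WEIGHTS.length + " features");
        }
        double margin = INTERCEPT;
        for (int i = 0; i < WEIGHTS.length; i++) {
            margin += WEIGHTS[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-margin));
    }

    /** 0/1 prediction using an assumed threshold of 0.5. */
    static double predict(double[] features) {
        return probability(features) > 0.5 ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        // Feature order: certified, wage plan, hours, miles, foggy, rainy, windy.
        double[] foggyRainy = {0, 0, 0.72, 0.4152, 1, 1, 0};
        System.out.println(probability(foggyRainy) + " -> " + predict(foggyRainy));
    }
}
```

The fog and rain weights are the largest positive values in the array, so events with those flags set are the ones pushed toward a violation prediction.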