How to Apply Machine Learning with R, H2O, Apache Spark MLlib or PMML to Real Time Streaming Analytics

How to Apply Machine Learning and Big Data Analytics to Real Time Processing
In March 2016, I gave a talk at Voxxed Zurich about “How to Apply Machine Learning and Big Data Analytics to Real Time Processing”.
Finding Insights with R, H2O, Apache Spark MLlib, PMML and TIBCO Spotfire
"Big Data" is currently heavily hyped. Large amounts of historical data are stored in Hadoop or other platforms. Business intelligence tools and statistical computing are used to derive new knowledge and find patterns in this data, for example for promotions, cross-selling or fraud detection. The key challenge is how to apply the findings from historical data to new transactions in real time to make customers happy, increase revenue or prevent fraud.
Putting Analytic Models into Action via Event Processing and Streaming Analytics
"Fast Data" via stream processing is the solution for embedding patterns, obtained from analyzing historical data, into future transactions in real time. The following slide deck uses several real-world success stories to explain the concepts behind stream processing and its relation to Apache Hadoop and other big data platforms. I discuss how patterns and statistical models from R, Apache Spark MLlib, H2O, and other technologies can be integrated into real-time processing using open source stream processing frameworks (such as Apache Storm, Spark Streaming or Flink) or products (such as IBM InfoSphere Streams or TIBCO StreamBase). A live demo showed the complete development lifecycle, combining analytics with TIBCO Spotfire, machine learning via R, and stream processing via TIBCO StreamBase and TIBCO Live Datamart.

Kai Waehner

March 03, 2016

Transcript

  1. HOW TO APPLY BIG DATA ANALYTICS AND MACHINE LEARNING TO

    REAL TIME PROCESSING Kai Wähner @KaiWaehner www.kai-waehner.de LinkedIn / Xing: Please connect!
  2. Digital Transformation - Physical and Digital Worlds are Merging

    © Copyright 2000-2016 TIBCO Software Inc.
  3. Apply Big Data Analytics to Real Time Processing
  4. Key Take-Aways

    • Insights are hidden in Historical Data on Big Data Platforms • Machine Learning and Big Data Analytics find these Insights by building Analytics Models • Event Processing uses these Models (without Rebuilding) to take Action in Real Time
  5. Agenda

    1) Machine Learning and Big Data Analytics 2) Analysis of Historical Data 3) Real Time Processing 4) Live Demo
  6. Agenda

    1) Machine Learning and Big Data Analytics 2) Analysis of Historical Data 3) Real Time Processing 4) Live Demo
  7. Machine Learning

    Machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look. http://www.sas.com
  8. 10 Examples of Machine Learning

    • Spam Detection • Credit Card Fraud Detection • Digit Recognition • Speech Understanding • Face Detection • Shape Detection • Product Recommendation • Medical Diagnosis • Stock Trading • Customer Segmentation http://machinelearningmastery.com/practical-machine-learning-problems/
  9. 10 Examples of Machine Learning

    • Spam Detection: Given email in an inbox, identify those email messages that are spam and those that are not. Having a model of this problem would allow a program to leave non-spam emails in the inbox and move spam emails to a spam folder. We should all be familiar with this example. • Credit Card Fraud Detection: Given credit card transactions for a customer in a month, identify those transactions that were made by the customer and those that were not. A program with a model of this decision could refund those transactions that were fraudulent. • Digit Recognition: Given zip codes handwritten on envelopes, identify the digit for each handwritten character. A model of this problem would allow a computer program to read and understand handwritten zip codes and sort envelopes by geographic region. • Speech Understanding: Given an utterance from a user, identify the specific request made by the user. A model of this problem would allow a program to understand and make an attempt to fulfil that request. The iPhone with Siri has this capability. • Face Detection: Given a digital photo album of many hundreds of digital photographs, identify those photos that include a given person. A model of this decision process would allow a program to organize photos by person. Some cameras and software like iPhoto have this capability. http://machinelearningmastery.com/practical-machine-learning-problems/
  10. 10 Examples of Machine Learning

    • Product Recommendation: Given a purchase history for a customer and a large inventory of products, identify those products in which that customer will be interested and likely to purchase. A model of this decision process would allow a program to make recommendations to a customer and motivate product purchases. Amazon has this capability. Also think of Facebook and GooglePlus, which recommend users to connect with after you sign up. • Medical Diagnosis: Given the symptoms exhibited in a patient and a database of anonymized patient records, predict whether the patient is likely to have an illness. A model of this decision problem could be used by a program to provide decision support to medical professionals. • Stock Trading: Given the current and past price movements for a stock, determine whether the stock should be bought, held or sold. A model of this decision problem could provide decision support to financial analysts. • Customer Segmentation: Given the pattern of behaviour by a user during a trial period and the past behaviours of all users, identify those users that will convert to the paid version of the product and those that will not. A model of this decision problem would allow a program to trigger customer interventions to persuade the customer to convert early or better engage in the trial. • Shape Detection: Given a user hand drawing a shape on a touch screen and a database of known shapes, determine which shape the user was trying to draw. A model of this decision would allow a program to show the platonic version of the shape the user drew to make crisp diagrams. The Instaviz iPhone app does this. http://machinelearningmastery.com/practical-machine-learning-problems/
  11. Types of Machine Learning Problems

    • Classification: Data is labelled, meaning it is assigned a class, for example spam / non-spam or fraud / non-fraud. • Regression: Data is labelled with a real value (think floating point) rather than a class label. Examples that are easy to understand are time series data like the price of a stock over time. • Clustering: Data is not labelled, but can be divided into groups based on similarity and other measures of natural structure in the data. An example would be organising pictures by faces without names. • Rule Extraction: Data is used as the basis for the extraction of propositional rules (antecedent/consequent aka if-then). An example is the discovery of the relationship between the purchase of beer and diapers. http://machinelearningmastery.com/practical-machine-learning-problems/ (not a complete list)
  12. Closed Loop for Big Data Analytics

    MODEL: Develop model; deploy into Stream Processing flow. ACT: Automatically monitor real-time transactions; automatically trigger action. ANALYZE: Analyze data via Data Discovery; uncover patterns, trends, correlations.
  13. Analytics Maturity Model

    A good Big Data Analytics platform can provide value to the organization across the full spectrum of use cases. [Figure labels: Analytics Maturity (Measure, Diagnose, Predict, Optimize, Operationalize, Automate); Value to the Organization (Immediate, Long-Term, Competitive Advantage); Self-service Dashboards, Predictive and Prescriptive Analytics, Event Processing]
  14. Analytics Maturity Model

    [Figure repeated from slide 13]
  15. Analytics Maturity Model

    [Figure repeated from slide 13]
  16. Agenda

    1) Machine Learning and Big Data Analytics 2) Analysis of Historical Data 3) Real Time Processing 4) Live Demo
  17. Analytics Maturity Model

    [Figure repeated from slide 13]
  18. Data Munging - Transformations

    Raw transactions:

         cust_id  dept  sku    dollar  gift   date
      1  104      C     12003    2.40  FALSE  2016-10-17
      2  105      A     12005   62.85  FALSE  2016-10-17
      3  102      C     12007   69.23  TRUE   2016-10-17
      4  104      B     12004    9.33  FALSE  2016-10-18
      5  105      C     12010   14.16  TRUE   2016-10-18
      6  101      B     12003   90.43  FALSE  2016-10-19
      7  103      C     12005   90.97  FALSE  2016-10-19
      n  …        …     …       …      …      …

    Aggregated per customer:

         cust_id  A      B      C      total   # orders  first_date  last_date
      1  100      21.76  23.67   0.00   45.43  2         2016-10-19  2016-10-20
      2  101       0.01  74.65   0.00   74.66  3         2016-10-19  2016-10-20
      3  102       0.00  60.92  50.29  111.21  6         2016-10-17  2016-10-20
      4  103       0.00   0.00  52.30   52.30  2         2016-10-19  2016-10-20
      5  104      31.34   9.33   2.40   43.06  4         2016-10-17  2016-10-20
      6  105      62.85   0.00  56.00  118.85  3         2016-10-17  2016-10-20
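    The transformation on this slide, from one row per purchase to one row per customer, can be sketched in a few lines of Python. This is only an illustration using the standard library; the function name `summarize` and the record layout are assumptions, not something shown in the talk. Only the seven visible transaction rows are used, so the totals will not match the slide's summary table, which covers more rows.

```python
from collections import defaultdict

# The seven transaction rows visible on the slide:
# (cust_id, dept, sku, dollar, gift, date)
transactions = [
    (104, "C", 12003, 2.40, False, "2016-10-17"),
    (105, "A", 12005, 62.85, False, "2016-10-17"),
    (102, "C", 12007, 69.23, True, "2016-10-17"),
    (104, "B", 12004, 9.33, False, "2016-10-18"),
    (105, "C", 12010, 14.16, True, "2016-10-18"),
    (101, "B", 12003, 90.43, False, "2016-10-19"),
    (103, "C", 12005, 90.97, False, "2016-10-19"),
]


def summarize(rows):
    """Pivot purchases into one summary record per customer (hypothetical helper)."""
    summary = defaultdict(lambda: {"A": 0.0, "B": 0.0, "C": 0.0, "orders": 0,
                                   "first_date": None, "last_date": None})
    for cust_id, dept, _sku, dollar, _gift, date in rows:
        rec = summary[cust_id]
        rec[dept] += dollar          # spend per department becomes a column
        rec["orders"] += 1
        # ISO-formatted dates sort correctly as plain strings
        rec["first_date"] = date if rec["first_date"] is None else min(rec["first_date"], date)
        rec["last_date"] = date if rec["last_date"] is None else max(rec["last_date"], date)
    for rec in summary.values():
        rec["total"] = round(rec["A"] + rec["B"] + rec["C"], 2)
    return dict(summary)


per_customer = summarize(transactions)
for cust_id in sorted(per_customer):
    print(cust_id, per_customer[cust_id])
```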
  19. Exploratory Data Analysis

    Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to 1. maximize insight into a data set 2. uncover underlying structure 3. extract important variables 4. detect outliers and anomalies 5. test underlying assumptions 6. develop parsimonious models 7. determine optimal factor settings
  20. Exploratory Data Analysis

    “The greatest value of a picture is when it forces us to notice what we never expected to see” John W. Tukey, 1977
  21. Analytics Maturity Model

    [Figure repeated from slide 13]
  22. Which picture represents a model?

    A model is a simplification of the truth that helps you with decision making.
  23. Model Building

    Supervised Models – known, labeled responses • Regression (for example Linear Regression) • Categorical (for example Random Forest) Unsupervised Models – no labeled responses • Clustering (for example k-means clustering)
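    To make the supervised / unsupervised distinction concrete, here is a minimal sketch in plain Python: a least-squares line fit (supervised, because the `ys` are known responses) and a one-dimensional k-means (unsupervised, because only the points are given). The function names and toy numbers are illustrative assumptions; in practice a data scientist would use R, Spark MLlib or H2O as discussed in the talk.

```python
def fit_line(xs, ys):
    """Supervised: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def kmeans_1d(points, centers, iterations=10):
    """Unsupervised: Lloyd's k-means algorithm in one dimension."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers


# Labeled data: every x comes with a known response y
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# Unlabeled data: the algorithm has to find the two groups itself
centers = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], centers=[0.0, 10.0])

print(slope, intercept, centers)
```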
  24. Model Building

    Employees who write longer emails earn higher salaries!
  25. Model Validation

    How is the IQ of a kid related to the IQ of his / her mum?
  26. Data Scientists work with many Tools

    • SQL • Excel • Python • R Source: O’Reilly 2015 Data Science Salary Survey http://duu86o6n09pv.cloudfront.net/reports/2015-data-science-salary-survey.pdf
  27. Alternatives for Data Scientists

    [Figure labels: Open Source, Closed Source; Tooling, Source Code (not a complete list); R]
  28. R Language

    R is well known as the most popular programming language used by data scientists for modeling, and its popularity is still growing. It is developing very rapidly and has a very active community.
  29. R with Revolution Analytics (now Microsoft)

    Open Source GPL License (including its restrictions) http://www.revolutionanalytics.com/webinars/introducing-revolution-r-open-enhanced-open-source-r-distribution-revolution-analytics
  30. TERR - TIBCO’s Enterprise Runtime for R

    • TIBCO has rewritten R as a Commercial Compute Engine • Latest statistics scripting engine: S → S-PLUS® → R → TERR • Runs R code including CRAN packages • Engine internals rebuilt from scratch at low-level • Redesigned data objects, memory management • High performance + Big Data • TERR is licensed from TIBCO • TERR installs (free) with Spotfire Analyst / Desktop + other TIBCO products • Spotfire Server can manage all TERR / R scripts, artifacts for reuse • Standalone Developer Edition • Supported by TIBCO • No GPL license issues
  31. Which R to use?

    http://www.forbes.com/sites/danwoods/2016/01/27/microsofts-revolution-analytics-acquisition-is-the-wrong-way-to-embrace-r/
  32. Apache Spark

    General Data-processing Framework → However, focus is especially on Analytics (at least these days) http://fortune.com/2016/09/09/cloudera-spark-mapreduce/
  33. Spark MLlib

    MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs. You can even combine the MLlib module with the R language.
  34. Apache Spark – Focus on Analytics

    http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/ http://fortune.com/2016/09/09/cloudera-spark-mapreduce/ http://www.ebaytechblog.com/2016/05/28/using-spark-to-ignite-data-analytics/ http://www.forbes.com/sites/paulmiller/2016/06/15/ibm-backs-apache-spark-for-big-data-analytics/ “[IBM’s initiatives] include: • deepening the integration between Apache Spark and existing IBM products like the Watson Health Cloud; • open sourcing IBM’s existing SystemML machine learning technology;
  35. H2O

    An Extensible Open Source Platform for Analytics • Best of Breed Open Source Technology • Easy-to-use WebUI and Familiar Interfaces • Data Agnostic Support for all Common Database and File Types • Massively Scalable Big Data Analysis • Real-time Data Scoring (Nanofast Scoring Engine) http://www.h2o.ai/
  36. TIBCO Spotfire for Visual Data Discovery

    Let the business user leverage historical data to find insights!
  37. TIBCO Spotfire with R / TERR Integration

    Let the business user leverage Analytic Models (created by the Data Scientist)! Example: Customer Churn with Random Forest Algorithm • Behind the ‘refresh model’ button lives a ‘random forest algorithm’ • It requires no a priori assumptions at all, it just always works • The business user doesn’t need to know what random forest is to be empowered by it • Select variables for the model
  38. SaaS Machine Learning

    • Managed SaaS service for building ML models and generating predictions • Integrated into the corresponding cloud ecosystem • Easy to use, but limited feature set and potential latency issues if combined with external data or applications http://docs.aws.amazon.com/machine-learning/latest/dg/tutorial.html
  39. PMML (Predictive Model Markup Language)

    • XML-based de facto standard to represent predictive analytic models • Developed by the Data Mining Group (DMG) • Easily share models between PMML compliant applications (e.g. between model creation and deployment for operations) http://www.ibm.com/developerworks/library/ba-ind-PMML1/
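    As a rough illustration of why an XML model exchange format helps, the sketch below parses a small hand-written PMML regression model with Python's standard library and scores one record with it, without retraining anything. The PMML document, the field names (`temperature`, `vibration`, `risk`) and the coefficients are invented for this example; a production deployment would use a PMML-compliant scoring engine rather than hand-rolled parsing.

```python
import xml.etree.ElementTree as ET

# Toy PMML document written by hand for this sketch (not from the talk)
PMML_DOC = """
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="toy linear model"/>
  <DataDictionary numberOfFields="3">
    <DataField name="temperature" optype="continuous" dataType="double"/>
    <DataField name="vibration" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="temperature"/>
      <MiningField name="vibration"/>
      <MiningField name="risk" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="temperature" coefficient="0.02"/>
      <NumericPredictor name="vibration" coefficient="1.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_2"}


def load_regression(pmml_text):
    """Read intercept and coefficients from a PMML RegressionModel."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {p.get("name"): float(p.get("coefficient"))
                    for p in table.findall("pmml:NumericPredictor", NS)}
    return intercept, coefficients


def score(record, intercept, coefficients):
    """Apply the linear model to one incoming record (a dict of field values)."""
    return intercept + sum(coef * record[name] for name, coef in coefficients.items())


intercept, coefficients = load_regression(PMML_DOC)
print(score({"temperature": 80.0, "vibration": 0.4}, intercept, coefficients))
```

    The model is built once by whatever tool exports PMML; the consumer only needs the file, which is the "share models between model creation and deployment" point from the slide.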
  40. Agenda

    1) Machine Learning and Big Data Analytics 2) Analysis of Historical Data 3) Real Time Processing 4) Live Demo
  41. Analytics Maturity Model

    [Figure repeated from slide 13]
  42. Streaming Analytics

    Event Streams (events 1 to 9 along a time axis) • Continuous Queries • Sliding Windows • Filter • Aggregation • Correlation • …
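    Three of the operators named on this slide (filter, sliding window, aggregation) can be sketched in plain Python as follows. This is a toy, single-threaded illustration with made-up class and function names; stream processing frameworks and products provide these operators out of the box, together with scaling and fault tolerance.

```python
from collections import deque


def above_threshold(stream, threshold):
    """Filter operator: pass through only events whose value is >= threshold."""
    for timestamp, value in stream:
        if value >= threshold:
            yield timestamp, value


class SlidingWindowAverage:
    """Continuous query: average of the values seen in the last `window` time units."""

    def __init__(self, window):
        self.window = window
        self.events = deque()   # (timestamp, value), oldest first
        self.total = 0.0

    def push(self, timestamp, value):
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have slid out of the window
        while self.events[0][0] <= timestamp - self.window:
            _, expired = self.events.popleft()
            self.total -= expired
        return self.total / len(self.events)


# Hypothetical event stream: (timestamp, sensor value)
stream = [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0), (5, 50.0)]

query = SlidingWindowAverage(window=3)
averages = [query.push(ts, v) for ts, v in above_threshold(stream, 15.0)]
print(averages)
```

    The query state is updated incrementally per event, so each result is available as soon as the event arrives instead of after a batch run over stored data.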
  43. Operational Intelligence in Action

    Actions by Operations: Human decisions in real time informed by up to date information. The Challenge: Empower operations staff to see and seize key business moments. Machine-to-Machine Automation: Automated action based on models of history combined with live context and business rules. The Challenge: Create, understand, and deploy algorithms & rules that automate key business reactions.
  44. Alternatives for Stream Processing

    [Figure labels: OPEN SOURCE, CLOSED SOURCE; PRODUCT, FRAMEWORK (not a complete list); Microsoft Azure Stream Analytics]
  45. What Streaming Alternative do you need?

    • Visual IDE (Dev, Test, Debug) • Simulation (Feed Testing, Test Generation) • Live UI (monitoring, proactive interaction) • Maturity (24/7 support, consulting) • Integration (out-of-the-box: ESB, MDM, etc.) • Library (Java, .NET, Python) • Query Language (often similar to SQL) • Scalability (horizontal and vertical, fail over) • Connectivity (technologies, markets, products) • Operators (Filter, Sort, Aggregate) [Figure labels: Streaming Concepts, Streaming Frameworks, Streaming Products; Time to Market: Slow, Fast]
  46. Comparison of Stream Processing Frameworks and Products

    Slide Deck from JavaOne 2016: http://www.kai-waehner.de/blog/2016/10/25/comparison-of-stream-processing-frameworks-and-products/
  47. StreamBase: The Power of Visual Programming

    1) Get ideas into market in days or weeks, not months or years 2) Unlock the power of IT and data scientists working together
  48. How to apply analytic models to real time processing without rebuilding them?
  49. Streaming Analytics to operationalize insights and patterns in real time without rebuilding the models

    Stream Processing • H2O • Open Source R • TERR • Spark MLlib • MATLAB • SAS • PMML. Real Time Closed Loop: Understand – Anticipate – Act
  50. Agenda

    1) Machine Learning and Big Data Analytics 2) Analysis of Historical Data 3) Real Time Processing 4) Live Demo
  51. “An outage on one well can cost $10M per hour. We have 20-100 outages per year.”

    - Drilling operations VP, major oil company
  52. Live Surveillance of Equipment

    Data Monitoring • Motor temperature • Motor vibration • Current • Intake pressure • Intake temperature • Flow. Pump Components • Electrical power cable • Pump • Intake • Protector • ESP motor • Pump monitoring unit
  53. Predictive Analytics (Fault Management)

    Voltage • Temperature • Vibration • Device history. Temporal analytic: “If vibration spike is followed by temp spike then voltage spike [within 4 hours] then flag high severity alert.”
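    The temporal rule quoted on this slide can be expressed as a small sequence matcher. The sketch below is one possible reading of the rule, assuming that spike detection has already happened upstream and that events arrive ordered by time as `(time_in_hours, spike_kind)` pairs; the names and the event format are illustrative, not StreamBase code.

```python
SEQUENCE = ("vibration", "temperature", "voltage")
WINDOW_HOURS = 4.0


def detect_alerts(events, sequence=SEQUENCE, window=WINDOW_HOURS):
    """Return the times at which the full spike sequence completed
    within `window` hours of its first spike."""
    alerts = []
    partial = []  # open matches as (start_time, index_of_next_expected_spike)
    for t, kind in events:
        still_open = []
        completed = False
        for start, idx in partial:
            if t - start > window:
                continue                    # match expired, drop it
            if kind == sequence[idx]:
                idx += 1                    # next expected spike arrived
            if idx == len(sequence):
                completed = True            # whole sequence seen in time
            else:
                still_open.append((start, idx))
        if kind == sequence[0]:
            still_open.append((t, 1))       # every first spike opens a new match
        if completed:
            alerts.append(t)                # at most one alert per event
        partial = still_open
    return alerts


# Hypothetical spike events: the first sequence completes after 2.5 hours,
# the second one takes 5 hours and therefore does not raise an alert.
events = [
    (0.0, "vibration"), (1.0, "temperature"), (2.5, "voltage"),
    (10.0, "vibration"), (11.0, "temperature"), (15.0, "voltage"),
]
print(detect_alerts(events))
```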
  54. Predictive Maintenance

    [Architecture diagram; labels as extracted: Operational Analytics Operations Live UI SENSOR DATA TRANSACTIONS MESSAGE BUS MACHINE DATA SOCIAL DATA Streaming Analytics Action Aggregate Rules Stream Processing Analytics Correlate Live Monitoring Continuous query processing Alerts Manual action, escalation HISTORICAL ANALYSIS Data Sheets BI Data Scientists Cleansed Data History Data Discovery Analytics Enterprise Service Bus ERP MDM DB WMS SOA Data Storage Internal Data Integration Bus API Event Server Spark Big Data] Call-outs: Machine Data (Sensors, Weather Data, …) • Take Action (Stop Machine, Send Mechanic, …) • Find Insights (Sensor Behaviour, Hardware Issues, …) • ERP System (Transaction History, Production Volume)
  55. Complete Big Data Architecture

    [Architecture diagram; same labels as slide 54, without the call-outs]
  56. Create a Model

    Find Leading Indicators • Backtest Rules / Models • Push Rules / Models to StreamBase
  57. Real Time Analytics

    Trend Analysis, Combination of Rules, CUSUM Analysis, Statistical Analysis, Statistical Process Control, Machine Learning • Location Change – Variable moves up or down • Slope Change – Variable changes trend • Variance Change – Variable becomes more/less volatile • Process Threshold – Shewhart control chart • Failure Model y (0/1) = f (X, b) + e; f = logistic regression, trees, svm, nnet, ...
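    Two of the techniques listed here are easy to contrast in code: a Shewhart-style process threshold only looks at each point in isolation, while a CUSUM accumulates small deviations and therefore catches a modest but sustained location change. The following is a minimal sketch with made-up sensor readings and parameter values (`target`, `slack`, `threshold` are assumptions chosen for the example, not values from the talk).

```python
def shewhart_violations(values, mean, sigma, k=3.0):
    """Process threshold: indices of points outside mean +/- k * sigma."""
    return [i for i, v in enumerate(values) if abs(v - mean) > k * sigma]


def cusum_upward_shift(values, target, slack, threshold):
    """One-sided CUSUM: accumulate deviations above target + slack and
    return the first index where the sum exceeds threshold (or None)."""
    s = 0.0
    for i, v in enumerate(values):
        s = max(0.0, s + (v - target - slack))
        if s > threshold:
            return i
    return None


# Hypothetical readings: the level moves from about 10 to about 11 at index 4
readings = [10.0, 10.2, 9.8, 10.1, 11.0, 11.2, 10.9, 11.1, 11.0]

# No single point leaves the 3-sigma band (10 +/- 1.5) ...
print(shewhart_violations(readings, mean=10.0, sigma=0.5))

# ... but the accumulated deviation flags the location change
print(cusum_upward_shift(readings, target=10.0, slack=0.5, threshold=2.0))
```

    In a streaming deployment, the CUSUM state `s` would be kept per sensor and updated on every incoming event, which matches the idea of pushing rules and models found in historical analysis into the stream processing layer.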
  58. StreamBase – from Big Data to Fast Data

    1. Rules / models pushed from Spotfire 2. Data streams into StreamBase 3. Data evaluated in real-time 4. Spotfire RCA on trigger. Other notifications available. Live view on streaming data.
  59. Compare Live Data with Historical Data to make Human Decision

    Responsible engineer clicks URL to launch Spotfire Root Cause Analysis; diagnose issue
  60. Key Take-Aways

    • Insights are hidden in Historical Data on Big Data Platforms • Machine Learning and Big Data Analytics find these Insights by building Analytics Models • Event Processing uses these Models (without Rebuilding) to take Action in Real Time