How to Apply Machine Learning with R, H2O, Apache Spark MLlib or PMML to Real Time Streaming Analytics

HOW TO APPLY BIG DATA ANALYTICS AND MACHINE LEARNING TO REAL TIME PROCESSING
Kai Wähner
@KaiWaehner
www.kai-waehner.de
LinkedIn / Xing: Please connect!

Digital Transformation - Physical and Digital Worlds are Merging
© Copyright 2000-2016 TIBCO Software Inc.

Apply Big Data Analytics to Real Time Processing

Analyse and Act on Critical Business Moments

Key Take-Aways
• Insights are hidden in Historical Data on Big Data Platforms
• Machine Learning and Big Data Analytics find these Insights by building Analytics Models
• Event Processing uses these Models (without Rebuilding) to take Action in Real Time

Agenda
1) Machine Learning and Big Data Analytics
2) Analysis of Historical Data
3) Real Time Processing
4) Live Demo

Machine Learning
Machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look.
http://www.sas.com

10 Examples of Machine Learning
• Spam Detection
• Credit Card Fraud Detection
• Digit Recognition
• Speech Understanding
• Face Detection
• Shape Detection
• Product Recommendation
• Medical Diagnosis
• Stock Trading
• Customer Segmentation
http://machinelearningmastery.com/practical-machine-learning-problems/

• Spam Detection: Given email in an inbox, identify those email messages that are spam and those that are not. Having a model of this problem would allow a program to leave non-spam emails in the inbox and move spam emails to a spam folder. We should all be familiar with this example.
• Credit Card Fraud Detection: Given credit card transactions for a customer in a month, identify those transactions that were made by the customer and those that were not. A program with a model of this decision could refund those transactions that were fraudulent.
• Digit Recognition: Given zip codes handwritten on envelopes, identify the digit for each handwritten character. A model of this problem would allow a computer program to read and understand handwritten zip codes and sort envelopes by geographic region.
• Speech Understanding: Given an utterance from a user, identify the specific request made by the user. A model of this problem would allow a program to understand and make an attempt to fulfil that request. The iPhone with Siri has this capability.
• Face Detection: Given a digital photo album of many hundreds of digital photographs, identify those photos that include a given person. A model of this decision process would allow a program to organize photos by person. Some cameras and software like iPhoto have this capability.
http://machinelearningmastery.com/practical-machine-learning-problems/

• Product Recommendation: Given a purchase history for a customer and a large inventory of products, identify those products in which that customer will be interested and likely to purchase. A model of this decision process would allow a program to make recommendations to a customer and motivate product purchases. Amazon has this capability. Also think of Facebook and GooglePlus, which recommend users to connect with after you sign up.
• Medical Diagnosis: Given the symptoms exhibited in a patient and a database of anonymized patient records, predict whether the patient is likely to have an illness. A model of this decision problem could be used by a program to provide decision support to medical professionals.
• Stock Trading: Given the current and past price movements for a stock, determine whether the stock should be bought, held or sold. A model of this decision problem could provide decision support to financial analysts.
• Customer Segmentation: Given the pattern of behaviour by a user during a trial period and the past behaviours of all users, identify those users that will convert to the paid version of the product and those that will not. A model of this decision problem would allow a program to trigger customer interventions to persuade the customer to convert early or better engage in the trial.
• Shape Detection: Given a user hand drawing a shape on a touch screen and a database of known shapes, determine which shape the user was trying to draw. A model of this decision would allow a program to show the platonic version of the shape the user drew to make crisp diagrams. The Instaviz iPhone app does this.
http://machinelearningmastery.com/practical-machine-learning-problems/

Types of Machine Learning Problems (no complete list!)
• Classification: Data is labelled, meaning it is assigned a class, for example spam / non-spam or fraud / non-fraud.
• Regression: Data is labelled with a real value (think floating point) rather than a label. Examples that are easy to understand are time series data like the price of a stock over time.
• Clustering: Data is not labelled, but can be divided into groups based on similarity and other measures of natural structure in the data. An example would be organising pictures by faces without names.
• Rule Extraction: Data is used as the basis for the extraction of propositional rules (antecedent/consequent aka if-then). An example is the discovery of the relationship between the purchase of beer and diapers.
http://machinelearningmastery.com/practical-machine-learning-problems/
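
To make the rule-extraction bullet concrete, here is a minimal sketch that scores one candidate if-then rule (beer, then diapers) by its support and confidence. The baskets and the helper name `rule_stats` are made up for illustration; a real rule miner (for example Apriori) would also search for candidate rules rather than being handed one.

```python
# Made-up shopping baskets; the items are illustrative only.
BASKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule: antecedent -> consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    with_both = [b for b in with_antecedent if consequent in b]
    # Support: share of all baskets that contain both items.
    support = len(with_both) / len(baskets)
    # Confidence: share of antecedent baskets that also contain the consequent.
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


support, confidence = rule_stats(BASKETS, "beer", "diapers")
print(f"beer -> diapers: support={support:.2f}, confidence={confidence:.2f}")
```

A rule is usually kept only if both numbers clear thresholds chosen by the analyst.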

Closed Loop for Big Data Analytics
• MODEL: Develop model; deploy into Stream Processing flow
• ACT: Automatically monitor real-time transactions; automatically trigger action
• ANALYZE: Analyze data via Data Discovery; uncover patterns, trends, correlations

Analytics Maturity Model
[Figure: Value to the Organization (Immediate to Long-Term Competitive Advantage) against Analytics Maturity (Measure, Diagnose, Predict, Optimize, Operationalize, Automate), covering Self-service Dashboards, Event Processing, and Predictive and Prescriptive Analytics]
A good Big Data Analytics platform can provide value to the organization across the full spectrum of use cases.

What is Predictive Analytics?

Data Munging / Wrangling / Mash-up

Data Munging - Transformations

     cust_id  dept  sku    dollar  gift   date
  1  104      C     12003   2.40   FALSE  2016-10-17
  2  105      A     12005  62.85   FALSE  2016-10-17
  3  102      C     12007  69.23   TRUE   2016-10-17
  4  104      B     12004   9.33   FALSE  2016-10-18
  5  105      C     12010  14.16   TRUE   2016-10-18
  6  101      B     12003  90.43   FALSE  2016-10-19
  7  103      C     12005  90.97   FALSE  2016-10-19
  n  …        …     …      …       …      …

     cust_id  A      B      C      total   # orders  first_date  last_date
  1  100      21.76  23.67   0.00   45.43  2         2016-10-19  2016-10-20
  2  101       0.01  74.65   0.00   74.66  3         2016-10-19  2016-10-20
  3  102       0.00  60.92  50.29  111.21  6         2016-10-17  2016-10-20
  4  103       0.00   0.00  52.30   52.30  2         2016-10-19  2016-10-20
  5  104      31.34   9.33   2.40   43.06  4         2016-10-17  2016-10-20
  6  105      62.85   0.00  56.00  118.85  3         2016-10-17  2016-10-20
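
The transformation from order lines to one row per customer can be sketched in a few lines. This is an illustrative sketch only: it uses just the seven order lines visible on the slide, so its totals will not match the second table, which was evidently built from more rows. The function name `summarize` and the dictionary layout are assumptions made for this example.

```python
# Order lines copied from the visible rows of the first table:
# (cust_id, dept, sku, dollar, gift, date)
ORDERS = [
    (104, "C", 12003, 2.40, False, "2016-10-17"),
    (105, "A", 12005, 62.85, False, "2016-10-17"),
    (102, "C", 12007, 69.23, True, "2016-10-17"),
    (104, "B", 12004, 9.33, False, "2016-10-18"),
    (105, "C", 12010, 14.16, True, "2016-10-18"),
    (101, "B", 12003, 90.43, False, "2016-10-19"),
    (103, "C", 12005, 90.97, False, "2016-10-19"),
]


def summarize(orders):
    """Pivot order lines into one summary row per customer."""
    summary = {}
    for cust_id, dept, _sku, dollar, _gift, date in orders:
        row = summary.setdefault(cust_id, {
            "A": 0.0, "B": 0.0, "C": 0.0, "total": 0.0,
            "orders": 0, "first_date": date, "last_date": date,
        })
        row[dept] += dollar          # spend per department
        row["total"] += dollar       # overall spend
        row["orders"] += 1           # number of order lines
        # ISO dates compare correctly as plain strings.
        row["first_date"] = min(row["first_date"], date)
        row["last_date"] = max(row["last_date"], date)
    return summary


for cust_id, row in sorted(summarize(ORDERS).items()):
    print(cust_id, row)
```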

Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
1. maximize insight into a data set
2. uncover underlying structure
3. extract important variables
4. detect outliers and anomalies
5. test underlying assumptions
6. develop parsimonious models
7. determine optimal factor settings

Exploratory Data Analysis
“The greatest value of a picture is when it forces us to notice what we never expected to see”
John W. Tukey, 1977

Visual Analytics - Interactive Brush-Linked

What is Predictive Analytics?

Which picture represents a model?
A model is a simplification of the truth that helps you with decision making.

Model Building
Supervised Models – known, labeled responses
• Regression (for example Linear Regression)
• Categorical (for example Random Forest)
Unsupervised Models – no labeled responses
• Clustering (for example k-means clustering)
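
The difference between the two model families can be shown with a small sketch, assuming made-up numbers: a least-squares line fitted to labeled pairs (supervised) and a one-dimensional k-means that groups unlabeled points (unsupervised). The helper names `fit_line` and `kmeans_1d` are invented for this example and stand in for what R, Spark MLlib or H2O would provide.

```python
from statistics import mean


def fit_line(xs, ys):
    """Supervised: ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def kmeans_1d(points, centers, iterations=10):
    """Unsupervised: plain k-means on one-dimensional points."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers


# Labeled data: the response y is known for every x (made-up values).
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0, 5.0], [3.0, 5.0, 7.0, 9.0, 11.0])
print("prediction for x=6:", slope * 6.0 + intercept)

# Unlabeled data: only the points themselves are known (made-up values).
print("cluster centers:", kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.5], [0.0, 5.0]))
```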

Model Building
Employees who write longer emails earn higher salaries!

Model Validation
How is the IQ of a kid related to the IQ of his / her mum?

What tools do Data Scientists use?

Data Scientists work with many Tools
• SQL
• Excel
• Python
• R
Source: O’Reilly 2015 Data Science Salary Survey
http://duu86o6n09pv.cloudfront.net/reports/2015-data-science-salary-survey.pdf

Alternatives for Data Scientists (no complete list)
[Figure: tools grouped by Open Source / Closed Source and Tooling / Source Code; the only name recoverable as text is R]

R Language
R is well known as the most popular programming language used by data scientists for modeling, and its popularity keeps growing. It is developing very rapidly and has a very active community.

R with Revolution Analytics (now Microsoft)
Open Source GPL License (including its restrictions)
http://www.revolutionanalytics.com/webinars/introducing-revolution-r-open-enhanced-open-source-r-distribution-revolution-analytics

TERR - TIBCO’s Enterprise Runtime for R
• TIBCO has rewritten R as a Commercial Compute Engine
• Latest statistics scripting engine: S → S-PLUS® → R → TERR
• Runs R code including CRAN packages
• Engine internals rebuilt from scratch at low-level
• Redesigned data objects, memory management
• High performance + Big Data
• TERR is licensed from TIBCO
• TERR installs (free) with Spotfire Analyst / Desktop + other TIBCO products
• Spotfire Server can manage all TERR / R scripts, artifacts for reuse
• Standalone Developer Edition
• Supported by TIBCO
• No GPL license issues

Which R to use?
http://www.forbes.com/sites/danwoods/2016/01/27/microsofts-revolution-analytics-acquisition-is-the-wrong-way-to-embrace-r/

Apache Spark
General data-processing framework. However, the focus is especially on analytics (at least these days).
http://fortune.com/2016/09/09/cloudera-spark-mapreduce/

Spark MLlib
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs. You can even combine the MLlib module with the R language.

Why is Spark used for Analytics?

Apache Spark – Focus on Analytics
http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/
http://fortune.com/2016/09/09/cloudera-spark-mapreduce/
http://www.ebaytechblog.com/2016/05/28/using-spark-to-ignite-data-analytics/
http://www.forbes.com/sites/paulmiller/2016/06/15/ibm-backs-apache-spark-for-big-data-analytics/
“[IBM’s initiatives] include:
• deepening the integration between Apache Spark and existing IBM products like the Watson Health Cloud;
• open sourcing IBM’s existing SystemML machine learning technology;

H2O
An Extensible Open Source Platform for Analytics
• Best of Breed Open Source Technology
• Easy-to-use WebUI and Familiar Interfaces
• Data Agnostic Support for all Common Database and File Types
• Massively Scalable Big Data Analysis
• Real-time Data Scoring (Nanofast Scoring Engine)
http://www.h2o.ai/

TIBCO Spotfire for Visual Data Discovery
Let the business user leverage historical data to find insights!

TIBCO Spotfire with R / TERR Integration
Let the business user leverage Analytic Models (created by the Data Scientist)!
Example: Customer Churn with Random Forest Algorithm
• Behind the ‘refresh model’ button lives a ‘random forest algorithm’
• It requires no a priori assumptions at all; it just always works
• The business user doesn’t need to know what random forest is to be empowered by it
Select variables for the model

SaaS Machine Learning
• Managed SaaS service for building ML models and generating predictions
• Integrated into the corresponding cloud ecosystem
• Easy to use, but limited feature set and potential latency issues if combined with external data or applications
http://docs.aws.amazon.com/machine-learning/latest/dg/tutorial.html

PMML (Predictive Model Markup Language)
• XML-based de facto standard to represent predictive analytic models
• Developed by the Data Mining Group (DMG)
• Easily share models between PMML compliant applications (e.g. between model creation and deployment for operations)
http://www.ibm.com/developerworks/library/ba-ind-PMML1/
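
To illustrate how a model travels from creation to deployment as XML, the following sketch reads a tiny regression model and scores one record with it. The document is a hypothetical, heavily simplified PMML fragment: real PMML files also declare the DMG namespace, a Header, a DataDictionary and a MiningSchema, and a production deployment would use a PMML-compliant scoring engine instead of hand-written parsing. The field names and coefficients are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified PMML fragment (namespace, Header, DataDictionary
# and MiningSchema are omitted to keep the sketch short).
PMML_DOC = """
<PMML version="4.2">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="temperature" coefficient="0.2"/>
      <NumericPredictor name="vibration" coefficient="3.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_regression(pmml_text):
    """Extract the intercept and coefficients from the regression table."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("NumericPredictor")
    }
    return intercept, coefficients


def score(model, record):
    """Apply the linear model to one input record (a dict of field values)."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_regression(PMML_DOC)
print(score(model, {"temperature": 80.0, "vibration": 0.5}))
```

The model is parsed once and then reused for every record, which is the property that lets a stream processor apply it without rebuilding it.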

Streaming Analytics
[Figure: Event Streams along a time axis (1 to 9)]
• Continuous Queries
• Sliding Windows
• Filter
• Aggregation
• Correlation
• …
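
Three of the concepts listed above (filter, sliding window, aggregation) can be combined into one small continuous query. This is a minimal sketch in plain Python, not the query language of any particular streaming product; the class name, the function name and the `min_value` filter are assumptions made for illustration.

```python
from collections import deque


class SlidingWindowAverage:
    """Keeps the last `size` values and reports their running average."""

    def __init__(self, size):
        self.window = deque(maxlen=size)  # old values fall out automatically

    def push(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)


def continuous_query(events, size=3, min_value=0.0):
    """Filter out events below min_value, then emit one aggregate per event."""
    window = SlidingWindowAverage(size)
    for value in events:
        if value < min_value:      # filter step
            continue
        yield window.push(value)   # aggregation over the sliding window


# The generator emits a result as each event arrives, without waiting
# for the stream to end.
for average in continuous_query([1, 2, 3, 4, 5, 6, 7, 8, 9], size=3):
    print(average)
```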

Operational Intelligence in Action
Actions by Operations: Human decisions in real time informed by up to date information
The Challenge: Empower operations staff to see and seize key business moments
Machine-to-Machine Automation: Automated action based on models of history combined with live context and business rules
The Challenge: Create, understand, and deploy algorithms & rules that automate key business reactions

What is Prescriptive Analytics?

Alternatives for Stream Processing (no complete list!)
[Figure: offerings grouped by OPEN SOURCE / CLOSED SOURCE and PRODUCT / FRAMEWORK; the only name recoverable as text is Microsoft Azure Stream Analytics]

What Streaming Alternative do you need?
[Figure: Streaming Concepts, Streaming Frameworks and Streaming Products compared by Time to Market (Slow to Fast)]
• Visual IDE (Dev, Test, Debug)
• Simulation (Feed Testing, Test Generation)
• Live UI (monitoring, proactive interaction)
• Maturity (24/7 support, consulting)
• Integration (out-of-the-box: ESB, MDM, etc.)
• Library (Java, .NET, Python)
• Query Language (often similar to SQL)
• Scalability (horizontal and vertical, fail over)
• Connectivity (technologies, markets, products)
• Operators (Filter, Sort, Aggregate)

Comparison of Stream Processing Frameworks and Products
Slide Deck from JavaOne 2016: http://www.kai-waehner.de/blog/2016/10/25/comparison-of-stream-processing-frameworks-and-products/

StreamBase: The Power of Visual Programming
1) Get ideas into market in days or weeks, not months or years
2) Unlock the power of IT and data scientists working together

Live Datamart
• Dynamic aggregation
• Live visualization
• Ad-hoc continuous query
• Alerts
• Action

Streaming Analytics to operationalize insights and patterns in real time without rebuilding the models
Stream Processing with models from: H2O, Open Source R, TERR, Spark MLlib, MATLAB, SAS, PMML
Real Time Closed Loop: Understand – Anticipate – Act
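
The idea of applying a model in the stream without rebuilding it can be sketched as follows: the model is trained elsewhere, exported once, loaded once, and then only evaluated per event. Everything specific here is hypothetical: the JSON layout, the feature names, the weights and the 0.5 threshold are made up for a churn-style example and do not come from any of the tools named on the slide.

```python
import json
import math

# Hypothetical artifact exported by an offline training step.
MODEL_JSON = (
    '{"intercept": -4.0, '
    '"weights": {"days_inactive": 0.1, "support_tickets": 0.5}}'
)


def load_model(text):
    """Parse the exported model once and return a scoring function."""
    spec = json.loads(text)

    def score(event):
        z = spec["intercept"] + sum(
            weight * event[name] for name, weight in spec["weights"].items()
        )
        return 1.0 / (1.0 + math.exp(-z))  # logistic function

    return score


def process(stream, score, threshold=0.5):
    """Score every event as it arrives and emit an action for risky ones."""
    for event in stream:
        probability = score(event)
        if probability >= threshold:
            yield event["customer"], round(probability, 3)


score = load_model(MODEL_JSON)  # loaded once, reused for all events
stream = [
    {"customer": "c1", "days_inactive": 5, "support_tickets": 0},
    {"customer": "c2", "days_inactive": 30, "support_tickets": 4},
]
for customer, probability in process(stream, score):
    print("trigger retention offer for", customer, probability)
```

Retraining happens on the historical side only; swapping in a new model means replacing the exported artifact, not changing the stream logic.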

TIBCO StreamBase + R / TERR

TIBCO StreamBase + H2O

TIBCO StreamBase + PMML

Real World Application - Customer Churn

Insight to Action – Closing the Loop
• BIG DATA AT REST
• FAST DATA IN MOTION

Live Surveillance of Equipment
Data Monitoring
• Motor temperature
• Motor vibration
• Current
• Intake pressure
• Intake temperature
• Flow
Pump Components
• Electrical power cable
• Pump
• Intake
• Protector
• ESP motor
• Pump monitoring unit

Predictive Analytics (Fault Management)
Voltage, Temperature, Vibration, Device history
Temporal analytic: “If vibration spike is followed by temp spike then voltage spike [within 4 hours] then flag high severity alert.”
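
The temporal rule quoted above can be sketched as a small state machine over a stream of spike events. This is one possible reading, stated as an assumption: the whole sequence (vibration, then temperature, then voltage) must complete within 4 hours of the vibration spike. The function name, the event format and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

PATTERN = ("vibration", "temperature", "voltage")
WINDOW = timedelta(hours=4)  # assumed to span the whole sequence


def detect_alerts(spikes):
    """spikes: (timestamp, kind) pairs in time order.

    Returns the timestamps at which a high severity alert fires.
    """
    alerts = []
    partial = []  # open matches: (start_time, index of next expected kind)
    for ts, kind in spikes:
        still_open = []
        fired = False
        for start, idx in partial:
            if ts - start > WINDOW:
                continue                          # match expired, drop it
            if kind != PATTERN[idx]:
                still_open.append((start, idx))   # keep waiting
            elif idx + 1 < len(PATTERN):
                still_open.append((start, idx + 1))  # advance one step
            else:
                fired = True                      # full sequence seen
        if kind == PATTERN[0]:
            still_open.append((ts, 1))            # a new sequence starts
        if fired:
            alerts.append(ts)
        partial = still_open
    return alerts


t0 = datetime(2016, 10, 17, 8, 0)
events = [
    (t0, "vibration"),
    (t0 + timedelta(hours=1), "temperature"),
    (t0 + timedelta(hours=3), "voltage"),
]
print(detect_alerts(events))
```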

Predictive Maintenance
[Architecture diagram. Data sources: sensor data, transactions, message bus, machine data, social data. Streaming Analytics: stream processing (aggregate, rules, analytics, correlate), continuous query processing, live monitoring, alerts, action. Operational Analytics: Operations Live UI, manual action, escalation. Historical Analysis: data storage (Spark, Big Data), cleansed data, history, data discovery, analytics, data sheets, BI, data scientists. Integration: Enterprise Service Bus, internal data integration bus, API, event server, ERP, MDM, DB, WMS, SOA.]
• Machine Data (Sensors, Weather Data, …)
• ERP System (Transaction History, Production Volume)
• Find Insights (Sensor Behaviour, Hardware Issues, …)
• Take Action (Stop Machine, Send Mechanic, …)

Complete Big Data Architecture
[Architecture diagram. Data sources: sensor data, transactions, message bus, machine data, social data. Streaming Analytics: stream processing (aggregate, rules, analytics, correlate), continuous query processing, live monitoring, alerts, action. Operational Analytics: Operations Live UI, manual action, escalation. Historical Analysis: data storage (Spark, Big Data), cleansed data, history, data discovery, analytics, data sheets, BI, data scientists. Integration: Enterprise Service Bus, internal data integration bus, API, event server, ERP, MDM, DB, WMS, SOA.]

Leading Indicators Pump Failure

Real Time Analytics
Trend Analysis, Combination of Rules, CUSUM Analysis, Statistical Analysis, Statistical Process Control, Machine Learning
• Location Change – Variable moves up or down
• Slope Change – Variable changes trend
• Variance Change – Variable becomes more/less volatile
• Process Threshold – Shewhart control chart
• Failure Model – y (0/1) = f (X, b) + e; f = logistic regression, trees, svm, nnet, ...
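
Two of the techniques named above can be contrasted in a short sketch: a Shewhart-style threshold check flags single readings far from the mean, while a CUSUM accumulates small deviations and so catches a sustained location change. The parameter names (`target`, `slack`, `threshold`, `k`) and the readings are assumptions chosen for illustration, not values from the deck.

```python
def cusum_alarms(values, target, slack, threshold):
    """Two-sided CUSUM: return indices where the cumulative drift is too large."""
    high = low = 0.0
    alarms = []
    for i, x in enumerate(values):
        # Accumulate deviations beyond the allowed slack, never below zero.
        high = max(0.0, high + (x - target - slack))
        low = max(0.0, low + (target - x - slack))
        if high > threshold or low > threshold:
            alarms.append(i)
            high = low = 0.0  # restart after raising an alarm
    return alarms


def shewhart_violations(values, mean, sigma, k=3.0):
    """Shewhart-style check: indices of readings more than k sigma from the mean."""
    return [i for i, x in enumerate(values) if abs(x - mean) > k * sigma]


# Made-up readings: the level shifts from 10 to 11 and stays there.
readings = [10.0, 10.0, 10.0, 11.0, 11.0, 11.0, 11.0, 11.0]

print("CUSUM alarms:", cusum_alarms(readings, target=10.0, slack=0.5, threshold=2.0))
print("Shewhart violations:", shewhart_violations(readings, mean=10.0, sigma=1.0))
```

With these numbers the shift is too small for the three-sigma check to notice, but the CUSUM raises an alarm once the deviation has persisted for several readings.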

Put model into Action
Upon event trigger, populate Spotfire RCA template; email responsible engineer

StreamBase – from Big Data to Fast Data
1. Rules / models pushed from Spotfire
2. Data streams into StreamBase
3. Data evaluated in real-time
4. Spotfire RCA on trigger
Other notifications available
Live view on streaming data

Live View of the Situation + Proactive Actions

Compare Live Data with Historical Data to make Human Decision
Responsible engineer clicks URL to launch Spotfire Root Cause Analysis; diagnose issue

Live Demo
TIBCO Spotfire + StreamBase + TERR + Live Datamart

Key Take-Aways
• Insights are hidden in Historical Data on Big Data Platforms
• Machine Learning and Big Data Analytics find these Insights by building Analytics Models
• Event Processing uses these Models (without Rebuilding) to take Action in Real Time

Questions? Please contact me!
Kai Wähner
@KaiWaehner
www.kai-waehner.de
LinkedIn / Xing: Please connect!
