Streaming Analytics - Comparison of Open Source Frameworks and Products

Kai Waehner
September 24, 2015

Stream Processing is a concept used to create a high-performance system for rapidly building applications that analyze and act on real-time streaming data. Benefits, amongst others, are faster processing and reaction to real-time complex event streams and the flexibility to quickly adapt to changing business and analytic needs. Big data, cloud, mobile and internet of things are the major drivers for stream processing and streaming analytics.

This session discusses the technical concepts of stream processing and how it is related to big data, mobile, cloud and internet of things. Different use cases such as predictive fault management or fraud detection are used to show and compare alternative frameworks and products for stream processing and streaming analytics.

The audience will understand when to use open source frameworks such as Apache Storm, Apache Spark or Esper, and when to use engines from software vendors such as IBM InfoSphere Streams or TIBCO StreamBase. Live demos will give the audience a good feel for how to use these frameworks and tools.

The session will also discuss how stream processing relates to Hadoop and to statistical analysis with software such as SAS, Apache Spark’s MLlib or the R language.

Transcript

  1. Fast Data and Streaming Analytics in the Era of Hadoop, R and Apache Spark. Kai Wähner, @KaiWaehner, www.kai-waehner.de, LinkedIn / Xing: please connect!
  2. Key Messages – Streaming Analytics processes Data while it is in Motion! – Automation and Proactive Human Interaction are BOTH needed! – Time to Market is the Key Requirement for most Use Cases!
  3. Agenda – Real World Use Cases – Introduction to Stream

    Processing – Market Overview – Relation to other Big Data Components
  4. Agenda – Real World Use Cases – Introduction to Stream

    Processing – Market Overview – Relation to other Big Data Components
  5. Find and Act on “Critical Business Moments”. “Business Moments” occur in Every Facet of Enterprise Operations, they drive competitive differentiation, customer satisfaction and business success! • Optimize pricing • Identify fraud • Make cross-sell offers • Restock inventory • Reroute trucks • Deliver proactive customer service • Predict equipment failure & fix proactively • Anticipate and handle disruptions. © Copyright 2015 TIBCO Software Inc.
  6. Operational Intelligence in Action. Actions by Operations: human decisions in real time informed by up-to-date information. The Challenge: empower operations staff to see and seize key business moments. Machine-to-Machine Automation: automated action based on models of history combined with live context and business rules. The Challenge: create, understand, and deploy algorithms & rules that automate key business reactions.
  7. “An outage on one well can cost $10M per hour. We have 20-100 outages per year.” - Drilling operations VP, major oil company
  8. Predictive Analytics (Fault Management): Electric Submersible Pumps (ESP). Data Monitoring: • Motor temperature • Motor vibration • Current • Intake pressure • Intake temperature • Flow. Diagram labels: electrical power cable, pump, intake, protector, ESP motor, pump monitoring unit.
  9. Predictive Analytics (Fault Management). Inputs: voltage, temperature, vibration, device history. Temporal analytic: “If vibration spike is followed by temp spike then voltage spike [within 12 minutes] then flag high severity alert.”
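
The temporal rule on this slide can be sketched in plain Python. This is only an illustration, not StreamBase code: the `Spike` record, the `detect_alerts` function and the reading of "within 12 minutes" as "the whole sequence, from vibration spike to voltage spike, fits into 12 minutes" are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

WINDOW_SECONDS = 12 * 60  # "within 12 minutes" (assumed to span the whole sequence)


@dataclass(frozen=True)
class Spike:
    """Hypothetical spike event; the slide does not define an event format."""
    device: str
    kind: str   # "vibration", "temperature" or "voltage"
    ts: float   # event time in seconds


def detect_alerts(spikes: Iterable[Spike]) -> List[Tuple[str, float]]:
    """Return (device, alert time) for each vibration -> temperature -> voltage sequence."""
    last_vibration: Dict[str, float] = {}  # device -> time of latest vibration spike
    chain_start: Dict[str, float] = {}     # device -> vibration time confirmed by a temp spike
    alerts: List[Tuple[str, float]] = []
    for s in sorted(spikes, key=lambda e: e.ts):
        if s.kind == "vibration":
            last_vibration[s.device] = s.ts
        elif s.kind == "temperature":
            start = last_vibration.get(s.device)
            if start is not None and s.ts - start <= WINDOW_SECONDS:
                chain_start[s.device] = start
        elif s.kind == "voltage":
            start = chain_start.pop(s.device, None)
            if start is not None and s.ts - start <= WINDOW_SECONDS:
                alerts.append((s.device, s.ts))      # high severity alert
                last_vibration.pop(s.device, None)   # start over for this device
    return alerts
```

A streaming product would typically express the same rule declaratively as a pattern over a time window; the sketch only makes the per-device state explicit.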
  10. Live Surveillance of Equipment. Continuous, live geospatial display of pump health and predictive signal breaches. Alerts based on predictive signals. Compare live readings and signals to historical averages and means. Continuous, live visualization of stats for 100s of wells.
  11. IoT for High Tech Manufacturing Yield Optimization © Copyright 2000-2014

    TIBCO Software Inc. • Before: Solar Panel Manufacturer with No Unified View of Manufacturing Process – Multiple manufacturing facilities, multiple processes – no way to compare production to yield expectations • Negative Consequences: Sub-Optimal Production – Operations are sub-optimal: high tolerance leads to better yield but less output; tight tolerance means high throughput but lower yield • Business Outcome: Higher Yield and More Runs – Process Manufacturing can run tighter tolerances and adjust them mid-run, predicting yield and adjusting to changing variables – Systems proactively re-route high-value customers around affected network areas in real-time • How We Do It: The TIBCO Fast Data Platform – IoT, Spotfire, StreamBase, and TERR for predictive modeling, high-speed network by TIBCO “For every 1% increase in shipped product, we make $11MM in profit. The demand is there, we just need to fulfill it.” - Head of Quality, Solar Panel Manufacturer
  12. High Tech Manufacturing Yield Optimization © Copyright 2000-2014 TIBCO Software

    Inc. Live streaming datamart analysis Continuous update and exploration of top yield metrics; take action
  13. High Tech Manufacturing Yield Optimization © Copyright 2000-2014 TIBCO Software

    Inc. Continuously computed real-time analytics on streams by StreamBase (thresholds, min / max, average) Analysis, alerts and triggers are based on streaming analytics
  14. High Tech Manufacturing Yield Optimization © Copyright 2000-2014 TIBCO Software

    Inc. Manufacturing operations staff drill down on any machine, any time, to inspect and fix problems before they impact yield
  15. Challenges of the 21st Century Retailer • Retailing and Retail

    Challenges are changing • Consumers expect better and integrated customer experience across all channels – Rapid adoption of mobile is a major driver – Customers want an integrated service across physical and digital channels… Simultaneously – Customer experience is becoming one of the main differentiators • Real-Time, one-on-one marketing can: – Improve a retailer’s relevance with the customer – Increase customer wallet-share • Key to being able to achieve this is: – Identifying and knowing your customer, in depth in real-time – Understanding the opportunity their past behavior reveals – Understanding your inventory (availability, velocity, pipeline)
  16. 29 © Copyright 2000-2014 TIBCO Software Inc. All Customers are

    different… Treat them that way… Capture – Engage – Expand - Monetize Patterns – Real time MORE PERSONAL MORE CONTEXT social CRM POS mobile web e-mails
  17. National Retailer Loyalty 2015. Top Benefits • Smart cross-selling based on iBeacons • Location-based services in real time • Leveraging partner offerings
  18. New Real-Time Fraud Detection Based on Deep Historical Insight Real-time

    fraud action can be taken based on historical insight – system not “whiplashed” by real-time events Streaming Analytics for Gift Card Fraud Protection
  19. 32 © Copyright 2000-2015 TIBCO Software Inc. Internet of Things

    Hybrid Stores Smart Tags Smart Shelves Smart Warehouse Faster Delivery Buy Online Pickup at Store Same Day Delivery Omni Channel 2.0 Store Fulfillment Social Media Predictive Shopping National Retailer Loyalty 2018
  20. Great success stories, but … how to realize these use cases?
  21. 34 © Copyright 2000-2014 TIBCO Software Inc. Real Time Close

    Loop Model Develop model Deploy into Stream Processing flow Act Automatically monitor real-time transactions Automatically trigger action Analyze Analyze data via Data Discovery Uncover patterns, trends, correlations
  22. Agenda – Real World Use Cases – Introduction to Stream

    Processing – Market Overview – Relation to other Big Data Components
  23. Traditional Data Processing: Challenges • Introduces too much “decision latency” into the business. • Responses are delivered “after-the-fact”. • Maximum value of the identified situation is lost. – Cross-sell / up-sell opportunities are lost, impending equipment failure is missed, business processes are slow to respond and lack timely context. • Decisions are made on old and stale data. Diagram: Store, Analyze, Act
  24. The New Era: Fast Data Processing • Events are analyzed and processed in real-time as they arrive. • Decisions are timely, contextual, and based on fresh data. • Decision latency is eliminated, resulting in: – Superior Customer Experience – Operational Excellence – Instant Awareness and Timely Decisions. Diagram: Act, Analyze, Store
  25. Streaming Analytics. Figure: event streams (events 1 to 9) along a time axis. • Continuous Queries • Sliding Windows • Filter • Aggregation • Correlation • …
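
As a rough illustration of three of the concepts listed on this slide (continuous query, filter, sliding-window aggregation), here is a minimal plain-Python sketch. The names `SlidingWindowAverage` and `continuous_query` and the `(timestamp, value)` event shape are assumptions made for the example; they do not come from any of the frameworks or products discussed in the deck.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


class SlidingWindowAverage:
    """Average over the events of the last `width_seconds` (time-based sliding window)."""

    def __init__(self, width_seconds: float) -> None:
        self.width = width_seconds
        self.events: Deque[Tuple[float, float]] = deque()  # (timestamp, value)
        self.total = 0.0

    def add(self, ts: float, value: float) -> float:
        """Add one event and return the current window average."""
        self.events.append((ts, value))
        self.total += value
        # Evict events that have slid out of the window (ts - width, ts].
        while self.events[0][0] <= ts - self.width:
            _, old = self.events.popleft()
            self.total -= old
        return self.total / len(self.events)


def continuous_query(stream: Iterable[Tuple[float, float]],
                     min_value: float,
                     width_seconds: float) -> Iterator[Tuple[float, float]]:
    """Emit an updated window average for every event that passes the filter."""
    window = SlidingWindowAverage(width_seconds)
    for ts, value in stream:
        if value < min_value:      # filter: drop readings below the threshold
            continue
        yield ts, window.add(ts, value)
```

Unlike a database query, the query is registered once and produces a new result whenever a matching event arrives, which is what "continuous" means here.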
  26. 39 Act while data is in motion! Time Business Value

    Business Event Data Ready for Analysis Analysis Completed Decision Made $$$$ $$$ $$ $ Action Taken Stream Processing speeds action and increases business value by seizing opportunities while they matter
  27. Operational Analytics Operations Live UI SENSOR DATA TRANSACTIONS MESSAGE BUS

    MACHINE DATA SOCIAL DATA Streaming Analytics Action Aggregate Rules Stream Processing Analytics Correlate Live Datamart Continuous query processing Alerts Manual action, escalation HISTORICAL ANALYSIS MS Excel SAS Data Scientists Cleansed Data History Data Discovery R Enterprise Service Bus ERP MDM DB WMS SOA BIG DATA Data Warehouse, Hadoop Internal Data Integration Bus API Event Server Streaming Analytics Reference Architecture Spark
  28. Agenda – Real World Use Cases – Introduction to Stream

    Processing – Market Overview – Relation to other Big Data Components
  29. Operational Analytics Operations Live UI SENSOR DATA TRANSACTIONS MESSAGE BUS

    MACHINE DATA SOCIAL DATA Streaming Analytics Action Aggregate Rules Stream Processing Analytics Correlate Live Datamart Continuous query processing Alerts Manual action, escalation HISTORICAL ANALYSIS MS Excel SAS Data Scientists Cleansed Data History Data Discovery R Enterprise Service Bus ERP MDM DB WMS SOA BIG DATA Data Warehouse, Hadoop Internal Data Integration Bus API Event Server Streaming Analytics Reference Architecture Spark
  30. 44 Alternatives for Stream Processing Time to Market Streaming Frameworks

    Streaming Products Slow Fast Streaming Concepts Includes Includes © Copyright 2000-2015 TIBCO Software Inc.
  31. Concepts (Continuous Queries, Sliding Windows) Patterns (Counting, Sequencing, Tracking, Trends)

    Build everything by yourself!  45 What Streaming Alternative do you need? Time to Market Streaming Frameworks Streaming Products Slow Fast Streaming Concepts © Copyright 2000-2015 TIBCO Software Inc.
  32. Usually not an option ... as there are a lot of Frameworks and Products available!
  33. 47 Alternatives © Copyright 2000-2015 TIBCO Software Inc. OPEN SOURCE

    CLOSED SOURCE PRODUCT FRAMEWORK (no complete list!)
  34. Library (Java, .NET, Python) Query Language (often similar to SQL)

    Scalability (horizontal and vertical, fail over) Connectivity (technologies, markets, products) Operators (Filter, Sort, Aggregate) 48 What Streaming Alternative do you need? Time to Market Streaming Frameworks Streaming Products Slow Fast Streaming Concepts © Copyright 2000-2015 TIBCO Software Inc.
  35. 50 Apache Storm – Hello World © Copyright 2000-2015 TIBCO

    Software Inc. http://wpcertification.blogspot.ch/2014/02/helloworld-apache-storm-word-counter.html
  36. Amazon Kinesis – The Cloud ... is easy to set up and scale! But you do not have full control: • Any data that is older than 24 hours is automatically deleted • Every Kinesis application consists of just one procedure, so you can’t use Kinesis to perform complex stream processing unless you connect multiple applications • Kinesis can only support a maximum size of 50KB for each data item. Source: http://diamondstream.com/amazon-kinesis-big-real-time-data-processing-solution/ (blog post from 2014, might be outdated, but shows that you do not have full control over a cloud service)
  37. Apache Spark. General data-processing framework. However, the focus these days is especially on analytics. http://fortune.com/2015/09/09/cloudera-spark-mapreduce/
  38. 55 Apache Spark – Focus on Analytics © Copyright 2000-2015

    TIBCO Software Inc. http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/ http://fortune.com/2015/09/09/cloudera-spark-mapreduce/ http://www.ebaytechblog.com/2014/05/28/using-spark-to-ignite-data-analytics/ http://www.forbes.com/sites/paulmiller/2015/06/15/ibm-backs-apache-spark-for-big-data-analytics/ “[IBM’s initiatives] include: • deepening the integration between Apache Spark and existing IBM products like the Watson Health Cloud; • open sourcing IBM’s existing SystemML machine learning technology;
  39. Spark Streaming • is not a true streaming solution • uses micro-batches • cannot process data in real time (i.e. no ultra-low latency) • allows easy combination with other Spark components (SQL, Machine Learning, etc.)
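
To make the micro-batch point concrete, the following plain-Python sketch groups events into fixed-interval batches. It does not use the Spark API; `micro_batches` and the millisecond timestamps are assumptions made for illustration. A result for an event only becomes available when its batch closes, so latency is bounded below by the batch interval rather than by per-event processing time.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[int, str]  # (timestamp in milliseconds, payload), assumed shape


def micro_batches(events: Iterable[Event],
                  batch_interval_ms: int) -> Iterator[Tuple[int, List[Event]]]:
    """Yield (batch close time, events) for time-ordered events; empty batches are skipped."""
    batch: List[Event] = []
    batch_end = None
    for ts, payload in events:
        if batch_end is None:
            # First batch closes at the next multiple of the interval.
            batch_end = (ts // batch_interval_ms + 1) * batch_interval_ms
        while ts >= batch_end:
            if batch:
                yield batch_end, batch
                batch = []
            batch_end += batch_interval_ms
        batch.append((ts, payload))
    if batch:
        yield batch_end, batch


# The event at 100 ms is only handed to processing at 1000 ms, i.e. 900 ms later.
for close_time, batch in micro_batches([(100, "a"), (400, "b"), (1200, "c")], 1000):
    for ts, payload in batch:
        print(payload, "waited", close_time - ts, "ms")
```

An event-at-a-time engine would instead invoke the processing logic as each event arrives, which is the distinction the slide draws.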
  40. 57 Apache Spark – Hello World © Copyright 2000-2015 TIBCO

    Software Inc. Spark Streaming API Spark Core API
  41. 58 Alternatives © Copyright 2000-2015 TIBCO Software Inc. OPEN SOURCE

    CLOSED SOURCE PRODUCT FRAMEWORK (no complete list!)
  42. Visual IDE (Dev, Test, Debug) Simulation (Feed Testing, Test Generation)

    Live UI (monitoring, proactive interaction) Maturity (24h support, consulting) Integration (ootb integration: ESB, MDM, etc.) Library (Java, .NET, Python) Query Language (often similar to SQL) Scalability (horizontal and vertical, fail over) Connectivity (technologies, markets, products) Operators (Filter, Sort, Aggregate) What Streaming Alternative do you need? Time to Market Streaming Frameworks Streaming Products Slow Fast Streaming Concepts
  43. 61 IBM InfoSphere Streams © Copyright 2000-2015 TIBCO Software Inc.

    https://developer.ibm.com/streamsdev/wp-content/uploads/sites/15/2014/04/Streams-and-Storm-April-2014-Final.pdf
  44. TIBCO StreamBase • Performance: Latency, Throughput, Scalability – Multi-threaded and

    clustered server from version 1 – High throughput: Millions of messages, 100,000s of quotes, 10,000s of orders – Low-latency: microsecond latency for algo trading, pre-trade risk, market data • Take Advantage of High Performance Hardware – Multicore (12, 24, 32 core) large memory (10s of gigabytes) – 64-bit Linux, Windows, Solaris deployment – Hardware acceleration (GPU, Solace, Tervela) • Enterprise Deployment – High availability and fault tolerance – Distributed state management for large data sets – Management and monitoring tools – Security and entitlements Integration – Continuous deployment and QA Process Support StreamSQL compiler and static optimizer In process, in thread adapter architecture Visual parallelism and scaling ActiveSpaces integration for distributed shared state Data parallelism and dispatch StreamBase Server Innovations “The StreamBase engine is for real. We couldn’t break it, and believe me, I tried” SVP Development, Top 5 Broker Dealer
  45. StreamBase: The Power of Visual Programming © Copyright 2000-2015 TIBCO

    Software Inc. 1) Get ideas into market in days or weeks, not months or years 2) Unlock the power of IT and data scientists working together
  46. Code Anyone Can Read. Example flow: Limit Gift Card Activation Amounts at One Location. Diagram labels: Capture card activations per location; Aggregate; Sales too high?; Sales too high!; No Fraud; Log to any database.
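
The flow on this slide is a visual StreamBase application. As a loose textual counterpart, the sketch below aggregates gift card activations per location over a time window and flags locations whose total exceeds a limit. The function name, the event tuple shape, the limit and the window length are all assumptions; the slide does not give concrete values, and logging to a database is left out.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, List, Tuple

Activation = Tuple[float, str, float]  # (timestamp in seconds, location, amount), assumed


def flag_locations(activations: Iterable[Activation],
                   limit: float,
                   window_seconds: float) -> List[Tuple[float, str, float]]:
    """Return (time, location, windowed total) whenever a location exceeds the limit."""
    windows: Dict[str, Deque[Tuple[float, float]]] = defaultdict(deque)
    totals: Dict[str, float] = defaultdict(float)
    flagged: List[Tuple[float, str, float]] = []
    for ts, location, amount in activations:   # assumed to arrive in time order
        window = windows[location]
        window.append((ts, amount))
        totals[location] += amount             # aggregate per location
        while window[0][0] <= ts - window_seconds:
            _, old = window.popleft()          # drop activations outside the window
            totals[location] -= old
        if totals[location] > limit:           # "Sales too high?"
            flagged.append((ts, location, totals[location]))
    return flagged
```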
  47. Visual Debugger Feed Simulation Unit Testing “StreamBase’s modeling tools are

    easy to use and will enable the exchange to quickly react to the ever changing needs of our customers.” Steve Goldman, Director of Enterprise Architecture StreamBase Development Studio
  48. Live Datamart Continuous Query Processor Alerts BusinessEvents FTL EMS ActiveSpaces

    Live Datamart BusinessWorks Social Media Data Market Data Sensor Data Historical Data ActiveSpaces Datagrid Enterprise data Market Data IoT Mobile Social LiveView Desktop Command & Control ACTION Continuous Query
  49. Live Datamart Clients and APIs • Rich Desktop Client –

    Drag&Drop, no coding • Rich Web Client – Drag&Drop, no coding • HTML5 and Javascript API – D3, jQuery, ExtJS, Google Charts, Bing, AngularJS • .NET API – For custom .NET development • Java API – For custom Java GUI development • Combination – Rich Client + HTML5 Extensions
  50. 70 Spoilt for Choice – Which one to choose? ©

    Copyright 2000-2015 TIBCO Software Inc. What are the key aspects?
  51. 71 What do you need (out-of-the-box)? © Copyright 2000-2015 TIBCO

    Software Inc. • A stream processing programming language for streaming analytics • Visual development and debugging instead of coding • Out-of-the-box connectivity to streaming and historical data sources • Performance (real-time vs. micro-batches) • Automated monitoring and alerts • Live UI for proactive human interaction • Maturity and proven deployments • Fault tolerance • Commercial support • Professional services and training
  52. 72 Spoilt for Choice – Framework or Product? © Copyright

    2000-2015 TIBCO Software Inc. Does it make sense to combine both?
  53. Example: Apache Storm + TIBCO Live Datamart External Data Snapshot

    Results Continuous Query Processor Query TIBCO Live Datamart Continuous Alerting Active Tables Active Tables Continuous Updates Clients Message Bus Public Data Customer Data StreamBase Bolt StreamBase Spout Operational Data StreamBase Bolt and Spout connect Apache Storm to StreamBase to provide real-time analytics on operational data
  54. Agenda – Real World Use Cases – Introduction to Stream

    Processing – Market Overview – Relation to other Big Data Components
  55. Operational Analytics Operations Live UI SENSOR DATA TRANSACTIONS MESSAGE BUS

    MACHINE DATA SOCIAL DATA Streaming Analytics Action Aggregate Rules Stream Processing Analytics Correlate Live Datamart Continuous query processing Alerts Manual action, escalation HISTORICAL ANALYSIS MS Excel SAS Data Scientists Cleansed Data History Data Discovery R Enterprise Service Bus ERP MDM DB WMS SOA BIG DATA Data Warehouse, Hadoop Internal Data Integration Bus API Event Server Streaming Analytics Reference Architecture Spark
  56. 76 © Copyright 2000-2014 TIBCO Software Inc. Real Time Close

    Loop Model Develop model Deploy into Stream Processing flow Act Automatically monitor real-time transactions Automatically trigger action Analyze Analyze data via Data Discovery Uncover patterns, trends, correlations
  57. Real Time Close Loop: Understand – Anticipate – Act. Big Data: • store everything in Hadoop, DWH, NoSQL, etc. • even without structure • even if you do not need it today. http://blogs.teradata.com/international/tag/hadoop/
  58. Real Time Close Loop: Understand – Anticipate – Act Data

    Discovery + Statistics + Machine Learning to find insights and patterns in historical data
  59. Real Time Close Loop: Understand – Anticipate – Act Streaming

    Analytics to operationalize insights and patterns in real time Stream Processing Hadoop Open Source R TERR SAS MATLAB In- database analytics Spark
  60. R with Revolution Analytics (now Microsoft) © Copyright 2000-2015 TIBCO

    Software Inc. Open Source GPL License http://www.revolutionanalytics.com/webinars/introducing-revolution-r-open-enhanced-open-source-r-distribution-revolution-analytics
  61. R with TIBCO Enterprise Runtime for R (TERR). TIBCO TERR delivers production-grade R analytics to enterprises: • Flexibility & analytic power of the R language • Time-to-market agility • Enterprise-grade platform. • A TIBCO licensed & supported product • Not GPL, not a repackaging of the open source R engine • Deployment in TIBCO products and 3rd party applications (e.g. Hadoop) http://spotfire.tibco.com/discover-spotfire/what-does-spotfire-do/predictive-analytics/tibco-enterprise-runtime-for-r-terr
  62. Use Open Source R or Not? © Copyright 2000-2015 TIBCO

    Software Inc. http://www.forbes.com/sites/danwoods/2015/01/27/microsofts-revolution-analytics-acquisition-is-the-wrong-way-to-embrace-r/
  63. Spark MLlib. MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs. You can even combine the MLlib module with the R language.
  64. Case Study: Streaming Analytics for Betting • Situation: Today, 80% of Betting is Done After the Game Starts • It’s not your father’s bookie anymore! • Problem: How to Analyze Big Betting Data? • Thousands of concurrent games, constantly adjusting odds, dozens of betting networks – firms must correlate millions of events a day to find the best betting opportunities in real time • Solution: TIBCO for Fast Data Architecture • TXOdds uses TIBCO to correlate, aggregate, and analyze large volumes of streaming betting data in real time and publish innovative predictive betting analytics to their customers • Result: TXOdds First to Market with Innovative Zero Latency Betting Analytics • Innovative real-time analytics give players who can process electronic data in real time the edge “With StreamBase, in two months we had our first betting analytics feed live, and we continually deploy new ideas and evolve our old ones.” - Alex Kozlenkov, VP of technology, TXOdds
  65. “WHEN 5 KEY BOOKIES RAISE THE SAME ODDS IN A 5-SECOND WINDOW, BET LESS”
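
One possible reading of this rule, sketched in plain Python: count how many distinct key bookies raised the odds for the same market within the last five seconds, and emit a "bet less" signal once five of them have. The event shape `(timestamp, bookie, market)`, the function name and the choice to reset the window after a signal are assumptions for illustration, not TXOdds or StreamBase logic.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Set, Tuple

Raise = Tuple[float, str, str]  # (timestamp in seconds, bookie, market), assumed shape


def bet_less_signals(raises: Iterable[Raise],
                     key_bookies: Set[str],
                     min_bookies: int = 5,
                     window_seconds: float = 5.0) -> List[Tuple[float, str]]:
    """Return (time, market) whenever enough key bookies raised odds within the window."""
    recent: Dict[str, Deque[Tuple[float, str]]] = {}
    signals: List[Tuple[float, str]] = []
    for ts, bookie, market in raises:           # assumed to arrive in time order
        if bookie not in key_bookies:           # filter: only key bookies count
            continue
        queue = recent.setdefault(market, deque())
        queue.append((ts, bookie))
        while queue[0][0] < ts - window_seconds:
            queue.popleft()                     # slide the 5-second window forward
        if len({b for _, b in queue}) >= min_bookies:
            signals.append((ts, market))        # "bet less"
            queue.clear()                       # avoid repeating the same signal
    return signals
```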
  66. “WHEN THE REAL-TIME ODDS ARE 5% GREATER THAN THE HISTORICAL SPREAD, INCREASE MY BET”
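
This rule combines a live value with historical context. The slide does not define "historical spread", so the sketch below assumes it means the mean of previously observed odds for the same market; the function name and threshold parameter are likewise illustrative.

```python
from statistics import mean
from typing import Sequence


def should_increase_bet(live_odds: float,
                        historical_odds: Sequence[float],
                        threshold: float = 0.05) -> bool:
    """True if live odds exceed the historical mean by more than `threshold` (5% by default)."""
    if not historical_odds:
        return False  # no history, no basis for comparison
    return live_odds > mean(historical_odds) * (1 + threshold)
```

In a deployment like the one described in the case study, the historical side would come from a store such as Hadoop or a cache, while the live odds arrive on the stream.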
  67. Reference Architecture: Streaming Betting Analytics Event Processing MONITOR REAL-TIME ANALYTICS

    AGGREGATE HISTORICAL COMPARISON Predictive odds analytics Zero Latency Betting Analytics GLOBAL, DISTRIBUTED INFRASTRUCTURE Historical odds deviations B U S BETTING LINES SCORES NEWS HADOOP Context: Historical Betting Data, Odds, Outcomes B U S CACHE CACHE CACHE Real-Time Analytics CORRELATE StreamBase LiveView SOCIAL
  68. Twitter (#TomBradyBrokenLeg) Twitter (#Boston) Brady’s Stats Actionable Insights Real-Time Social

    Media Analytics Twitter (#NFL) Something relevant happening? Every minute counts! Change Odds (automated or manually triggered): • Stop live-betting for the currently running game? • How many interceptions will the Quarterback throw? • Will the Patriots win the Super Bowl? • …
  69. – Streaming Analytics processes Data while it is in Motion!

    – Automation and Proactive Human Interaction are BOTH needed! – Time to Market is the Key Requirement for most Use Cases! Key Messages