New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data Spain 2014

by Big Data Spain

Slide 1

Slide 1 text

NEW USAGE MODEL FOR REAL-TIME ANALYTICS
WILLIAM L. BAIN, CEO AT SCALEOUT SOFTWARE, INC.

Slide 2

Slide 2 text

Slide 3

Slide 3 text

Agenda
• What Is Operational Intelligence?
• Example: Tracking Cable Viewers
• Implementing OI Using an In-Memory Data Grid:
  • Distributing the Data Across a Cluster
  • Integrating Data-Parallel Analysis
  • Building an In-Memory Model
• More Examples of In-Memory Models
• Comparison to Spark and Storm
• Implementing an Example in Financial Services
• Using In-Memory Hadoop MapReduce for OI

Slide 4

Slide 4 text

About the Speaker
• Dr. William Bain, Founder & CEO
• Career focused on parallel computing: Bell Labs, Intel, Microsoft
• 3 prior start-ups, the last acquired by Microsoft; its product now ships as Network Load Balancing in Windows Server
• ScaleOut Software develops and markets In-Memory Data Grids, software middleware for:
  • Scaling application performance and
  • Providing operational intelligence using in-memory data storage and computing
• Nine years in the market, 400 customers, 10,000 servers
[Figure: sample customer logos]

Slide 5

Slide 5 text

Online Systems Need Operational Intelligence
Goal: Provide immediate feedback to a system handling live data. A few examples:
• Ecommerce: for personalized, real-time recommendations
• Equity trading: to minimize risk during a trading day
• Reservations systems: to identify issues, reroute, etc.
• Credit cards & wire transfers: to detect fraud in real time
• Smart grids: to optimize power distribution & detect issues

Slide 6

Slide 6 text

Example: Track Cable TV Viewers
• Goals:
  • Make real-time, personalized upsell offers.
  • Immediately respond to service issues.
  • Track aggregate behavior to identify patterns, e.g.:
    • Total instantaneous incoming event rate
    • Most popular programs and # viewers by zip code
• Requirements:
  • Track events from 10M cable boxes with 25K events/sec (2.2B/day).
  • Correlate, cleanse, and enrich events per rules (e.g. ignore fast channel switches, match channels to programs).
  • Be able to feed enriched events to recommendation engine within 5 sec.
  • Immediately examine any cable box (e.g., box status) & track statistics.
©2011 Tammy Bruce presents LiveWire
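One of the cleansing rules named above, ignoring fast channel switches, could be sketched roughly as follows. This is only an illustration: the class names, the field names, and the idea of a minimum dwell time passed in by the caller are assumptions, not details from the talk.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical channel-change event reported by one cable box. */
final class ChannelEvent {
    final long boxId;
    final int channel;
    final long timestampMs;

    ChannelEvent(long boxId, int channel, long timestampMs) {
        this.boxId = boxId;
        this.channel = channel;
        this.timestampMs = timestampMs;
    }
}

final class EventCleanser {
    /**
     * Drops events whose channel was left again in under minDwellMs.
     * Expects the time-ordered events of a single box. The last event
     * is always kept because its dwell time is not yet known.
     */
    static List<ChannelEvent> dropFastSwitches(List<ChannelEvent> events, long minDwellMs) {
        List<ChannelEvent> kept = new ArrayList<>();
        for (int i = 0; i < events.size(); i++) {
            ChannelEvent current = events.get(i);
            boolean isLast = (i == events.size() - 1);
            if (isLast || events.get(i + 1).timestampMs - current.timestampMs >= minDwellMs) {
                kept.add(current);
            }
        }
        return kept;
    }
}
```

In a grid-based design, a rule like this would presumably run against the event list held inside each cable-box object, so no events need to leave the server that stores the box.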

Slide 7

Slide 7 text

The Result: An OI Platform
Based on a simulated workload for San Diego metropolitan area:
• Continuously correlates and enriches telemetry from 10M simulated set-top boxes (from synthetic load generator).
• Processes more than 30K events/second.
• Enriches events with program information every second.
• Tracks aggregate statistics (e.g., top 10 programs by zip code) every 10 secs.
[Figure: Real-Time Dashboard]

Slide 8

Slide 8 text

Real-Time vs. Batch Analytics (Big Data Analytics)

Batch “Business Intelligence”:
• Static data sets
• Petabytes
• Disk storage
• Minutes to hours
• Best uses:
  • Analyzing warehoused data
  • Mining for long-term trends
• Examples: Hadoop, IBM, Teradata, SAS, SAP

Real-time “Operational Intelligence”:
• Live data sets
• Gigabytes to terabytes
• In-memory storage
• Seconds to minutes
• Best uses:
  • Tracking live data
  • Immediately identifying trends and capturing opportunities
  • Providing immediate feedback
• Examples: Analytics Server, hServer

Slide 9

Slide 9 text

Integrated View of Analytics
• Operational intelligence can co-exist with business intelligence:
  • Processes streaming data close to its sources.
  • Provides real-time, “tactical” feedback (e.g., recommendations, alerts).
  • Transforms data for storage in the data warehouse (ETL).
  • Data warehouse provides “strategic” guidance.
• Using the same tool set (e.g., Hadoop MapReduce) lowers TCO:
  • Leverages common skill set.
  • Simplifies design (e.g., loading data into HDFS).

Slide 10

Slide 10 text

Challenges for Operational Intelligence
• To keep up with fast-growing “live” workloads & maintain fast response times:
  • Track state of entities within a live system.
  • Reliably process updates to data set in real time.
• To identify and respond to trends in fast-changing data:
  • Enrich & evaluate “live” data set in real time.
  • Respond to identified patterns within seconds.
[Chart: Growth in Web Servers, 1995 to 2010, in millions. Source: Netcraft]
[Chart: Growth in “Big Data”, 2000 to 2013, in exabytes]
“More data has been created in the past three years than in the past 40,000.”

Slide 11

Slide 11 text

In-Memory Architecture for Operational Intelligence
• In-memory data grid (IMDG) holds active entities undergoing state changes in memory.
• Backing store optionally holds large population of entities.
• IMDG processes incoming stream of state changes.
• Analytics engine examines entities in real time and generates alerts within seconds as needed.

Slide 12

Slide 12 text

In-Memory Data Grid
In-Memory Data Grid (IMDG) stores “live” data in a cluster:
• Fits in the business logic layer:
  • Follows object-oriented view of data (vs. relational view).
  • Stores collections of Java/.NET/C++ objects shared by multiple clients.
  • Uses create/read/update/delete and query APIs to access data.
• Implemented across a cluster of servers or VMs:
  • Scales storage and throughput by adding servers.
  • Provides high availability in case a server fails.

Slide 13

Slide 13 text

IMDGs Use Object-Oriented Model
• IMDG’s collections of objects act like process collections:
  • Unstructured, typically instances of a class (stored as serialized blobs)
  • Individually accessible / updatable
• IMDG adds attributes:
  • Accessible by global key
  • Queryable by properties
  • Highly available
  • Optional timeouts
  • Distributed locking
  • Integration with a backing store
  • Optional dependency relationships
  • Asynchronous event handling
Basic “CRUD” APIs:
• Create(key, obj, tout)
• Read(key)
• Update(key, obj)
• Delete(key)
and…
• Lock(key)
• Unlock(key)
[Diagram label: Object key]
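To make the shape of the CRUD-plus-locking API concrete, here is a single-process stand-in built on `ConcurrentHashMap`. It is not ScaleOut's API and it is not distributed; the class name `LocalGridSketch`, the millisecond timeout, and the lazy eviction on read are all assumptions made for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Single-JVM sketch of keyed CRUD access with optional timeouts and per-key locks. */
final class LocalGridSketch<V> {
    private static final class Entry<T> {
        final T value;
        final long expiresAtMs;
        Entry(T value, long expiresAtMs) {
            this.value = value;
            this.expiresAtMs = expiresAtMs;
        }
    }

    private final ConcurrentHashMap<String, Entry<V>> store = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** Create: returns false if the key already exists. timeoutMs <= 0 means no expiry. */
    boolean create(String key, V obj, long timeoutMs) {
        read(key); // evicts the entry first if it has already expired
        long expires = timeoutMs > 0 ? System.currentTimeMillis() + timeoutMs : Long.MAX_VALUE;
        return store.putIfAbsent(key, new Entry<>(obj, expires)) == null;
    }

    /** Read: returns null if the key is missing or has timed out. */
    V read(String key) {
        Entry<V> e = store.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() >= e.expiresAtMs) {
            store.remove(key, e);
            return null;
        }
        return e.value;
    }

    /** Update: replaces the value of an existing key, keeping its expiry time. */
    boolean update(String key, V obj) {
        Entry<V> old = store.get(key);
        return old != null && store.replace(key, old, new Entry<>(obj, old.expiresAtMs));
    }

    /** Delete: returns true if the key was present. */
    boolean delete(String key) {
        return store.remove(key) != null;
    }

    /** Lock: blocks until the calling thread owns the lock for this key. */
    void lock(String key) {
        locks.computeIfAbsent(key, k -> new ReentrantLock()).lock();
    }

    /** Unlock: releases the key's lock if the calling thread holds it. */
    void unlock(String key) {
        ReentrantLock l = locks.get(key);
        if (l != null && l.isHeldByCurrentThread()) l.unlock();
    }
}
```

A real IMDG would add what this sketch leaves out: partitioning keys across servers, replication for high availability, and locks that work across processes.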

Slide 14

Slide 14 text

In-Memory, Data-Parallel Computing
• Integrates with IMDG data storage to minimize data motion.
• Ex.: Parallel Method Invocation (PMI), an object-oriented version of data-parallel computing from the HPC community:
  • Selects objects using a parallel query on data hosted in the IMDG.
  • Runs user-defined methods in parallel across the cluster and merges results.
[Diagram: Analyze Data (Eval); Combine Results (Merge); In-Memory Data Grid Runs Data-Parallel Computation.]
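The query, eval, and merge steps can be imitated inside one JVM with Java parallel streams. This is only an analogy for the distributed PMI engine: `LocalPmiSketch` and its `invoke` signature are hypothetical, and the merge function is assumed to be associative with `identity` as its neutral element.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Predicate;

/** Single-JVM analogy of query + eval + merge using a parallel stream. */
final class LocalPmiSketch {
    static <T, P, R> R invoke(List<T> objects,
                              Predicate<T> query,        // selects the objects to analyze
                              P param,                   // parameter handed to every eval call
                              BiFunction<T, P, R> eval,  // analyzes one object
                              BinaryOperator<R> merge,   // combines two partial results
                              R identity) {              // neutral element for merge
        return objects.parallelStream()
                .filter(query)
                .map(obj -> eval.apply(obj, param))
                .reduce(identity, merge);
    }
}
```

For example, `LocalPmiSketch.<Integer, Integer, Integer>invoke(Arrays.asList(1, 2, 3, 4, 5, 6), n -> n % 2 == 0, 10, (n, scale) -> n * scale, Integer::sum, 0)` selects the even numbers, scales each by 10, and sums the results.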

Slide 15

Slide 15 text

Achieving Linear Speedup
Avoid data motion (network or disk I/O), which limits throughput.

Slide 16

Slide 16 text

In-Memory Model of “Live” Entities
Object-oriented model tracks and analyzes real-world entities.
[Diagram labels: In-Memory State in “IMDG”; NoSQL Storage; Real-Time Data Parallel Analysis]

Slide 17

Slide 17 text

Example: Cable Set-Top Boxes
• Each cable box is represented as an object in the IMDG:
  • Object holds raw & enriched event streams, viewer parameters, and statistics.
• IMDG captures incoming events by updating objects.
• IMDG uses data-parallel computation to:
  • immediately enrich box objects to generate alerts to the recommendation engine, and
  • continuously collect and report global statistics.

Slide 18

Slide 18 text

Example in Ecommerce: Inventory Management
Fast map/reduce reconciles inventory and order systems for an online retailer:
• Challenge: Inventory and online order management are handled by different applications.
  • Reconciled once per day.
  • Inaccurate orders reduce margins.
• Solution:
  • Host SKUs in IMDG updated in real time by order & inventory systems.
  • Use MapReduce to reconcile in two minutes.
  • Enables real-time reconciliation to ensure accurate orders.
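A map/reduce-style reconciliation of this kind might look like the sketch below, which sums ordered quantities per SKU and flags SKUs that exceed the stock on hand. The talk does not describe the retailer's actual rules, so `OrderLine`, `ReconcileSketch`, and the "oversold" criterion are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical order line: one SKU and the quantity ordered. */
final class OrderLine {
    final String sku;
    final int quantity;

    OrderLine(String sku, int quantity) {
        this.sku = sku;
        this.quantity = quantity;
    }
}

final class ReconcileSketch {
    /** Returns, in sorted order, the SKUs whose ordered quantity exceeds the quantity on hand. */
    static List<String> oversoldSkus(List<OrderLine> orders, Map<String, Integer> onHand) {
        // "Map" and "reduce" in one step: emit (sku, quantity) and sum per SKU.
        Map<String, Integer> ordered = orders.stream()
                .collect(Collectors.groupingBy(o -> o.sku, TreeMap::new,
                        Collectors.summingInt(o -> o.quantity)));

        // Compare each per-SKU total against the inventory system's count.
        List<String> oversold = new ArrayList<>();
        for (Map.Entry<String, Integer> e : ordered.entrySet()) {
            int available = onHand.getOrDefault(e.getKey(), 0);
            if (e.getValue() > available) {
                oversold.add(e.getKey());
            }
        }
        return oversold;
    }
}
```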

Slide 19

Slide 19 text

Example: Web Shopping
• IMDG holds customer information for active Web users.
• IMDG saves/retrieves customer information from backing store.
• Web browsers send activity information to analytics engine.
• IMDG updates customer history and preferences.
• Analytics engine identifies browsing and buying patterns.
• Analytics engine makes suggestions in real time. Also sends email follow-ups.

Slide 20

Slide 20 text

Example: Retail Shopping
Brick-and-mortar stores use OI to compete with online experience:
• IMDG tracks opt-in customers to make recommendations.
• RFID tags identify product selection and availability in showroom.
• Analytics engine sends real-time advisories to sales staff via tablet.

Slide 21

Slide 21 text

Comparison: IMDGs to Spark
Focus: accelerating business intelligence using in-memory computing:
• In-memory computing to accelerate and extend Hadoop MapReduce using data-parallel operators in Scala.
• Stores data as “resilient distributed datasets” (RDDs):
  • Distributed across cluster
  • Immutable
  • Hold data from/output to HDFS.
• Manages data stream as a sequence of RDDs.
• Comparison to IMDG:
  • Not designed for operational systems:
    • Lacks high availability (uses lineage).
  • Intended for data-parallel operations:
    • Lacks CRUD APIs on individual objects.

Slide 22

Slide 22 text

Comparison to Storm
• Focus: continuous processing of input streams
• Storm implements pipelined execution of tasks by “bolts” on incoming data streams.
• Streams can be distributed to bolts with configurable mappings.
• Developer controls the number of tasks per bolt.
• Storm uses a centralized master node and Zookeeper for fault-tolerance.
• Issues:
  • Managing global state
  • Minimizing data motion
  • Complexity / tuning

Slide 23

Slide 23 text

Implementing an Example in FinServ
• Hedge fund tracks a set of hedging strategies:
  • Strategies can cover various market sectors, such as high-tech, automotive, energy, consumer, real estate, etc.
  • Each strategy contains list of holdings and rules for managing the holdings (such as target allocations).
• Updates to market data continuously arrive during the trading day.
• The challenge: update and analyze a large population of hedging strategies to immediately alert traders.

Slide 24

Slide 24 text

In-Memory Model
• The IMDG holds hedging strategies as an object-oriented collection.
• Updates to market data are managed as a series of snapshot objects.
• The IMDG performs repeated data-parallel analysis on hedging strategies to generate alerts.
• Merges alerts and feeds to traders in real time.
• IMDG automatically and dynamically scales its throughput to handle new hedging strategies by adding servers.

Slide 25

Slide 25 text

Implementing the Analysis
Step 1: Select all objects using parallel query of strategy objects:
• Query spec matches data’s object-oriented properties.
• Selected objects are fed to the analysis engine on each local server.

Slide 26

Slide 26 text

Java Example: Parallel Query

    public class Portfolio {
        private long id;
        private Set longPositions;
        private Set shortPositions;
        private double totalValue;
        private Region region;
        private boolean alerted; // alert for trading

        @SossIndexAttribute // query-able property
        public double getTotalValue() {…}

        @SossIndexAttribute // query-able property
        public Region getRegion() {…}

        public Set evalPositions(MarketSnapshot ms) {…}
    }

    NamedCache pset = CacheFactory.getCache("portfolios");
    Set res = pset.queryObjects(Portfolio.class,
        and(greaterThan("totalValue", 1000000),
            equals("region", Region.US)));

Slide 27

Slide 27 text

Implementing the Analysis
Step 2: Create parallel methods to update and analyze the queried collection of hedging strategies:
• “Eval” method applies market snapshot to an instance of a strategy object:
  • Compare to a MapReduce mapper; adds an input parameter.
  • Updates the strategy object’s positions.
  • Analyzes the positions for a deviation from allowed rules.
  • Optionally generates an alert.
• “Merge” method combines alerts across the collection of strategies:
  • Compare to a MapReduce combiner.
  • Uses binary combining.
  • Is applied globally to the object collection by the IMDG (unlike a MapReduce reducer).
• Note: both methods access hydrated objects, avoiding the need for CRUD access.

Slide 28

Slide 28 text

Java Example: Parallel Method Invocation
• Create a method to analyze a queried portfolio and another method to pair-wise merge the result sets of alerted portfolios:

    public class PortfolioAnalysis
            implements Invokable<Portfolio, MarketSnapshot, Set<Long>> {

        public Set<Long> eval(Portfolio p, MarketSnapshot ms) throws InvokeException {
            // update portfolio and return id if alerted:
            return p.evalPositions(ms);
        }

        public Set<Long> merge(Set<Long> set1, Set<Long> set2) throws InvokeException {
            set1.addAll(set2);
            return set1; // merged set of alerted portfolio ids
        }
    }

Slide 29

Slide 29 text

Java Example: Parallel Method Invocation
• Run a parallel method invocation on a queried set of portfolios and return the set of ids for alerted portfolios:

    NamedCache pset = CacheFactory.getCache("portfolios");
    InvokeResult alertedPortfolios = pset.invoke(
        PortfolioAnalysis.class,
        Portfolio.class,
        and(greaterThan("totalValue", 1000000), // query spec
            equals("region", Region.US)),
        marketSnapshot, // parameters
        ...
    );
    System.out.println("The alerted portfolios are " + alertedPortfolios.getResult());

Slide 30

Slide 30 text

Running the Analysis
• IMDG ships user’s code and libraries to its servers.
• IMDG automatically schedules analysis operations across all grid servers and cores:
  • The analysis runs on all objects selected by the parallel query.
  • Each grid server analyzes its locally stored objects to minimize data motion.
• Parallel execution ensures fast completion time:
  • IMDG automatically distributes workload across servers/cores.
  • Scaling the IMDG automatically handles larger data sets.

Slide 31

Slide 31 text

Merging the Results
• The IMDG automatically merges all analysis results:
  • The IMDG first merges all results within each grid server in parallel.
  • It then merges results across all grid servers to create one combined result.
• Efficient parallel merge minimizes the delay in combining all results.
• The IMDG delivers the combined result to the invoking application as one object.
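The two merge phases described above can be sketched in plain Java, with each inner list standing in for the results held by one grid server. `TwoLevelMergeSketch` is a hypothetical name, and the sketch assumes the merge function is associative so that grouping does not change the result.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BinaryOperator;

/** Merges results first within each "server", then across servers. */
final class TwoLevelMergeSketch {
    static <R> Optional<R> mergeAll(List<List<R>> resultsPerServer, BinaryOperator<R> merge) {
        return resultsPerServer.parallelStream()
                // Phase 1: every server merges its own local results.
                .map(local -> local.stream().reduce(merge))
                // Servers that produced no results contribute nothing.
                .filter(Optional::isPresent)
                .map(Optional::get)
                // Phase 2: merge the per-server results into one combined result.
                .reduce(merge);
    }
}
```

The returned `Optional` is empty only when no server produced any result at all.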

Slide 32

Slide 32 text

Output: Real-Time Alerts
• In-memory analysis delivers a set of alerts to traders every 300 msec.
• Enables the trader to examine strategy details in real time.

Slide 33

Slide 33 text

Sample Performance Results for PMI
• Measured a similar financial services application (back testing stock trading strategies on stock histories)
• Hosted IMDG in Amazon EC2 using 75 servers holding 1 TB of stock history data in memory
• IMDG handled a continuous stream of updates (1.1 GB/s)
• Results: analyzed 1 TB in 4.1 seconds (250 GB/s) with linear scaling
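The quoted throughput can be checked with quick arithmetic. The calculation below assumes 1 TB is counted as 1024 GB (the slide does not say which convention was used) and that the work was spread evenly over the 75 servers:

```latex
\[
\frac{1024\ \text{GB}}{4.1\ \text{s}} \approx 249.8\ \text{GB/s} \approx 250\ \text{GB/s},
\qquad
\frac{249.8\ \text{GB/s}}{75\ \text{servers}} \approx 3.3\ \text{GB/s per server}.
\]
```

With 1 TB taken as 1000 GB instead, the aggregate figure would be about 243.9 GB/s, which still rounds to the same order of magnitude.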

Slide 34

Slide 34 text

In-Memory MapReduce
Benefits:
• Enables use of standard Hadoop MapReduce for operational intelligence.
• Accelerates data access by holding data in memory.
• Analyzes and updates “live” data.
• Reduces overheads of standard Hadoop distributions:
  • Batch scheduling
  • Disk access
  • Data shuffling
  • Mandatory key sorting
• Enables new features, e.g.:
  • Global combining, optional sorting

Slide 35

Slide 35 text

Running MapReduce on an IMDG
• A Hadoop distribution does not have to be installed unless HDFS is used.
• The developer starts MapReduce applications from a remote workstation.
• The IMDG automatically builds a reusable “invocation grid” of JVMs on the grid’s servers for PMI and ships the application’s jars.
• Results are stored in the IMDG, HDFS, or optionally globally merged and returned to the remote workstation.

Slide 36

Slide 36 text

Run In-Memory MR with YARN
• YARN transparently integrates batch and in-memory MapReduce into a single execution framework with shared access to HDFS.
• For example, IMDG can transparently run Apache Hive in-memory.
[Figure: Example of ScaleOut hServer with Hortonworks]
[Figure: Example of Hive Running on IMDG]

Slide 37

Slide 37 text

Implementing MapReduce
Run MapReduce as two PMI phases:
• Data can be input from either the IMDG or an external data source.
• Works with any input/output format compatible with the Apache distribution.
• IMDG uses its data-parallel execution engine (PMI) to invoke the mappers and the reducers.
  • Eliminates batch scheduling overhead.
• Intermediate results are stored within the IMDG.
  • Minimizes data motion between the mappers and reducers.
  • Allows optional sorting.
• Output of a single reducer/combiner optionally can be globally merged.
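The idea of running MapReduce as two data-parallel phases with in-memory intermediate results can be illustrated with a word count confined to one JVM. This is a sketch of the structure only, not hServer code: `InMemoryWordCountSketch` is a hypothetical name, and a `ConcurrentHashMap` stands in for the grid's intermediate storage.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

/** Word count as two parallel phases with an in-memory intermediate store. */
final class InMemoryWordCountSketch {
    static Map<String, Integer> run(List<String> lines) {
        // Phase 1 (mappers): emit (word, 1) pairs into memory, grouped by key.
        Map<String, List<Integer>> intermediate = new ConcurrentHashMap<>();
        lines.parallelStream().forEach(line -> {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    intermediate
                            .computeIfAbsent(word, k -> Collections.synchronizedList(new ArrayList<Integer>()))
                            .add(1);
                }
            }
        });

        // Phase 2 (reducers): sum the values of each key in parallel.
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        intermediate.entrySet().parallelStream().forEach(e -> {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        });

        // Sorting is optional; done here only to make the output order stable.
        return new TreeMap<>(counts);
    }
}
```

Because the intermediate pairs never leave memory, the only sort in this sketch is the final, optional one on the output keys.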

Slide 38

Slide 38 text

Accessing IMDG Data for M/R
• IMDG adds grid input format for accessing key/value pairs held in the IMDG.
• MapReduce programs optionally can output results to IMDG with grid output format.
• Grid Record Reader optimizes access to key/value pairs to eliminate network overhead.
• Applications can access and update key/value pairs as operational data during analysis.

Slide 39

Slide 39 text

Optional Caching of HDFS Data
• IMDG adds Dataset Record Reader (wrapper) to cache HDFS data during program execution.
• Hadoop automatically retrieves data from IMDG on subsequent runs.
• Dataset Record Reader stores and retrieves data with minimum network and memory overheads.
• Tests with Terasort benchmark have demonstrated 11X faster access latency over HDFS without IMDG.

Slide 40

Slide 40 text

In-Memory Storage Models
IMDG needs multiple in-memory storage models:
• Named cache, optimized for rich semantics on large objects:
  • Property-based query
  • Distributed locking
  • Access from remote grids
• Named map, optimized for efficient storage and bulk analysis (e.g., MapReduce):
  • Highly efficient object storage
  • Pipelined, bulk-access mechanisms

Slide 41

Slide 41 text

In-Memory Storage Optimizations
In-Memory Concurrent Map:
• Stores key/value pairs in chunks.
• Allows CRUD operations on key/value pairs.
• Automatically organizes chunks into splits.
• Uses per-split hash table to access keys and manage multi-valued keys.
• Stores shuffled data set between mappers and reducers.
• Pipelines chunks to mappers and from reducers.
• Optionally uses memory-mapped files to reduce access latency.
• Provides support for sorting keys.

Slide 42

Slide 42 text

In-Memory M/R Optimizations
• MapReduce optimizations:
  • Optional sorting
  • Optional multicast of parameters to mappers
  • Optional O(logN) global combining (avoids single, sequential reducer)
  • Optional HDFS caching
  • Optional reuse of JVMs across jobs
• Measured performance:
  • Startup times reduced to a few milliseconds
  • Word count benchmark shows 20X speedup.
  • Real-world example shows >40X speedup.
• Current limitations:
  • No specific security for multi-tenancy
  • Intermediate data must fit in the IMDG
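The O(logN) global combining mentioned above can be pictured as a pairwise tree reduction: each round halves the number of partial results, so N partials need about log2(N) rounds. The sketch below runs the rounds sequentially and counts them; in a grid, the merges within a round would presumably run in parallel on different servers. `TreeCombineSketch` and `Outcome` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

/** Pairwise (tree) combining of partial results, counting the rounds needed. */
final class TreeCombineSketch {
    static final class Outcome<R> {
        final R value;
        final int rounds;

        Outcome(R value, int rounds) {
            this.value = value;
            this.rounds = rounds;
        }
    }

    static <R> Outcome<R> combine(List<R> partials, BinaryOperator<R> merge) {
        if (partials.isEmpty()) {
            throw new IllegalArgumentException("no partial results to combine");
        }
        List<R> level = new ArrayList<>(partials);
        int rounds = 0;
        while (level.size() > 1) {
            List<R> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                if (i + 1 < level.size()) {
                    // Merge neighbouring pairs; these merges are independent of each other.
                    next.add(merge.apply(level.get(i), level.get(i + 1)));
                } else {
                    // An unpaired result is carried into the next round unchanged.
                    next.add(level.get(i));
                }
            }
            level = next;
            rounds++;
        }
        return new Outcome<>(level.get(0), rounds);
    }
}
```

Combining eight partial sums this way takes three rounds, compared with seven sequential merge steps in a single reducer.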

Slide 43

Slide 43 text

Accelerating Start-Up Times
• Re-use in-memory context across MapReduce jobs:

    public static void main(String argv[]) throws Exception {
        // Configure and load the invocation grid
        InvocationGrid grid = HServerJob.getInvocationGridBuilder("myGrid")
            // Add JAR files as IG dependencies
            .addJar("main-job.jar")
            .addJar("first-library.jar")
            // Add classes as IG dependencies
            .addClass(MyMapper.class)
            .addClass(MyReducer.class)
            // Define custom JVM parameters
            .setJVMParameters("-Xms512M -Xmx1024M")
            .load();

        // Run 10 jobs on the same invocation grid
        for (int i = 0; i < 10; i++) {
            Configuration conf = new Configuration();
            // The preloaded invocation grid is passed as the parameter to the job
            Job job = new HServerJob(conf, "Job number " + i, false, grid);
            // ......Configure the job here.........
            // Run the job
            job.waitForCompletion(true);
        }

        // Unload the invocation grid when we are done
        grid.unload();
    }

Slide 44

Slide 44 text

Recap
• Online systems need operational intelligence on “live” data for immediate feedback.
• Operational intelligence can be implemented using an IMDG integrated with data-parallel analysis.
• IMDGs track “live” state:
  • Model real-world entities as a highly available object collection.
  • Enable updates to track changes.
  • Use data-parallel computation for immediate feedback with low latency.
  • Can run standard MapReduce.

Slide 45

Slide 45 text

Thank you!

Slide 46

Slide 46 text

Parallel Query Example (C#)
• Mark class properties as indexes for query:

    class Stock {
        [SossIndex]
        public string Ticker { get; set; }
        public decimal TotalShares { get; set; }
        public decimal Price { get; set; }
    }

• Define a query using these properties:

    NamedCache cache = CacheFactory.GetCache("Stocks");
    var q = from s in cache.QueryObjects<Stock>()
            where s.Ticker == "GOOG" || s.Ticker == "ORCL"
            select s;
    Console.WriteLine("{0} Stocks found", q.Count());

Slide 47

Slide 47 text

Example of Analysis Code (C#)
• Create method to analyze each queried stock object:

    // "params" is a reserved word in C#, so the parameter is named calcParams here
    static decimal eval(Stock stock, StockCalcParams calcParams) {
        return stock.Price * stock.TotalShares;
    }

• Create method to pair-wise merge the results:

    static decimal merge(decimal r1, decimal r2) {
        return r1 + r2;
    }

Slide 48

Slide 48 text

Invoking the Parallel Analysis (C#)
• Run a parallel method invocation:

    NamedCache cache = CacheFactory.GetCache("Stocks");
    decimal valueOfSelectedStocks =
        (from s in cache.QueryObjects<Stock>()
         where s.Ticker == "GOOG" || s.Ticker == "ORCL"
         select s)
        .Invoke(new StockCalcParams(…), new Func<Stock, StockCalcParams, decimal>(eval))
        .Merge(new Func<decimal, decimal, decimal>(merge));
    Console.WriteLine("The value of selected stocks is {0}", valueOfSelectedStocks);

Slide 49

Slide 49 text

17TH ~ 18th NOV 2014 MADRID (SPAIN)