New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data Spain 2014

Operational systems manage our finances, shopping, devices and much more. Adding real-time analytics to these systems enables them to instantly respond to changing conditions and provide immediate, targeted feedback. This use of analytics is called “operational intelligence,” and the need for it is widespread. Financial trading applications must rapidly respond to fluctuating market conditions as market data flows through trading systems. E-commerce systems must reconcile orders with inventory changes on a second-by-second basis and must respond quickly to shopping behavior to offer personalized recommendations. Smart grid monitoring systems need to continuously analyze telemetry from many sources to anticipate and respond to unexpected changes in power grids.

Big Data Spain

November 25, 2014

Transcript

  1. NEW USAGE MODEL FOR REAL-TIME ANALYTICS

    WILLIAM L. BAIN, CEO AT SCALEOUT SOFTWARE, INC.
  2. Using In-Memory Models of Real-World Systems for Operational Intelligence

    Bill Bain, CEO. Big Data Hispano, November 17, 2014. Copyright © 2014 by ScaleOut Software, Inc.
  3. Agenda

    • What Is Operational Intelligence?
    • Example: Tracking Cable Viewers
    • Implementing OI Using an In-Memory Data Grid:
      • Distributing the Data Across a Cluster
      • Integrating Data-Parallel Analysis
      • Building an In-Memory Model
    • More Examples of In-Memory Models
    • Comparison to Spark and Storm
    • Implementing an Example in Financial Services
    • Using In-Memory Hadoop MapReduce for OI
  4. About the Speaker

    • Dr. William Bain, Founder & CEO
    • Career focused on parallel computing: Bell Labs, Intel, Microsoft
    • 3 prior start-ups; the last was acquired by Microsoft, and its product now ships as Network Load Balancing in Windows Server
    • ScaleOut Software develops and markets In-Memory Data Grids, software middleware for scaling application performance and providing operational intelligence using in-memory data storage and computing
    • Nine years in the market, 400 customers, 10,000 servers
  5. Online Systems Need Operational Intelligence

    Goal: Provide immediate feedback to a system handling live data. A few examples:
    • Ecommerce: for personalized, real-time recommendations
    • Equity trading: to minimize risk during a trading day
    • Reservations systems: to identify issues, reroute, etc.
    • Credit cards & wire transfers: to detect fraud in real time
    • Smart grids: to optimize power distribution & detect issues
  6. Example: Track Cable TV Viewers

    • Goals:
      • Make real-time, personalized upsell offers.
      • Immediately respond to service issues.
      • Track aggregate behavior to identify patterns, e.g.:
        • Total instantaneous incoming event rate
        • Most popular programs and number of viewers by zip code
    • Requirements:
      • Track events from 10M cable boxes with 25K events/sec (2.2B/day).
      • Correlate, cleanse, and enrich events per rules (e.g., ignore fast channel switches, match channels to programs).
      • Be able to feed enriched events to recommendation engine within 5 sec.
      • Immediately examine any cable box (e.g., box status) & track statistics.
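
As a quick consistency check on the stated requirement, the daily volume follows from the per-second rate. This is a worked calculation added for illustration, not part of the original slide:

```latex
25{,}000\ \tfrac{\text{events}}{\text{s}} \times 86{,}400\ \tfrac{\text{s}}{\text{day}}
  = 2.16 \times 10^{9}\ \tfrac{\text{events}}{\text{day}} \approx 2.2\ \text{B/day},
\qquad
\frac{2.16 \times 10^{9}\ \text{events/day}}{10^{7}\ \text{boxes}} = 216\ \tfrac{\text{events}}{\text{box} \cdot \text{day}}
```

Spread evenly, that is roughly one event per box every 400 seconds, so the per-box load is small and the difficulty lies in the aggregate rate.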
  7. The Result: An OI Platform

    Based on a simulated workload for the San Diego metropolitan area:
    • Continuously correlates and enriches telemetry from 10M simulated set-top boxes (from synthetic load generator).
    • Processes more than 30K events/second.
    • Enriches events with program information every second.
    • Tracks aggregate statistics (e.g., top 10 programs by zip code) every 10 secs.
    (Figure: Real-Time Dashboard)
  8. Big Data Analytics: Real-Time vs. Batch Analytics

    Real-time (“Operational Intelligence”):
    • Live data sets
    • Gigabytes to terabytes
    • In-memory storage
    • Seconds to minutes
    • Best uses: tracking live data; immediately identifying trends and capturing opportunities; providing immediate feedback
    • Products shown: Analytics Server, hServer
    Batch (“Business Intelligence”):
    • Static data sets
    • Petabytes
    • Disk storage
    • Minutes to hours
    • Best uses: analyzing warehoused data; mining for long-term trends
    • Products shown: Hadoop, IBM, Teradata, SAS, SAP
  9. Integrated View of Analytics

    • Operational intelligence can co-exist with business intelligence:
      • Processes streaming data close to its sources.
      • Provides real-time, “tactical” feedback (e.g., recommendations, alerts).
      • Transforms data for storage in the data warehouse (ETL).
      • Data warehouse provides “strategic” guidance.
    • Using the same tool set (e.g., Hadoop MapReduce) lowers TCO:
      • Leverages common skill set.
      • Simplifies design (e.g., loading data into HDFS).
  10. Challenges for Operational Intelligence

    • To keep up with fast-growing “live” workloads & maintain fast response times:
      • Track state of entities within a live system.
      • Reliably process updates to data set in real time.
    • To identify and respond to trends in fast-changing data:
      • Enrich & evaluate “live” data set in real time.
      • Respond to identified patterns within seconds.
    (Charts: Growth in Web Servers, in millions, 1995 to 2010, source: Netcraft; Growth in “Big Data”, in exabytes, 2000 to 2013.)
    “More data has been created in the past three years than in the past 40,000.”
  11. In-Memory Architecture for Operational Intelligence

    • In-memory data grid (IMDG) holds active entities undergoing state changes in memory.
    • Backing store optionally holds large population of entities.
    • IMDG processes incoming stream of state changes.
    • Analytics engine examines entities in real time and generates alerts within seconds as needed.
  12. In-Memory Data Grid

    In-Memory Data Grid (IMDG) stores “live” data in a cluster:
    • Fits in the business logic layer:
      • Follows object-oriented view of data (vs. relational view).
      • Stores collections of Java/.NET/C++ objects shared by multiple clients.
      • Uses create/read/update/delete and query APIs to access data.
    • Implemented across a cluster of servers or VMs:
      • Scales storage and throughput by adding servers.
      • Provides high availability in case a server fails.
  13. IMDGs Use Object-Oriented Model

    • IMDG’s collections of objects act like process collections:
      • Unstructured, typically instances of a class (stored as serialized blobs)
      • Individually accessible / updatable
    • IMDG adds attributes:
      • Accessible by global key
      • Queryable by properties
      • Highly available
      • Optional timeouts
      • Distributed locking
      • Integration with a backing store
      • Optional dependency relationships
      • Asynchronous event handling
    Basic “CRUD” APIs:
    • Create(key, obj, tout)
    • Read(key)
    • Update(key, obj)
    • Delete(key)
    and…
    • Lock(key)
    • Unlock(key)
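
To make the CRUD-plus-locking surface listed above concrete, here is a minimal single-process sketch in Java. It is not the ScaleOut API: the class `GridStoreSketch`, its method signatures, and the convention that a timeout of zero or less means "no timeout" are all assumptions made for illustration. A real IMDG would distribute and replicate these objects across servers.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Single-process stand-in for an IMDG object collection (hypothetical; not a ScaleOut API). */
class GridStoreSketch<V> {

    /** A stored object plus the absolute time at which it expires. */
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry<V>> objects = new ConcurrentHashMap<>();
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** Create(key, obj, tout): stores obj unless the key is taken; timeoutMillis <= 0 means no timeout. */
    boolean create(String key, V obj, long timeoutMillis) {
        read(key); // evicts the existing entry first if it has already timed out
        long expiry = timeoutMillis <= 0 ? Long.MAX_VALUE : System.currentTimeMillis() + timeoutMillis;
        return objects.putIfAbsent(key, new Entry<>(obj, expiry)) == null;
    }

    /** Read(key): returns the object, or null if it is absent or has timed out. */
    V read(String key) {
        Entry<V> e = objects.get(key);
        if (e == null) {
            return null;
        }
        if (System.currentTimeMillis() >= e.expiresAtMillis) {
            objects.remove(key, e);
            return null;
        }
        return e.value;
    }

    /** Update(key, obj): replaces an existing object and keeps its original expiry time. */
    boolean update(String key, V obj) {
        if (read(key) == null) {
            return false;
        }
        return objects.computeIfPresent(key, (k, old) -> new Entry<>(obj, old.expiresAtMillis)) != null;
    }

    /** Delete(key): removes the object; returns false if there was nothing to remove. */
    boolean delete(String key) {
        return objects.remove(key) != null;
    }

    /** Lock(key): blocks until the calling thread holds the per-key lock. */
    void lock(String key) {
        locks.computeIfAbsent(key, k -> new ReentrantLock()).lock();
    }

    /** Unlock(key): releases the per-key lock if the calling thread holds it. */
    void unlock(String key) {
        ReentrantLock l = locks.get(key);
        if (l != null && l.isHeldByCurrentThread()) {
            l.unlock();
        }
    }

    public static void main(String[] args) {
        GridStoreSketch<String> store = new GridStoreSketch<>();
        store.create("box-42", "channel 7", 0);
        store.lock("box-42");
        try {
            store.update("box-42", "channel 9"); // read-modify-write under the lock
        } finally {
            store.unlock("box-42");
        }
        System.out.println(store.read("box-42"));
        store.delete("box-42");
    }
}
```

The lock here is advisory, as in many key/value stores: callers that skip `lock` can still update the object, so cooperating clients must follow the same locking discipline.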
  14. In-Memory, Data-Parallel Computing

    • Integrates with IMDG data storage to minimize data motion.
    • Ex.: Parallel Method Invocation (PMI), an object-oriented version of data-parallel computing from the HPC community:
      • Selects objects using a parallel query on data hosted in the IMDG.
      • Runs user-defined methods in parallel across the cluster and merges results.
    (Figure: In-Memory Data Grid runs data-parallel computation: Analyze Data (Eval), then Combine Results (Merge).)
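
The query, eval, and merge pattern described above can be approximated on a single machine with Java parallel streams. This is a sketch of the pattern only, not ScaleOut's PMI implementation: `PmiSketch`, `invoke`, `Box`, and `countViewers` are hypothetical names, and a real grid would run the eval step on the server that already holds each object.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Predicate;

/** Single-machine sketch of a query / eval / merge computation (hypothetical names). */
class PmiSketch {

    /** Minimal stand-in for an object hosted in the grid. */
    static final class Box {
        final String zip;
        final int channel;

        Box(String zip, int channel) {
            this.zip = zip;
            this.channel = channel;
        }
    }

    /**
     * Selects objects with a query, applies eval to each selected object in parallel,
     * and combines the per-object results pairwise with merge (which must be associative).
     */
    static <T, P, R> Optional<R> invoke(Collection<T> objects, Predicate<T> query, P param,
                                        BiFunction<T, P, R> eval, BinaryOperator<R> merge) {
        return objects.parallelStream()
                .filter(query)
                .map(obj -> eval.apply(obj, param))
                .reduce(merge);
    }

    /** Counts the boxes in one zip code that are tuned to the given channel. */
    static int countViewers(Collection<Box> boxes, String zip, int channel) {
        Optional<Integer> total = invoke(
                boxes,
                box -> box.zip.equals(zip),              // query
                channel,                                 // parameter passed to every eval call
                (box, ch) -> box.channel == ch ? 1 : 0,  // eval
                Integer::sum);                           // merge
        return total.orElse(0);
    }

    public static void main(String[] args) {
        List<Box> boxes = Arrays.asList(
                new Box("92101", 7), new Box("92101", 5),
                new Box("92101", 7), new Box("92037", 7));
        System.out.println("Viewers: " + countViewers(boxes, "92101", 7)); // prints "Viewers: 2"
    }
}
```

Requiring an associative merge is what lets the runtime combine partial results in any grouping, which is the property a grid needs to merge per-server results independently.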
  15. Achieving Linear Speedup

    Avoid data motion (network or disk I/O), which limits throughput.
  16. In-Memory Model of “Live” Entities

    Object-oriented model tracks and analyzes real-world entities.
    (Figure labels: In-Memory State in “IMDG”; NoSQL Storage; Real-Time Data Parallel Analysis.)
  17. Example: Cable Set-Top Boxes

    • Each cable box is represented as an object in the IMDG:
      • Object holds raw & enriched event streams, viewer parameters, and statistics.
    • IMDG captures incoming events by updating objects.
    • IMDG uses data-parallel computation to:
      • immediately enrich box objects to generate alerts to the recommendation engine, and
      • continuously collect and report global statistics.
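
As an illustration of one box object absorbing events and enriching them, the sketch below applies two rules named earlier in the deck: ignore fast channel switches and match channels to programs. The class `SetTopBoxSketch`, the 5-second dwell threshold, and the program guide contents are assumptions for the example, not details from the talk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-box object: captures channel changes and keeps only enriched viewing events. */
class SetTopBoxSketch {
    /** Assumed threshold: a channel watched for less than this is treated as a fast switch. */
    static final long MIN_DWELL_MILLIS = 5_000;

    private final Map<Integer, String> programGuide; // channel -> program name
    private final List<String> enrichedEvents = new ArrayList<>();
    private long tunedAtMillis = -1;                 // -1 until the first event arrives
    private int tunedChannel;
    private int droppedSwitches;

    SetTopBoxSketch(Map<Integer, String> programGuide) {
        this.programGuide = programGuide;
    }

    /** Records a channel change; the channel being left is enriched or dropped based on dwell time. */
    synchronized void onChannelChange(long timeMillis, int newChannel) {
        if (tunedAtMillis >= 0) {
            long dwell = timeMillis - tunedAtMillis;
            if (dwell >= MIN_DWELL_MILLIS) {
                String program = programGuide.getOrDefault(tunedChannel, "unknown");
                enrichedEvents.add("watched " + program + " for " + (dwell / 1000) + "s");
            } else {
                droppedSwitches++; // cleanse: too short to count as viewing
            }
        }
        tunedAtMillis = timeMillis;
        tunedChannel = newChannel;
    }

    synchronized List<String> enrichedEvents() {
        return Collections.unmodifiableList(new ArrayList<>(enrichedEvents));
    }

    synchronized int droppedSwitches() {
        return droppedSwitches;
    }

    public static void main(String[] args) {
        Map<Integer, String> guide = new HashMap<>();
        guide.put(7, "News");
        guide.put(9, "Movie");
        SetTopBoxSketch box = new SetTopBoxSketch(guide);
        box.onChannelChange(0, 2);
        box.onChannelChange(1_000, 3);  // channel 2 watched 1s: dropped
        box.onChannelChange(2_000, 7);  // channel 3 watched 1s: dropped
        box.onChannelChange(60_000, 9); // channel 7 watched 58s: enriched
        System.out.println(box.enrichedEvents() + ", dropped=" + box.droppedSwitches());
    }
}
```

Keeping the cleansing rule inside the object means each update touches only that box's state, which is what allows a grid to process boxes independently and in parallel.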
  18. Example in Ecommerce: Inventory Management

    Fast map/reduce reconciles inventory and order systems for an online retailer:
    • Challenge:
      • Inventory and online order management are handled by different applications.
      • Reconciled once per day.
      • Inaccurate orders reduce margins.
    • Solution:
      • Host SKUs in IMDG updated in real time by order & inventory systems.
      • Use MapReduce to reconcile in two minutes.
      • Enables real-time reconciliation to ensure accurate orders.
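
A reconciliation pass of this kind can be sketched as a map step that totals ordered quantity per SKU, followed by a comparison against on-hand inventory. The code below is an assumed, simplified illustration in plain Java: `ReconcileSketch`, `Order`, and `shortfalls` are hypothetical names, and the retailer's actual rules are not described in the talk.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical sketch: find SKUs whose open orders exceed on-hand inventory. */
class ReconcileSketch {

    static final class Order {
        final String sku;
        final int quantity;

        Order(String sku, int quantity) {
            this.sku = sku;
            this.quantity = quantity;
        }
    }

    /** Returns SKU -> shortfall for every SKU that is ordered beyond what is on hand. */
    static Map<String, Integer> shortfalls(Map<String, Integer> onHand, List<Order> orders) {
        // Map/combine step: total ordered quantity per SKU, computed in parallel.
        Map<String, Integer> ordered = orders.parallelStream()
                .collect(Collectors.groupingBy(o -> o.sku, Collectors.summingInt(o -> o.quantity)));

        // Reduce step: compare each SKU's total against inventory (absent SKUs count as zero).
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : ordered.entrySet()) {
            int missing = e.getValue() - onHand.getOrDefault(e.getKey(), 0);
            if (missing > 0) {
                result.put(e.getKey(), missing);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> onHand = new HashMap<>();
        onHand.put("A", 10);
        onHand.put("B", 5);
        List<Order> orders = Arrays.asList(
                new Order("A", 4), new Order("A", 7), new Order("B", 1), new Order("C", 2));
        System.out.println(shortfalls(onHand, orders)); // prints {A=1, C=2}
    }
}
```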
  19. Example: Web Shopping

    • IMDG holds customer information for active Web users.
    • IMDG saves/retrieves customer information from backing store.
    • Web browsers send activity information to analytics engine.
    • IMDG updates customer history and preferences.
    • Analytics engine identifies browsing and buying patterns.
    • Analytics engine makes suggestions in real time and also sends email follow-ups.
  20. Example: Retail Shopping

    Brick and mortar stores use OI to compete with online experience:
    • IMDG tracks opt-in customers to make recommendations.
    • RFID tags identify product selection and availability in showroom.
    • Analytics engine sends real-time advisories to sales staff via tablet.
  21. Comparison: IMDGs to Spark

    Focus: accelerating business intelligence using in-memory computing:
    • In-memory computing to accelerate and extend Hadoop MapReduce using data-parallel operators in Scala.
    • Stores data as “resilient distributed datasets” (RDDs):
      • Distributed across cluster
      • Immutable
      • Hold data from/output to HDFS.
    • Manages data stream as a sequence of RDDs.
    • Comparison to IMDG:
      • Not designed for operational systems:
        • Lacks high availability (uses lineage).
      • Intended for data-parallel operations:
        • Lacks CRUD APIs on individual objects.
  22. Comparison to Storm

    • Focus: continuous processing of input streams
    • Storm implements pipelined execution of tasks by “bolts” on incoming data streams.
    • Streams can be distributed to bolts with configurable mappings.
    • Developer controls the number of tasks per bolt.
    • Storm uses a centralized master node and Zookeeper for fault-tolerance.
    • Issues:
      • Managing global state
      • Minimizing data motion
      • Complexity / tuning
  23. Implementing an Example in FinServ

    • Hedge fund tracks a set of hedging strategies:
      • Strategies can cover various market sectors, such as high-tech, automotive, energy, consumer, real estate, etc.
      • Each strategy contains a list of holdings and rules for managing the holdings (such as target allocations).
    • Updates to market data continuously arrive during the trading day.
    • The challenge: update and analyze a large population of hedging strategies to immediately alert traders.
  24. In-Memory Model

    • The IMDG holds hedging strategies as an object-oriented collection.
    • Updates to market data are managed as a series of snapshot objects.
    • The IMDG performs repeated data-parallel analysis on hedging strategies to generate alerts.
    • Merges alerts and feeds to traders in real time.
    • IMDG automatically and dynamically scales its throughput to handle new hedging strategies by adding servers.
  25. Implementing the Analysis

    Step 1: Select all objects using parallel query of strategy objects:
    • Query spec matches data’s object-oriented properties.
    • Selected objects are fed to the analysis engine on each local server.
  26. Java Example: Parallel Query

        public class Portfolio {
            private long id;
            private Set<Stock> longPositions;
            private Set<Stock> shortPositions;
            private double totalValue;
            private Region region;
            private boolean alerted; // alert for trading

            @SossIndexAttribute // query-able property
            public double getTotalValue() {…}

            @SossIndexAttribute // query-able property
            public Region getRegion() {…}

            public Set<Long> evalPositions(MarketSnapshot ms) {…}
        }

        // Query the "portfolios" cache for US portfolios worth more than 1,000,000:
        NamedCache pset = CacheFactory.getCache("portfolios");
        Set<Portfolio> res = pset.queryObjects(Portfolio.class,
            and(greaterThan("totalValue", 1000000),
                equals("region", Region.US)));
  27. Implementing the Analysis

    Step 2: Create parallel methods to update and analyze the queried collection of hedging strategies:
    • “Eval” method applies market snapshot to an instance of a strategy object:
      • Compare to a MapReduce mapper; adds an input parameter.
      • Updates the strategy object’s positions.
      • Analyzes the positions for a deviation from allowed rules.
      • Optionally generates an alert.
    • “Merge” method combines alerts across the collection of strategies:
      • Compare to a MapReduce combiner.
      • Uses binary combining.
      • Is applied globally to the object collection by the IMDG (unlike a MapReduce reducer).
    • Note: both methods access hydrated objects, avoiding the need for CRUD access.
  28. Java Example: Parallel Method Invocation

    • Create method to analyze a queried portfolio and another method to pair-wise merge the result sets of alerted portfolios:

        public class PortfolioAnalysis
                implements Invokable<Portfolio, MarketSnapshot, Set<Long>> {

            public Set<Long> eval(Portfolio p, MarketSnapshot ms) throws InvokeException {
                // update portfolio and return id if alerted:
                return p.evalPositions(ms);
            }

            public Set<Long> merge(Set<Long> set1, Set<Long> set2) throws InvokeException {
                set1.addAll(set2);
                return set1; // merged set of alerted portfolio ids
            }
        }
  29. Java Example: Parallel Method Invocation

    • Run a parallel method invocation on a queried set of portfolios and return set of ids for alerted portfolios:

        NamedCache pset = CacheFactory.getCache("portfolios");
        InvokeResult alertedPortfolios = pset.invoke(
            PortfolioAnalysis.class,
            Portfolio.class,
            and(greaterThan("totalValue", 1000000), // query spec
                equals("region", Region.US)),
            marketSnapshot,                         // parameters
            ... );
        System.out.println("The alerted portfolios are " + alertedPortfolios.getResult());
  30. Running the Analysis

    • IMDG ships user’s code and libraries to its servers.
    • IMDG automatically schedules analysis operations across all grid servers and cores:
      • The analysis runs on all objects selected by the parallel query.
      • Each grid server analyzes its locally stored objects to minimize data motion.
    • Parallel execution ensures fast completion time:
      • IMDG automatically distributes workload across servers/cores.
      • Scaling the IMDG automatically handles larger data sets.
  31. Merging the Results

    • The IMDG automatically merges all analysis results:
      • The IMDG first merges all results within each grid server in parallel.
      • It then merges results across all grid servers to create one combined result.
    • Efficient parallel merge minimizes the delay in combining all results.
    • The IMDG delivers the combined result to the invoking application as one object.
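
The two-level merge described above (within each server, then across servers) can be illustrated with pairwise combining, where each round halves the number of partial results. The sketch below runs sequentially on one machine to show the structure; the class and method names are hypothetical, and a real grid would execute the independent pairs of each round in parallel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.BinaryOperator;

/** Hypothetical sketch of binary (pairwise) merging of partial results. */
class TreeMergeSketch {

    /** Merges partials in rounds of independent pairs; takes about log2(n) rounds for n partials. */
    static <R> R mergePairwise(List<R> partials, BinaryOperator<R> merge) {
        if (partials.isEmpty()) {
            throw new IllegalArgumentException("nothing to merge");
        }
        List<R> level = new ArrayList<>(partials);
        while (level.size() > 1) {
            List<R> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                // An odd element at the end is carried forward to the next round unchanged.
                next.add(i + 1 < level.size()
                        ? merge.apply(level.get(i), level.get(i + 1))
                        : level.get(i));
            }
            level = next;
        }
        return level.get(0);
    }

    /** First merges each server's local results, then merges the per-server results. */
    static <R> R mergeAcrossGrid(List<List<R>> resultsPerServer, BinaryOperator<R> merge) {
        List<R> perServer = new ArrayList<>();
        for (List<R> local : resultsPerServer) {
            perServer.add(mergePairwise(local, merge));
        }
        return mergePairwise(perServer, merge);
    }

    /** Example merge: union of alerted ids, returned as a new set so inputs are not mutated. */
    static Set<Long> union(Set<Long> a, Set<Long> b) {
        Set<Long> out = new TreeSet<>(a);
        out.addAll(b);
        return out;
    }

    static Set<Long> ids(Long... values) {
        return new TreeSet<>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        List<List<Set<Long>>> grid = Arrays.asList(
                Arrays.asList(ids(1L), ids(2L)),       // server 1
                Arrays.asList(ids(2L, 3L)),            // server 2
                Arrays.asList(ids(), ids(7L), ids(3L)) // server 3
        );
        System.out.println(mergeAcrossGrid(grid, TreeMergeSketch::union)); // prints [1, 2, 3, 7]
    }
}
```

In this sketch every server must contribute at least one partial result; a server with no selected objects would need an identity value (here, an empty set) supplied for it.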
  32. Output: Real-Time Alerts

    • In-memory analysis delivers a set of alerts to traders every 300 msec.
    • Enables the trader to examine strategy details in real time.
  33. Sample Performance Results for PMI

    • Measured a similar financial services application (back testing stock trading strategies on stock histories)
    • Hosted IMDG in Amazon EC2 using 75 servers holding 1 TB of stock history data in memory
    • IMDG handled a continuous stream of updates (1.1 GB/s)
    • Results: analyzed 1 TB in 4.1 seconds (250 GB/s) with linear scaling
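
The quoted aggregate rate follows directly from the data size and run time. As a worked check, assuming 1 TB = 1024 GB and an even spread of data across the 75 servers (both assumptions, since the slide does not state them):

```latex
\frac{1\ \text{TB}}{4.1\ \text{s}} = \frac{1024\ \text{GB}}{4.1\ \text{s}} \approx 250\ \text{GB/s},
\qquad
\frac{250\ \text{GB/s}}{75\ \text{servers}} \approx 3.3\ \text{GB/s per server}
```

A per-server scan rate of a few GB/s is consistent with reading data from local memory rather than over the network, which fits the deck's argument about avoiding data motion.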
  34. In-Memory MapReduce

    Benefits:
    • Enables use of standard Hadoop MapReduce for operational intelligence.
    • Accelerates data access by holding data in memory.
    • Analyzes and updates “live” data.
    • Reduces overheads of standard Hadoop distributions:
      • Batch scheduling
      • Disk access
      • Data shuffling
      • Mandatory key sorting
    • Enables new features, e.g.:
      • Global combining, optional sorting
  35. Running MapReduce on an IMDG

    • A Hadoop distribution does not have to be installed unless HDFS is used.
    • The developer starts MapReduce applications from a remote workstation.
    • The IMDG automatically builds a reusable “invocation grid” of JVMs on the grid’s servers for PMI and ships the application’s jars.
    • Results are stored in the IMDG, HDFS, or optionally globally merged and returned to the remote workstation.
  36. Run In-Memory MR with YARN

    • YARN transparently integrates batch and in-memory MapReduce into a single execution framework with shared access to HDFS.
    • For example, IMDG can transparently run Apache Hive in-memory.
    (Figures: Example of ScaleOut hServer with Hortonworks; Example of Hive Running on IMDG.)
  37. Implementing MapReduce

    Run MapReduce as two PMI phases:
    • Data can be input from either the IMDG or an external data source.
    • Works with any input/output format compatible with the Apache distribution.
    • IMDG uses its data-parallel execution engine (PMI) to invoke the mappers and the reducers.
    • Eliminates batch scheduling overhead.
    • Intermediate results are stored within the IMDG.
    • Minimizes data motion between the mappers and reducers.
    • Allows optional sorting.
    • Output of a single reducer/combiner optionally can be globally merged.
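
The idea of running MapReduce as two data-parallel phases with in-memory intermediate results can be sketched with a word count in plain Java. This is an analogy on one JVM, not hServer code: `TwoPhaseMapReduceSketch` and `wordCount` are hypothetical names, and the concurrent map stands in for intermediate storage held in the grid.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical single-JVM sketch: word count as a parallel map phase and a parallel reduce phase. */
class TwoPhaseMapReduceSketch {

    static Map<String, Integer> wordCount(List<String> lines) {
        // Phase 1 (mappers): emit (word, 1) pairs in parallel and group them by key in memory.
        // No sort is performed; grouping is by hash, mirroring the "optional sorting" point.
        Map<String, List<Integer>> intermediate = lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingByConcurrent(
                        word -> word,
                        Collectors.mapping(word -> 1, Collectors.toList())));

        // Phase 2 (reducers): sum the values for each key, again in parallel.
        return intermediate.entrySet().parallelStream()
                .collect(Collectors.toConcurrentMap(
                        entry -> entry.getKey(),
                        entry -> entry.getValue().stream().mapToInt(Integer::intValue).sum()));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = wordCount(Arrays.asList("live data grid", "Live data", "data"));
        System.out.println(new TreeMap<>(counts)); // prints {data=3, grid=1, live=2}
    }
}
```

Because the intermediate map never leaves memory and no batch scheduler sits between the phases, the second phase can start as soon as the first completes, which is the overhead reduction the slide describes.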
  38. Accessing IMDG Data for M/R

    • IMDG adds grid input format for accessing key/value pairs held in the IMDG.
    • MapReduce programs optionally can output results to IMDG with grid output format.
    • Grid Record Reader optimizes access to key/value pairs to eliminate network overhead.
    • Applications can access and update key/value pairs as operational data during analysis.
  39. Optional Caching of HDFS Data

    • IMDG adds Dataset Record Reader (wrapper) to cache HDFS data during program execution.
    • Hadoop automatically retrieves data from IMDG on subsequent runs.
    • Dataset Record Reader stores and retrieves data with minimum network and memory overheads.
    • Tests with Terasort benchmark have demonstrated 11X faster access latency over HDFS without IMDG.
  40. In-Memory Storage Models

    IMDG needs multiple in-memory storage models:
    • Named cache, optimized for rich semantics on large objects:
      • Property-based query
      • Distributed locking
      • Access from remote grids
    • Named map, optimized for efficient storage and bulk analysis (e.g., MapReduce):
      • Highly efficient object storage
      • Pipelined, bulk-access mechanisms
  41. In-Memory Storage Optimizations

    In-Memory Concurrent Map:
    • Stores key/value pairs in chunks.
    • Allows CRUD operations on key/value pairs.
    • Automatically organizes chunks into splits.
    • Uses per-split hash table to access keys and manage multi-valued keys.
    • Stores shuffled data set between mappers and reducers.
    • Pipelines chunks to mappers and from reducers.
    • Optionally uses memory-mapped files to reduce access latency.
    • Provides support for sorting keys.
  42. In-Memory M/R Optimizations

    • MapReduce optimizations:
      • Optional sorting
      • Optional multicast of parameters to mappers
      • Optional O(logN) global combining (avoids single, sequential reducer)
      • Optional HDFS caching
      • Optional reuse of JVMs across jobs
    • Measured performance:
      • Startup times reduced to a few milliseconds
      • Word count benchmark shows 20X speedup.
      • Real-world example shows >40X speedup.
    • Current limitations:
      • No specific security for multi-tenancy
      • Intermediate data must fit in the IMDG
  43. Accelerating Start-Up Times

    • Re-use in-memory context across MapReduce jobs:

        public static void main(String argv[]) throws Exception {
            // Configure and load the invocation grid
            InvocationGrid grid = HServerJob.getInvocationGridBuilder("myGrid").
                // Add JAR files as IG dependencies
                addJar("main-job.jar").
                addJar("first-library.jar").
                // Add classes as IG dependencies
                addClass(MyMapper.class).
                addClass(MyReducer.class).
                // Define custom JVM parameters
                setJVMParameters("-Xms512M -Xmx1024M").
                load();

            // Run 10 jobs on the same invocation grid
            for (int i = 0; i < 10; i++) {
                Configuration conf = new Configuration();
                // The preloaded invocation grid is passed as the parameter to the job
                Job job = new HServerJob(conf, "Job number " + i, false, grid);
                //......Configure the job here.........
                // Run the job
                job.waitForCompletion(true);
            }

            // Unload the invocation grid when we are done
            grid.unload();
        }
  44. Recap

    • Online systems need operational intelligence on “live” data for immediate feedback.
    • Operational intelligence can be implemented using an IMDG integrated with data-parallel analysis.
    • IMDGs track “live” state:
      • Model real-world entities as a highly available object collection.
      • Enable updates to track changes.
      • Use data-parallel computation for immediate feedback with low latency.
      • Can run standard MapReduce.
  45. Parallel Query Example (C#)

    • Mark class properties as indexes for query:

        class Stock {
            [SossIndex]
            public string Ticker { get; set; }
            public decimal TotalShares { get; set; }
            public decimal Price { get; set; }
        }

    • Define a query using these properties:

        NamedCache cache = CacheFactory.GetCache("Stocks");
        var q = from s in cache.QueryObjects<Stock>()
                where s.Ticker == "GOOG" || s.Ticker == "ORCL"
                select s;
        Console.WriteLine("{0} Stocks found", q.Count());
  46. Example of Analysis Code (C#)

    • Create method to analyze each queried stock object:

        // The parameter is named calcParams because "params" is a reserved word in C#.
        static decimal eval(Stock stock, StockCalcParams calcParams) {
            return stock.Price * stock.TotalShares;
        }

    • Create method to pair-wise merge the results:

        static decimal merge(decimal r1, decimal r2) {
            return r1 + r2;
        }
  47. Invoking the Parallel Analysis (C#)

    • Run a parallel method invocation:

        NamedCache cache = CacheFactory.GetCache("Stocks");
        decimal valueOfSelectedStocks =
            (from s in cache.QueryObjects<Stock>()
             where s.Ticker == "GOOG" || s.Ticker == "ORCL"
             select s)
            .Invoke(new StockCalcParams(…),
                    new Func<Stock, StockCalcParams, decimal>(eval))
            .Merge(new Func<decimal, decimal, decimal>(merge));
        Console.WriteLine("The value of selected stocks is {0}", valueOfSelectedStocks);