How to Build Streaming Data Applications: Evaluating the Top Contenders 1

Originally presented at:

British Computer Society (BCS) SPA-287, London, UK, 3 March 2015
http://www.eventbrite.co.uk/e/spa-287-how-to-build-streaming-data-applications-evaluating-the-top-contenders-tickets-15735307729/

VeryFatBoy

March 03, 2015

Transcript

  1. HOW TO BUILD STREAMING DATA APPLICATIONS: EVALUATING THE TOP CONTENDERS

    Akmal B. Chaudhri, about.me/akmalchaudhri
  2. VOLTDB OVERVIEW

    Founded in 2009 by database luminary Mike Stonebraker
    FAST: World Record Cloud Benchmark: YCSB (Yahoo Cloud Serving Benchmark) - 2.4 million tps (transactions per second)
    Other Stonebraker Companies
    Customers
    Technology
    •  In-Memory (but data is durable to disk)
    •  Scale-Out shared-nothing architecture
    •  Reliability and fault tolerance
    •  SQL + Java with ACID
    •  Hadoop and data warehouse integration
    •  Open source and commercially licensed (24x7)
    © 2015 VoltDB PROPRIETARY
  3. VOLTDB BENCHMARK ON AMAZON VIRTUAL AND IBM SOFTLAYER BARE-METAL SERVERS

    •  Yahoo Cloud Serving Benchmark (YCSB) is a popular industry-standard benchmark for cloud databases
    •  AWS – virtualized servers
    •  SoftLayer - bare-metal servers
    •  Workload “B” - 95% reads with 5% updates
    •  Results: best-in-class cloud performance (run in the cloud)!
    •  AWS - 285k tps for 3 nodes scaling linearly to 724k tps for a 12 node cluster
    •  IBM SoftLayer - 1.02 million tps for 3 nodes scaling linearly to 2.4 million tps for a 12 node cluster
    [Charts: SoftLayer and AWS throughput (ops/sec); SoftLayer update and read latency (ms)]
  4. PREDICTION

    All businesses will compete on their ability to make decisions “in the moment” using Fast Data.
  5. FAST DATA SOURCES AND DRIVERS

    Mobile, IoT, Social, Sensors, Logs
    Data is doubling every two years
    •  26 billion connected devices by 2020 (Gartner 2014)
    •  37% of most data will be processed at the edge in milliseconds (Cisco IoT Study 12/11/14)
  6. EVERY COMPANY HAS FAST DATA PROBLEMS

    •  Mobile: billing and rights management, subscriber marketing, etc.
    •  IoT, Energy, Sensor: smart grid/meters, asset tracking & management
    •  Personalized Targeting: ad optimization, audience segmenting
    •  Capital Markets: risk, market data management, customer mgt
    •  Infrastructure: data pipeline, system performance, streaming ETL
    VoltDB Customers: UK Smart Meter
  7. FAST DATA IS A COMPETITIVE ADVANTAGE TODAY!

    Instant insight, instant action, instant awareness
    •  “Event triggered, real-time recommendations based on customer behavior have 10-15 times the response rates than mass marketing”
    •  “We get competitive advantage by analyzing device and user data to create an interactive and personalized consumer experience across all devices.”
    •  “Real time contextual offers increase offer uptake rates by 75% and data revenues by 15%.”
    * VoltDB customers
  8. TRADITIONAL RDBMS

    •  Heavy Overhead
    •  1000s of concurrent versions
    •  Contention for locked records
    •  Contention for latching on lock table
    •  Index bottlenecks
    •  Disk I/O bottlenecks
    •  Architecture limits scaling
  9. Big data ecosystem has several components

    Collect, Explore (Data Science), Analyze, Act (Discoveries/Optimizations)
  10. DATA ARCHITECTURE FOR FAST + BIG DATA

    [Diagram labels: Enterprise Apps; ETL; CRM; ERP; etc.; Data Lake (HDFS, etc.); BIG DATA; SQL on Hadoop; Map Reduce; Exploratory Analytics; BI Reporting; Fast Operational Database; FAST DATA; Export; Ingest / Interactive; Real-time Analytics; Fast Serve Analytics; Decisioning]
  11. VOLTDB AND FAST DATA PIPELINE

    [Diagram labels: Calculations; Serving of Results; Real Time, Per Event, Interactive]
  12. IN THE BIG CORNER

    Systems facilitating exploration and analytics of large collections.
    Example Technologies
    Columnar OLAP warehouses
    Hadoop Ecosystem
    •  MapReduce
    •  Hive, Pig
    •  SQL.next: Impala, Drill, Shark
    Example Applications
    •  User segmentation & pre-scoring
    •  Seasonal trending
    •  Recommendation matrices
    •  Building search indexes
    •  Data Science: statistical clustering, machine learning
  13. IN THE FAST CORNER

    Systems facilitating real time ingest, analytics and decisions against incoming streams of events.
    Example Technologies
    •  Streaming frameworks (e.g. Spark)
    •  Fast OLAP (e.g. HANA)
    •  Fast OLTP (e.g. VoltDB)
    Example Applications
    •  Micro-personalization
    •  Recommendation serving
    •  Alerting/alarming
    •  Operational monitoring
    •  Data enrichment (ETL elimination)
    •  High throughput authorization
    •  Ex: API quota enforcement
  14. TYPICAL FAST DATA QUESTIONS

    [Diagram labels: Hadoop; Volume; SQL / OLAP; Data Science; Fast; Velocity]
    •  Is the fast layer streaming?
        •  It is often more like fast OLTP
    •  How do the pieces communicate?
        •  OLAP analytics from Big -> Fast
        •  New events from Fast -> Big
    •  Where do “analytics” belong?
        •  Analytics per-event: with Fast
        •  Analytics across history: with Big
    •  Are streaming frameworks equivalent?
        •  Traditional SQL CEP (Esper, Streambase)
        •  Tuple DAGs (Storm)
        •  Window processors on Hadoop (Spark)
  15. HOW TO SOLVE IT*

    *With admiring credit to G. Polya
    Considering Data, Considering Processing
    •  What are the types of data to be managed in fast data applications?
    •  How does data flow through fast data applications?
    •  What are the calculations & analytics that are necessary?
  16. Data | Temporality | Examples

    Incoming events | Event Stream | Click stream, tick stream, sensors, metrics
    Real-Time Analytic Results | Persistent (Queryable) | Counters, streaming aggregates, time-series rollups
    Event metadata | Persistent (Look-Ups) | Device version, location, user profiles, point-of-interest data
    OLAP Analytics Used in Real-Time Decisions | Persistent (Look-Ups) | Scoring models, seasonal usage, demographic trends
    Responses/side effects | Event Stream | Policy enforcement decisions, personalization recommendations
    Outgoing events | Event Stream | Enriched, filtered, correlated transform of input feed
  17. SOURCES OF STATE

    1.  Analytics outputs must be query-able.
    2.  “Lookup tables” to create groupings for analytics and to supply enrichment data.
    3.  Session management: grouping, filtering and aggregating create intermediate state.
  18. DATA FLOWS

    Real-time Analytics
    •  Streaming summaries for operations
    •  KPI measurement
    •  Analytics for apps
  19. DATA FLOWS

    Fast Request/Response (and side effects)
    •  Mobile Authorization
    •  Campaign Evaluation
    •  Quota Enforcement
    •  Micro-Personalization
    •  Recommendation Serving
  20. DATA FLOWS

    Data Pipelines
    •  Data enrichment
    •  Sessionization and re-assembly of incoming events
    •  Correlation (by time, location, identity)
    •  Filtering
    [Diagram labels: Pipeline; Data Lake]
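
Sessionization, as listed on the data pipelines slide, can be illustrated with a small Python sketch. It is an illustration only, not VoltDB code: the function name `sessionize`, the `(user, timestamp)` event shape and the 30-minute inactivity gap are all assumptions made for the example.

```python
def sessionize(events, gap_seconds=1800):
    """Group (user, timestamp) events into sessions split by inactivity gaps.

    Assumes events arrive ordered by timestamp (a hypothetical simplification).
    """
    open_sessions = {}  # user -> timestamps of that user's current session
    closed = []         # finished sessions as (user, [timestamps])
    for user, ts in events:
        current = open_sessions.get(user)
        # Close the current session if the user was idle for too long.
        if current is not None and ts - current[-1] > gap_seconds:
            closed.append((user, current))
            current = None
        if current is None:
            current = open_sessions[user] = []
        current.append(ts)
    # Flush sessions that are still open at the end of the input.
    closed.extend(open_sessions.items())
    return closed


events = [("u1", 0), ("u2", 10), ("u2", 20), ("u1", 600), ("u1", 4000)]
sessions = sessionize(events)
```

Here `u1` ends up with two sessions because more than 1800 seconds pass between its second and third events.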
  21. FAST DATA STACK

    Applications, Message Queues, Data Sources
    Ingest, Analyze, Decide
    •  Counters
    •  Aggregations
    •  Time series
    •  Statistics
    •  Store results
    •  Query and recombine
    •  Fast serving
    •  Per-event policy evaluations
    •  Responses (synchronous): authorization, personalization
    •  Side-effects (asynchronous): alerts, alarms
    Export & Pipeline
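
The Ingest, Analyze and Decide stages of the stack can be sketched as a per-event loop in Python. This is a toy illustration under stated assumptions, not VoltDB code: the `FastDataStack` class, the event fields `device` and `value`, and the alert threshold are all hypothetical.

```python
from collections import defaultdict


class FastDataStack:
    """Toy ingest -> analyze -> decide loop (all names are hypothetical)."""

    def __init__(self, alert_threshold):
        self.alert_threshold = alert_threshold
        self.counts = defaultdict(int)    # Analyze: per-device counters
        self.totals = defaultdict(float)  # Analyze: per-device running sums
        self.alerts = []                  # Decide: queued side effects

    def ingest(self, event):
        """Process one event and return a synchronous response."""
        device, value = event["device"], event["value"]
        # Analyze: update counters and aggregates for this event.
        self.counts[device] += 1
        self.totals[device] += value
        mean = self.totals[device] / self.counts[device]
        # Decide: evaluate a per-event policy.
        authorized = value <= self.alert_threshold
        if not authorized:
            self.alerts.append((device, value))  # side effect: raise an alert
        return {"device": device, "authorized": authorized, "mean": mean}


stack = FastDataStack(alert_threshold=100.0)
responses = [stack.ingest(e) for e in (
    {"device": "m1", "value": 40.0},
    {"device": "m1", "value": 60.0},
    {"device": "m2", "value": 150.0},
)]
```

Each call returns the synchronous response, while `stack.alerts` collects the side effects that a real system would deliver asynchronously.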
  22. APACHE-ISH TECHNOLOGY STACK

    Stack: Applications, Message Queues, Data Sources; Ingest; Analyze; Decide; Export & Pipeline
    Functions: Counters; Aggregations; Time series; Statistics; Store results; Query and recombine; Fast serving; Per-event policy evaluations; Responses (synchronous); Side-effects (asynchronous)
    Technologies: Kafka / RabbitMQ; Storm, Flume, Sqoop; Storm + Serving Layer; Spark + Serving Layer; Cassandra, HBase; Hadoop, Message queues
  23. VOLTDB TECHNOLOGY STACK

    Stack: Applications, Message Queues, Data Sources; Ingest; Analyze; Decide; Export & Pipeline
    Functions: Counters; Aggregations; Time series; Statistics; Store results; Query and recombine; Fast serving; Per-event policy evaluations; Responses (synchronous); Side-effects (asynchronous)
    Technologies: Kafka / RabbitMQ; VoltDB; SQL, Java for Analytics; Transactions / ACID; Hadoop, Message queues
  24. STREAM TECHNOLOGY STACK

    Stack: Applications, Message Queues, Data Sources; Ingest; Analyze; Decide; Export & Pipeline
    Functions: Counters; Aggregations; Time series; Statistics; Store results; Query and recombine; Fast serving; Per-event policy evaluations; Responses (synchronous); Side-effects (asynchronous)
  25. OLAP TECHNOLOGY STACK

    Stack: Applications, Message Queues, Data Sources; Ingest; Analyze; Decide; Export & Pipeline
    Functions: Counters; Aggregations; Time series; Statistics; Store results; Query and recombine; Fast serving; Per-event policy evaluations; Responses (synchronous); Side-effects (asynchronous)
  26. STREAMING DATA PIPELINE

    [Diagram labels: Applications & Streams (Logs, Sensors, Meter Readings, IoT, Location); Real-Time Applications; Message Queue; Ingest; Kafka Loader; CSV loaders; C++, C#, PHP, Python, Java (and others); Export; CSV Data; Thrift Messages; JDBC; HTTP; Local File; Extensible Connectors; SQL; Views; Java; Analyze; ACID Txns; State; Decide; Downstream Pipeline; Hadoop; Data Warehouse; Message Queue]
  27. THREE FAST DATA APPLICATION PATTERNS

    •  Real-Time Analytics
        •  Real-time analytics for operations
        •  Real-time KPI measurement
        •  Real-time analytics for apps
    •  Data Pipelines
        •  Streaming data enrichment
        •  Sessionization / re-assembly
        •  Correlation (by time, by location, by id)
        •  Filtering
        •  Pre-aggregation
    •  Fast Request/Response
        •  Mobile Authorization
        •  Campaign Authorization
        •  Fast API Quota Enforcement
        •  Micro-Personalization
        •  Recommendation Serving
  28. VOLTDB: REAL-TIME ANALYTICS

    •  Operational analytics and monitoring
    •  RT analytics enabling user-facing applications
    •  KPI for internal BI/Dashboards
    •  In-memory MPP SQL over ODBC/JDBC
    •  Cheap + correct materialized views for streaming aggregations
    [Diagram labels: VoltDB; Metadata (Dimension table); Session state (Fact table); SQL, Views; Ingest]
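
The idea of a materialized view that is kept up to date as events stream in can be sketched with Python's built-in `sqlite3` module. SQLite has no materialized views, so a trigger stands in for one here; this is an assumption-laden illustration, not VoltDB's view mechanism, and the table, column and trigger names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (device TEXT NOT NULL, reading REAL NOT NULL);

-- Stand-in for a materialized view: one summary row per device.
CREATE TABLE readings_by_device (
    device TEXT PRIMARY KEY,
    n      INTEGER NOT NULL,
    total  REAL NOT NULL
);

-- Maintain the summary incrementally on every insert.
CREATE TRIGGER maintain_view AFTER INSERT ON events
BEGIN
    INSERT OR IGNORE INTO readings_by_device (device, n, total)
        VALUES (NEW.device, 0, 0.0);
    UPDATE readings_by_device
        SET n = n + 1, total = total + NEW.reading
        WHERE device = NEW.device;
END;
""")

conn.executemany(
    "INSERT INTO events (device, reading) VALUES (?, ?)",
    [("m1", 1.5), ("m2", 4.0), ("m1", 2.5)],
)
conn.commit()

# Reading the aggregate is a lookup, not a scan of the event table.
summary = conn.execute(
    "SELECT device, n, total FROM readings_by_device ORDER BY device"
).fetchall()
```

The design point is that the aggregate is updated per event at write time, so queries against it stay cheap however many events have arrived.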
  29. VOLTDB: DATA PIPELINES WITH EXPORT

    •  Filtering (ex: only RFID / iBeacon readings that show change from previous location)
    •  Sessionization
    •  Common version re-writing
    •  Data enrichment
    •  MPP streaming Export
    •  Row data, Thrift messages, CSV
    •  OLAP, HDFS and message queues
    [Diagram labels: VoltDB; Metadata (Dimension table); Session state (Fact table); Export]
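
The filtering example on this slide, keeping only readings that show a change from the previous location, can be sketched in Python. The function name `location_changes` and the `(tag, location)` tuple shape are hypothetical; the sketch keeps the last known location per tag as the only state.

```python
def location_changes(readings):
    """Yield only readings whose location differs from the tag's previous one."""
    last_seen = {}  # tag id -> last known location
    for tag, location in readings:
        if last_seen.get(tag) != location:
            last_seen[tag] = location
            yield tag, location


readings = [
    ("tag-1", "dock"), ("tag-1", "dock"), ("tag-2", "dock"),
    ("tag-1", "aisle-3"), ("tag-2", "dock"), ("tag-1", "dock"),
]
changes = list(location_changes(readings))
```

Repeated readings from the same place are dropped, so downstream systems receive only movements.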
  30. VOLTDB: REQUEST/RESPONSE DECISIONS

    •  Authorization
    •  RT balance checks, quota enforcement
    •  Personalization and Recommendation Serving
    •  Combine pre-score with immediate context
    •  Fully ACID transaction model
    •  Thousands to Millions per second
    •  At less than 5ms latencies
    [Diagram labels: Metadata (Dimension table); Session state (Fact table); ACID Transactions]
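
A quota check of the kind listed above depends on the check and the decrement happening in one transaction. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for a transactional store; it is not VoltDB code, and the `quota` table, the `authorize` function and the user `alice` are hypothetical.

```python
import sqlite3

# isolation_level=None lets the code issue BEGIN/COMMIT explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE quota (user_id TEXT PRIMARY KEY, remaining INTEGER NOT NULL)"
)
conn.execute("INSERT INTO quota VALUES ('alice', 2)")


def authorize(conn, user_id, cost=1):
    """Atomically check and decrement a quota; return True if allowed."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        # The WHERE clause makes check-and-decrement a single statement.
        cur = conn.execute(
            "UPDATE quota SET remaining = remaining - ? "
            "WHERE user_id = ? AND remaining >= ?",
            (cost, user_id, cost),
        )
        allowed = cur.rowcount == 1
        conn.execute("COMMIT")
        return allowed
    except Exception:
        conn.execute("ROLLBACK")
        raise


decisions = [authorize(conn, "alice") for _ in range(3)]
remaining = conn.execute(
    "SELECT remaining FROM quota WHERE user_id = 'alice'"
).fetchone()[0]
```

The third request is refused and the balance never goes negative, which is the property a per-event authorization decision needs.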
  31. VOLTDB V5.0 – ACCELERATING FAST DATA APPLICATION DEVELOPMENT

    •  Hadoop/Big Data Ecosystem Integrations
    •  Fast Data Pipeline Sample Applications
    •  Ease of Database Development (traditional API)
    •  VoltDB Management Center (VMC)
    •  Updated Hortonworks HDP Certification
  32. FAST DATA INTEGRATIONS - IMPORTERS

    •  Kafka Loader
        •  Subscribe to a Kafka topic and insert each message into a VoltDB Table
    •  JDBC Loader
        •  Load a JDBC result set into a VoltDB Table
    •  Vertica UDx
        •  User-defined function to load Vertica result sets into a VoltDB Table
    •  Apache Hive and Apache Pig
        •  Hadoop OutputFormat to load Hive and Pig result sets into VoltDB
  33. FAST DATA INTEGRATIONS - EXPORTERS

    •  HDFS Export
        •  Hadoop export via WebHDFS and HttpFS
    •  HTTP Export
        •  Delivery and Alerting via HTTP post/get
    •  Kafka Export, RabbitMQ Export
        •  Message queue delivery
    •  Export format configurable
        •  Avro, CSV, TSV, more coming…
  34. FAST DATA PIPELINE SAMPLE APPLICATION

    •  Streaming Data, Real-time Analytics
    •  Export to Hadoop
    •  Export to OLAP (Vertica, others)
    •  Place historical decision making intelligence into VoltDB
    •  Closed Loop, via Hive, Pig OutputFormat or Vertica UDx
    •  Download: https://github.com/VoltDB/app-fastdata
    •  And see our blog posts: http://voltdb.com/blog/fast-data-look-voltdb-sample-app
  35. LAMBDA ARCHITECTURE SAMPLE APPLICATION

    HOW MANY UNIQUE USERS INTERACTED WITH MY APP TODAY?
    •  Type of application: Real-time analytics
    •  Demonstrates how to simplify the “Speed Layer”
    •  Using VoltDB, developers can replace both the streaming and the operational data store portions of the speed layer.
    •  Less code, greatly reduced complexity
    •  Improving the Lambda Architecture
    •  Perform real-time analytics AND react, per event, to the incoming data stream
    •  Try it yourself: http://voltdb.com/community/applications
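
The question on this slide, how many unique users interacted with an app today, can be answered per event with a small amount of state. The sketch below is a hypothetical illustration, not the sample application itself: the `UniqueUsers` class is invented for the example, it buckets by UTC day, and it keeps an exact set of user ids where a large deployment might prefer an approximate cardinality estimate.

```python
from collections import defaultdict
from datetime import date, datetime, timezone


class UniqueUsers:
    """Exact count of distinct users per UTC day (hypothetical helper)."""

    def __init__(self):
        self._seen = defaultdict(set)  # day -> set of user ids

    def record(self, user_id, ts):
        """Register one interaction; return today's running unique count."""
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        self._seen[day].add(user_id)
        return len(self._seen[day])

    def count(self, day):
        """Return the number of distinct users seen on the given day."""
        return len(self._seen.get(day, ()))


tracker = UniqueUsers()
base = datetime(2015, 3, 3, 9, 0, tzinfo=timezone.utc).timestamp()
running = [
    tracker.record("u1", base),
    tracker.record("u2", base + 60),
    tracker.record("u1", base + 120),    # repeat visit, not a new user
    tracker.record("u3", base + 86400),  # next day
]
```

Because the count is updated on every event, the answer is available immediately rather than after a batch job.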
  36. VOLTDB MANAGEMENT CENTER (VMC)

    A browser-based management tool for monitoring, examining, and querying a running VoltDB database
  37. VOLTDB DELIVERS SUPERIOR CUSTOMER VALUE

    Customers | Business Value
    Internet Service Provider | Discovered 100% of DoS attacks, and improved response time by 97%
    Communications Service Provider | Improved infrastructure utilization by 150%
    Online Game Analytics | Increased free-to-pay conversion rate by 30%
    Mobile Network Management | Saves $0.5 million/customer installation; unlimited scale in the cloud
    Mobile Ad Service Provider | OpEx – 93% reduction in servers (100 to 7); saved millions in ad budget overages
    Smart Meter, Energy Management | 60 million meters under management, saving millions in efficiency, reduced waste
  38. TRY V5.0 TODAY FOR FREE

    •  VoltDB Enterprise Edition
        •  Production-ready
        •  Fully durable, highly available
        •  Commercial license, fully supported
        •  http://voltdb.com/download/software
    •  Sample apps (in a Docker container)
        •  http://voltdb.com/community/demo
    •  VoltDB Community Edition – open source
        •  http://github.com/voltdb
    VoltDB runs over 6 BILLION transactions/day in production!
  39. Comparing Fast Data Application Platforms: From Simple Streaming to Real-Time Interaction with Decision Making

    Fast data applications have three unique requirements: rapid data ingestion, real-time analytics on streaming data, and per-event real-time decisions
    Ingestion -> Analytics w/o Queries -> Analytics with queries -> Data Enrichment -> Real time Decisions

    Capability | Spark Streaming | Storm | TIBCO Streambase | IBM Streams | Google Dataflow | Amazon Kinesis | VoltDB
    Focus | Micro Batching for Hadoop | Infrastructure for data capture | Complex Event Processing | Stream processing and analytics without queries | Next gen MapReduce in the cloud | Infrastructure for data capture | Stream processing, analytics with queries, and real-time decision making
    Programming Model | Java, Scala | Clojure, Java, Ruby, Python | SQL | Proprietary - Stream Processing Language (SPL) | Java | Java | Java, Relational, SQL, ACID-compliant
    Latency (milliseconds) | > 1,000 milliseconds | milliseconds | 1 millisecond | 1 millisecond | > 2,000 milliseconds | 35-100 milliseconds | 1 millisecond
    Data Capture/Ingestion | Batch | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
    Stateful Operation | X | X | X | X | X | X | ✓
    Ad hoc queries (Interactive SQL) | X | X | X | X | X | X | ✓
    Analytics w/o Queries | ✓ | with add on DDLs | ✓ | ✓ | ✓ | ✓ | ✓
    Analytics with queries and per-event decision making | X | X | X | X | X | X | ✓
    Real time Data Enrichment (using metadata to enrich, denormalize, etc., incoming event streams) | X | X | X | X | X | X | ✓
    Apply OLAP results to real time data stream | X | X | X | ✓ | X | X | ✓
    Scale-out architecture | ✓ | ✓ | X | ✓ | ✓ | ✓ | ✓
    Reliability: ability to persist data | X X X X X ✓
    Fault Tolerant | ✓ ✓ ✓ ✓ ✓ ✓ Requires Zookeeper for HA
    Reliability: ability to persist data | X | X | ✓ | ✓ | X | X | ✓
    Cluster & Resource Management | Need to add-on Zookeeper | Need to add-on Zookeeper; supports YARN | Built-In | Built-In | Built-In | Built-In | Built-In
    Support | Cloudera | Hortonworks | TIBCO | IBM | Google | Amazon | VoltDB
    Output (OLAP Integration) | HDFS, Flume, Kafka, ZeroMQ | HDFS, Kafka, Redis, RDBMS | HDFS, CSV, IBM Netezza, HP Vertica, Microsoft, Oracle, Sybase | HDFS, CSV, IBM Netezza, HP Vertica, Microsoft, Oracle, Sybase | Google | Amazon | HDFS, Kafka, RabbitMQ, CSV, Netezza, HP Vertica, JDBC
    Available as Open Source | Yes, Apache license | Yes, Apache license | X | X | X | X | Yes, AGPL License