How to Build Streaming Data Applications: Evaluating the Top Contenders 1

Originally presented at:

British Computer Society (BCS) SPA-287, London, UK, 3 March 2015
http://www.eventbrite.co.uk/e/spa-287-how-to-build-streaming-data-applications-evaluating-the-top-contenders-tickets-15735307729/

VeryFatBoy

March 03, 2015

Transcript

  1. page
    HOW TO BUILD STREAMING DATA
    APPLICATIONS: EVALUATING THE TOP
    CONTENDERS
    Akmal B. Chaudhri
    about.me/akmalchaudhri

  2. page
    © 2015 VoltDB PROPRIETARY page
    INTRODUCTION

  3. page
    VOLTDB OVERVIEW
    Founded in 2009 by database luminary Mike Stonebraker
    FAST
    World Record Cloud Benchmark:
    YCSB (Yahoo Cloud Serving Benchmark)
    - 2.4 million tps (transactions per second)
    Other Stonebraker Companies
    Customers
    Technology
    •  In-Memory (but data is durable to disk)
    •  Scale-Out shared-nothing architecture
    •  Reliability and fault tolerance
    •  SQL + Java with ACID
    •  Hadoop and data warehouse integration
    •  Open source and commercially licensed (24X7)

  4. page
    VOLTDB BENCHMARK ON AMAZON VIRTUAL AND
    IBM SOFTLAYER BARE-METAL SERVERS
    •  Yahoo Cloud Serving Benchmark (YCSB) is
    a popular industry-standard benchmark for
    cloud databases
    •  AWS – virtualized servers
    •  SoftLayer - bare-metal servers
    •  Workload “B” - 95% reads with 5% updates.
    •  Results: Best in class cloud performance
    (run in the cloud)!
    •  AWS - 285k tps for 3 nodes scaling linearly to
    724k tps for a 12 node cluster
    •  IBM SoftLayer - 1.02 million tps for 3 nodes
    scaling linearly to 2.4 million tps for a 12 node
    cluster
    [Charts: SoftLayer; AWS; SoftLayer: Update and Read Latency. Axes: Latency (ms), Throughput (ops/sec)]

  5. page
    PREDICTION
    All businesses will compete
    on their ability to make
    decisions “in the moment”
    using Fast Data.

  6. page
    FAST DATA SOURCES AND DRIVERS
    Mobile
    IoT
    Social
    Sensors
    Logs
    Data is doubling every two years
    •  26 billion connected devices by
    2020 (Gartner 2014)
    •  37% of most data will be
    processed at the edge in
    milliseconds (Cisco IoT Study 12/11/14)
    Mobile
    IoT

  7. page
    Mobile
    Billing and rights management, subscriber marketing, etc.
    IoT, Energy, Sensor
    Smart grid/meters, asset tracking & management
    Personalized Targeting
    Ad optimization, audience segmenting
    Capital Markets
    Risk, market data management, customer mgt
    Infrastructure
    Data pipeline, system performance, streaming ETL
    EVERY COMPANY HAS FAST DATA PROBLEMS
    UK Smart
    Meter
    VoltDB Customers

  8. page
    FAST DATA IS A COMPETITIVE ADVANTAGE TODAY!
    Instant insight
    Instant action
    Instant awareness
    * VoltDB customers
    “Event triggered, real-time
    recommendations based on
    customer behavior have 10-15
    times the response rates than
    mass marketing”
    “We get competitive advantage
    by analyzing device and user
    data to create an interactive
    and personalized consumer
    experience across all devices.”
    “Real time contextual offers
    increase offer uptake rates by
    75% and data revenues by
    15%.”

  9. page
    TRADITIONAL RDBMS
    •  Heavy Overhead
    •  1000s of concurrent versions
    •  Contention for locked records
    •  Contention for latching on lock table
    •  Index bottlenecks
    •  Disk I/O bottlenecks
    •  Architecture limits scaling

  10. page
    ARCHITECTURE IS IMPORTANT
    Fast data requires
    a different
    architecture.

  11. page
    BIG DATA + FAST DATA

  12. page
    Collect
    Explore
    (Data Science)
    Analyze
    Act
    (Discoveries/Optimizations)
    Big data
    ecosystem has
    several
    components

  13. page
    DATA ARCHITECTURE FOR FAST + BIG DATA
    Enterprise Apps
    ETL
    CRM ERP Etc.
    Data Lake
    (HDFS, etc.)
    BIG DATA
    SQL on
    Hadoop
    Map
    Reduce
    Exploratory
    Analytics
    BI
    Reporting
    Fast Operational
    Database
    FAST DATA
    Export
    Ingest /
    Interactive
    Real-time
    Analytics
    Fast Serve
    Analytics
    Decisioning

  14. page
    Calculations Serving of Results
    Real Time, Per Event, Interactive
    VOLTDB AND FAST DATA PIPELINE

  15. page
    IN THE BIG CORNER
    Systems facilitating exploration and analytics of large collections.
    Example Technologies
    Columnar OLAP warehouses
    Hadoop Ecosystem
    •  MapReduce
    •  Hive, Pig
    •  SQL.next: Impala, Drill, Shark
    Example Applications
    •  User segmentation & pre-scoring
    •  Seasonal trending
    •  Recommendation matrices
    •  Building search indexes
    •  Data Science: statistical clustering,
    machine learning

  16. page
    IN THE FAST CORNER
    Systems facilitating real time ingest, analytics and decisions against
    incoming streams of events.
    Example Technologies
    •  Streaming frameworks (e.g. Spark)
    •  Fast OLAP (e.g. HANA)
    •  Fast OLTP (e.g. VoltDB)
    Example Applications
    •  Micro-personalization
    •  Recommendation serving
    •  Alerting/alarming
    •  Operational monitoring
    •  Data enrichment (ETL elimination)
    •  High throughput authorization
    •  Ex: API quota enforcement

  17. page
    TYPICAL FAST DATA QUESTIONS
    Hadoop
    Volume
    SQL / OLAP
    Data Science
    Fast
    Velocity
    •  Is the fast layer streaming?
    •  It is often more like fast OLTP
    •  How do the pieces communicate?
    •  OLAP analytics from Big -> Fast
    •  New events from Fast -> Big
    •  Where do “analytics” belong?
    •  Analytics per-event: with Fast
    •  Analytics across history: with Big
    •  Are streaming frameworks equivalent?
    •  Traditional SQL CEP (Esper, Streambase)
    •  Tuple DAGs (Storm)
    •  Window processors on Hadoop (Spark)

  18. page
    HOW TO SOLVE IT*
    * With admiring credit to G. Polya
    Considering Data
    Considering Processing
    What are the types of data to be managed in fast data applications?
    How does data flow through fast data applications?
    What are the calculations & analytics that are necessary?

  19. page
    Data | Temporality | Examples
    Incoming events | Event Stream | Click stream, tick stream, sensors, metrics
    Real-Time Analytic Results | Persistent (Queryable) | Counters, streaming aggregates, time-series rollups
    Event metadata | Persistent (Look-Ups) | Device version, location, user profiles, point-of-interest data
    OLAP Analytics Used in Real-Time Decisions | Persistent (Look-Ups) | Scoring models, seasonal usage, demographic trends
    Responses/side effects | Event Stream | Policy enforcement decisions, personalization recommendations
    Outgoing events | Event Stream | Enriched, filtered, correlated transform of input feed

  20. page
    SOURCES OF STATE
    1.  Analytics outputs must be query-able.
    2.  “Lookup tables” to create groupings for analytics
    and to supply enrichment data.
    3.  Session management: grouping, filtering and
    aggregating create intermediate state.
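
Item 2 can be made concrete with a minimal Python sketch. It assumes a hypothetical in-memory lookup table keyed by device ID; the names `DEVICE_METADATA` and `enrich`, and the field names, are illustrative only and do not come from the slides or from any VoltDB API.

```python
# Hypothetical lookup table: device_id -> metadata used for grouping and enrichment.
DEVICE_METADATA = {
    "dev-1": {"region": "London", "model": "A"},
    "dev-2": {"region": "Leeds", "model": "B"},
}


def enrich(event, lookup=DEVICE_METADATA):
    """Return a copy of the event with lookup-table fields merged in.

    Events from unknown devices are passed through unchanged.
    """
    metadata = lookup.get(event["device_id"], {})
    return {**event, **metadata}


enriched = enrich({"device_id": "dev-1", "reading": 42})
print(enriched)
```

The enriched event now carries a `region` field, so a downstream aggregation can group by region without a second lookup.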

  21. page 21
    Considering Data
    Considering Processing
    What are the types of data to be managed in fast data applications?
    How does data flow through fast data applications?
    What are the calculations & analytics that are necessary?

  22. page
    DATA FLOWS
    Real-time Analytics
    •  Streaming summaries for operations
    •  KPI measurement
    •  Analytics for apps
    Real-Time Analytics

  23. page
    DATA FLOWS
    Fast Request/Response (and side effects)
    •  Mobile Authorization
    •  Campaign Evaluation
    •  Quota Enforcement
    •  Micro-Personalization
    •  Recommendation Serving
    Request/
    Response

  24. page
    DATA FLOWS
    Data Pipelines
    •  Data enrichment
    •  Sessionization and re-assembly of incoming events.
    •  Correlation (by time, location, identity)
    •  Filtering
    Pipeline
    Data Lake
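
Sessionization, as listed above, is commonly implemented by splitting each user's events wherever the gap between consecutive events exceeds an inactivity timeout. The following Python sketch shows that approach under stated assumptions: the 30-minute timeout, the `(user_id, timestamp)` event shape, and the function name `sessionize` are hypothetical choices, not taken from the slides.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout; tune per application


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp) events into per-user sessions.

    A new session starts when the gap since the user's previous event
    exceeds `gap` seconds. Events are sorted by timestamp first, so
    out-of-order arrival within the batch is tolerated.
    """
    sessions = defaultdict(list)  # user_id -> list of sessions (lists of timestamps)
    for user_id, ts in sorted(events, key=lambda e: e[1]):
        user_sessions = sessions[user_id]
        if user_sessions and ts - user_sessions[-1][-1] <= gap:
            user_sessions[-1].append(ts)  # continue the current session
        else:
            user_sessions.append([ts])  # start a new session
    return dict(sessions)


events = [("alice", 0), ("bob", 10), ("alice", 600), ("alice", 4000)]
result = sessionize(events)
print(result)
```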

  25. page 25
    Considering Data
    Considering Processing
    What are the types of data to be managed in fast data applications?
    How does data flow through fast data applications?
    What are the calculations & analytics that are necessary?

  26. page 26
    Continuous Query
    Transactional Event
    Evaluation
    Transformation

  27. page
    FAST DATA STACK
    Applications, Message Queues, Data Sources
    Ingest
    Analyze Decide
    •  Counters
    •  Aggregations
    •  Time series
    •  Statistics
    •  Store results
    •  Query and
    recombine
    •  Fast serving
    •  Per-event policy evaluations
    •  Responses (synchronous):
    authorization, personalization
    •  Side-effects (asynchronous): alerts,
    alarms
    Export & Pipeline

  28. page 28
    Applications, Message Queues, Data Sources
    Ingest
    Analyze Decide
    Counters
    Aggregations
    Time series
    Statistics
    Store results
    Query and
    recombine
    Fast serving
    Per-event policy evaluations
    Responses (synchronous)
    Side-effects (asynchronous)
    Export & Pipeline
    APACHE-ISH TECHNOLOGY STACK
    Kafka / RabbitMQ
    Storm, Flume, Sqoop
    Storm +
    Serving Layer
    Spark +
    Serving Layer
    Cassandra,
    HBase
    Hadoop, Message queues

  29. page 29
    Applications, Message Queues, Data Sources
    Ingest
    Analyze Decide
    Counters
    Aggregations
    Time series
    Statistics
    Store results
    Query and
    recombine
    Fast serving
    Per-event policy evaluations
    Responses (synchronous)
    Side-effects (asynchronous)
    Export & Pipeline
    VOLTDB TECHNOLOGY STACK
    Kafka / RabbitMQ
    VoltDB
    SQL, Java for
    Analytics
    Transactions /
    ACID
    Hadoop, Message queues

  30. page 30
    OLTP
    (Transactions First)
    Streaming
    Event Processors
    OLAP
    (Columnar Analytics)

  31. page 31
    Applications, Message Queues, Data Sources
    Ingest
    Analyze Decide
    Counters
    Aggregations
    Time series
    Statistics
    Store results
    Query and
    recombine
    Fast serving
    Per-event policy evaluations
    Responses (synchronous)
    Side-effects (asynchronous)
    Export & Pipeline
    STREAM TECHNOLOGY STACK

  32. page 32
    Applications, Message Queues, Data Sources
    Ingest
    Analyze Decide
    Counters
    Aggregations
    Time series
    Statistics
    Store results
    Query and
    recombine
    Fast serving
    Per-event policy evaluations
    Responses (synchronous)
    Side-effects (asynchronous)
    Export & Pipeline
    OLAP TECHNOLOGY STACK

  33. page
    Applications
    &
    Streams
    Logs, Sensors,
    Meter Readings,
    IoT, Location
    Real-Time
    Applications
    Message Queue
    Ingest
    Kafka Loader
    CSV loaders
    C++, C#, PHP, Python
    Java (and others)
    Export
    CSV Data
    Thrift Messages
    JDBC
    HTTP
    Local File
    Extensible Connectors
    SQL
    Views
    Java
    Analyze
    ACID
    Txns
    State
    Decide
    Downstream
    Pipeline
    Hadoop
    Data Warehouse
    Message Queue
    STREAMING DATA PIPELINE

  34. page
    FAST DATA PATTERNS

  35. page
    THREE FAST DATA APPLICATION PATTERNS
    •  Real-Time Analytics
    •  Real-time analytics for operations
    •  Real-time KPI measurement
    •  Real-time analytics for apps
    •  Data Pipelines
    •  Streaming data enrichment
    •  Sessionization / re-assembly
    •  Correlation (by time, by location, by id)
    •  Filtering
    •  Pre-aggregation
    •  Fast Request/Response
    •  Mobile Authorization
    •  Campaign Authorization
    •  Fast API Quota Enforcement
    •  Micro-Personalization
    •  Recommendation Serving

  36. page
    VOLTDB: REAL-TIME ANALYTICS
    VoltDB
    Metadata
    (Dimension table)
    Session state
    (Fact table)
    •  Operational analytics and
    monitoring
    •  RT analytics enabling user-
    facing applications
    •  KPI for internal BI/Dashboards
    •  In-memory MPP SQL over
    ODBC/JDBC
    •  Cheap + correct materialized
    views for streaming
    aggregations
    SQL, Views
    Ingest
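
The idea behind a materialized view for streaming aggregations is that the aggregate is updated at insert time, so a query reads a small precomputed row instead of rescanning raw events. The Python sketch below imitates that behaviour in memory; the class `MinuteRollup` and its per-minute bucketing are assumptions made for illustration and are not VoltDB's implementation.

```python
from collections import defaultdict


class MinuteRollup:
    """Incrementally maintained per-key, per-minute count and sum."""

    def __init__(self):
        # (key, minute) -> [count, total]
        self._rows = defaultdict(lambda: [0, 0.0])

    def insert(self, key, timestamp, value):
        # Update the aggregate at write time, so reads never rescan raw events.
        minute = int(timestamp // 60)
        row = self._rows[(key, minute)]
        row[0] += 1
        row[1] += value

    def query(self, key, minute):
        # .get() avoids creating an empty row for buckets that were never written.
        count, total = self._rows.get((key, minute), (0, 0.0))
        return {"count": count, "sum": total}


view = MinuteRollup()
for ts, value in [(5, 1.0), (30, 2.5), (65, 4.0)]:
    view.insert("sensor-1", ts, value)

print(view.query("sensor-1", 0))
print(view.query("sensor-1", 1))
```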

  37. page
    VOLTDB: DATA PIPELINES WITH EXPORT
    VoltDB
    Metadata
    (Dimension table)
    Session state
    (Fact table)
    •  Filtering (ex: only RFID /
    iBeacon readings that show
    change from previous
    location).
    •  Sessionization
    •  Common version re-writing
    •  Data enrichment
    •  MPP streaming Export
    •  Row data, Thrift messages, CSV
    •  OLAP, HDFS and message
    queues
    Export
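
The filtering bullet above (keep only readings that show a change from the previous location) could be sketched in Python roughly as follows. The `(tag_id, location)` event shape and the function name `location_changes` are hypothetical.

```python
def location_changes(readings):
    """Yield only readings whose location differs from the tag's previous one.

    `readings` is an iterable of (tag_id, location) pairs in arrival order.
    The first reading for each tag is always passed through.
    """
    last_seen = {}  # tag_id -> most recent location
    for tag_id, location in readings:
        if last_seen.get(tag_id) != location:
            last_seen[tag_id] = location
            yield tag_id, location


readings = [
    ("tag-1", "dock"),
    ("tag-1", "dock"),      # repeat, filtered out
    ("tag-2", "dock"),
    ("tag-1", "aisle-3"),   # tag-1 moved, kept
    ("tag-2", "dock"),      # repeat, filtered out
]
changed = list(location_changes(readings))
print(changed)
```

Only the per-tag last location has to be kept as state, which is what makes this filter cheap to run per event.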

  38. page
    VOLTDB: REQUEST/RESPONSE DECISIONS
    •  Authorization
    •  RT balance checks, quota
    enforcement
    •  Personalization and
    Recommendation Serving
    •  Combine pre-score with
    immediate context
    •  Fully ACID transaction model.
    •  Thousands to Millions per
    second
    •  At less than 5ms latencies
    Metadata
    (Dimension table)
    Session state
    (Fact table)
    ACID Transactions
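
Quota enforcement is a typical per-event decision: each request must read and update a counter in one step, which is why the slide pairs it with ACID transactions. The Python sketch below uses a fixed-window counter as one possible policy; the class `QuotaEnforcer`, its parameters, and the choice of a fixed window are assumptions, and the single-threaded code only stands in for the transactional check-and-update.

```python
class QuotaEnforcer:
    """Fixed-window request quota: at most `limit` calls per key per window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self._counts = {}  # key -> (window_index, calls_in_window)

    def authorize(self, key, now):
        """Check and update the counter in one step; return True if allowed."""
        window = int(now // self.window_seconds)
        stored_window, calls = self._counts.get(key, (window, 0))
        if stored_window != window:
            calls = 0  # a new window has started, so the counter resets
        if calls >= self.limit:
            self._counts[key] = (window, calls)
            return False
        self._counts[key] = (window, calls + 1)
        return True


quota = QuotaEnforcer(limit=2, window_seconds=60)
decisions = [quota.authorize("api-key-1", t) for t in (0, 1, 2, 61)]
print(decisions)
```

The third call is rejected because the limit of two is already used in the first window; the call at `t=61` falls in the next window and is allowed again.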

  39. page
    VOLTDB V5.0

  40. page
    VOLTDB V5.0 – ACCELERATING FAST DATA
    APPLICATION DEVELOPMENT
    •  Hadoop/Big Data Ecosystem Integrations
    •  Fast Data Pipeline Sample Applications
    •  Ease of Database Development (traditional API)
    •  VoltDB Management Center (VMC)
    •  Updated Hortonworks HDP Certification

  41. page
    FAST DATA INTEGRATIONS - IMPORTERS
    •  Kafka Loader
    •  Subscribe to a Kafka topic and insert each message into a VoltDB
    Table
    •  JDBC Loader
    •  Load a JDBC result set into a VoltDB Table
    •  Vertica UDx
    •  User-defined function to load Vertica result sets into a VoltDB
    Table
    •  Apache Hive and Apache Pig
    •  Hadoop OutputFormat to load Hive and Pig result sets into VoltDB
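
The loader pattern described above (take each message from a topic and insert it as a row) can be sketched in Python with standard-library stand-ins. This is not the VoltDB Kafka Loader: the list `topic` stands in for a Kafka topic, an in-memory SQLite table stands in for the VoltDB table, and the two-column CSV message format is an assumption.

```python
import csv
import io
import sqlite3


def load_messages(messages, connection):
    """Insert each CSV-encoded message as one row; return the number inserted."""
    inserted = 0
    for message in messages:
        # Each message is assumed to be a single CSV record: device_id,reading
        device_id, reading = next(csv.reader(io.StringIO(message)))
        connection.execute(
            "INSERT INTO readings (device_id, reading) VALUES (?, ?)",
            (device_id, float(reading)),
        )
        inserted += 1
    connection.commit()
    return inserted


connection = sqlite3.connect(":memory:")  # stand-in for the target table
connection.execute("CREATE TABLE readings (device_id TEXT, reading REAL)")
topic = ["dev-1,20.5", "dev-2,21.0", "dev-1,19.5"]  # stand-in for a topic's messages
count = load_messages(topic, connection)
total = connection.execute("SELECT SUM(reading) FROM readings").fetchone()[0]
print(count, total)
```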

  42. page
    FAST DATA INTEGRATIONS - EXPORTERS
    •  HDFS Export
    •  Hadoop export via WebHDFS and HttpFS
    •  HTTP Export
    •  Delivery and Alerting via HTTP post/get
    •  Kafka Export, RabbitMQ Export
    •  Message queue delivery
    •  Export format configurable
    •  Avro, CSV, TSV, more coming…

  43. page
    FAST DATA PIPELINE SAMPLE APPLICATION
    •  Streaming Data, Real-time Analytics
    •  Export to Hadoop
    •  Export to OLAP (Vertica, others)
    •  Place historical decision making intelligence into VoltDB
    •  Closed Loop, via Hive, Pig OutputFormat or Vertica UDx
    •  Download: https://github.com/VoltDB/app-fastdata
    •  And see our blog posts:
    http://voltdb.com/blog/fast-data-look-voltdb-sample-app

  44. page
    LAMBDA ARCHITECTURE SAMPLE APPLICATION
    •  Type of application: Real-time analytics
    •  Demonstrates how to simplify the “Speed
    Layer”
    •  Using VoltDB, developers can replace both the
    streaming and the operational data store portions of
    the speed layer.
    •  Less code, greatly reduced complexity
    •  Improving the Lambda Architecture
    •  Perform real-time analytics AND react, per event, to
    the incoming data stream
    •  Try it yourself: http://voltdb.com/community/applications
    HOW MANY UNIQUE
    USERS INTERACTED WITH
    MY APP TODAY?
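
The question on this slide can be answered per event by keeping a set of user IDs for each day. The Python sketch below does an exact count; the class `DailyUniqueUsers` and the use of UTC day boundaries are assumptions made for illustration, and the linked sample application may use a different technique.

```python
from collections import defaultdict
from datetime import datetime, timezone


class DailyUniqueUsers:
    """Exact count of distinct users per UTC day, updated per event."""

    def __init__(self):
        self._users_by_day = defaultdict(set)  # date -> set of user ids

    def record(self, user_id, epoch_seconds):
        day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).date()
        self._users_by_day[day].add(user_id)

    def count(self, day):
        return len(self._users_by_day.get(day, ()))


tracker = DailyUniqueUsers()
for user_id, ts in [("u1", 0), ("u2", 3600), ("u1", 7200), ("u3", 86400)]:
    tracker.record(user_id, ts)

first_day = datetime(1970, 1, 1, tzinfo=timezone.utc).date()
print(tracker.count(first_day))
```

An exact set grows with the number of users; a cardinality estimator such as HyperLogLog is a common substitute when memory matters more than exactness.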

  45. page
    VOLTDB MANAGEMENT CENTER (VMC)
    A browser-based management tool for monitoring, examining, and querying a running VoltDB database

  46. page
    UPDATED HORTONWORKS CERTIFICATION

  47. page
    CUSTOMER CASE STUDIES

  48. page
    VOLTDB DELIVERS SUPERIOR CUSTOMER VALUE
    Customers | Business Value
    Smart Meter, Energy Management | 60 Million meters under management, saving millions in efficiency, reduced waste
    Internet Service Provider | Discover 100% of DoS attacks, and improved response time by 97%
    Communications Service Provider | Improved infrastructure utilization by 150%
    Online Game Analytics | Increased free-to-pay conversion rate by 30%
    Mobile Network Management | Saves $0.5 million/customer installation; unlimited scale in the cloud
    Mobile Ad Service Provider | OpEx – 93% reduction in servers (100 to 7); saved millions in ad budget overages

  49. page 49

  50. page
    TRY V5.0 TODAY FOR FREE
    •  VoltDB Enterprise Edition
    •  Production-ready
    •  Fully durable, highly available
    •  Commercial license, fully supported
    •  http://voltdb.com/download/software
    •  Sample apps (in a Docker container)
    •  http://voltdb.com/community/demo
    •  VoltDB Community Edition – open source
    •  http://github.com/voltdb
    VoltDB runs over 6 BILLION transactions/day in production!

  51. Comparing Fast Data Application Platforms: From Simple Streaming to Real-Time Interaction with Decision Making
    Fast data applications three unique requirements: rapid data ingestion, real-time analytics on streaming data, and per event real-time decisions
    Ingestion -> Analytics w/o Queries -> Analytics with queries -> Data Enrichment -> Real time Decisions
    Capability | Spark Streaming | Storm | TIBCO Streambase | IBM Streams | Google Dataflow | Amazon Kinesis | VoltDB
    Focus | Micro Batching for Hadoop | Infrastructure for data capture | Complex Event Processing | Stream processing and analytics without queries | Next gen MapReduce in the cloud | Infrastructure for data capture | Stream processing, analytics with queries, and real-time decision making
    Programming Model | Java, Scala | Clojure, Java, Ruby, Python | SQL | Proprietary - Stream Processing Language (SPL) | Java | Java | Java, Relational, SQL, ACID-compliant
    Latency (milliseconds) | > 1,000 milliseconds | milliseconds | 1 millisecond | 1 millisecond | > 2,000 milliseconds | 35-100 milliseconds | 1 millisecond
    Data Capture/Ingestion | Batch | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
    Stateful Operation | X | X | X | X | X | X | ✓
    Ad hoc queries / Interactive SQL | X | X | X | X | X | X | ✓
    Analytics w/o Queries: ✓ with add on DDLs ✓ ✓ ✓ ✓ ✓
    Analytics with queries and per-event decision making | X | X | X | X | X | X | ✓
    Real time Data Enrichment (using metadata to enrich, denormalize, etc., incoming event streams) | X | X | X | X | X | X | ✓
    Apply OLAP results to real time data stream | X | X | X | ✓ | X | X | ✓
    Scale-out architecture | ✓ | ✓ | X | ✓ | ✓ | ✓ | ✓
    Reliability: ability to persist data: X X X X X ✓
    Fault Tolerant: ✓ ✓ ✓ ✓ ✓ ✓ Requires Zookeeper for HA
    Reliability: ability to persist data | X | X | ✓ | ✓ | X | X | ✓
    Cluster & Resource Management | Need to add-on Zookeeper | Need to add-on Zookeeper; supports YARN | Built-In | Built-In | Built-In | Built-In | Built-In
    Support | Cloudera | Hortonworks | TIBCO | IBM | Google | Amazon | VoltDB
    Output (OLAP Integration) | HDFS, Flume, Kafka, ZeroMQ | HDFS, Kafka, Redis, RDBMS | HDFS, CSV, IBM Netezza, HP Vertica, Microsoft, Oracle, Sybase | HDFS, CSV, IBM Netezza, HP Vertica, Microsoft, Oracle, Sybase | Google | Amazon | HDFS, Kafka, RabbitMQ, CSV, Netezza, HP Vertica, JDBC
    Available as Open Source | Yes, Apache license | Yes, Apache license | X | X | X | X | Yes, AGPL License
