Real-time Big Data ingestion and querying of aggregated data @ Javantura 2016 (Druid, Apache Kafka)

Viewing data in real time in a Big Data environment is becoming more and more challenging. Classical transactional systems and data replication run into more obstacles in a Big Data environment. One of those obstacles is the large latency between the time data enters the system and the time it is ready for querying. This presentation shows the path that Infobip has chosen to try to achieve real time in a Big Data environment.

Keywords: Lambda architecture, Redis.io, Apache Kafka, druid.io

https://javantura.com/sessions/#rtbigdata

Davor Poldrugo

February 20, 2016

Transcript

  1. Davor Poldrugo @ Infobip
    Software engineer with interest in backend development, high availability and distributed systems. https://about.me/davor.poldrugo
  2. Our services
    • MOBILE SERVICES: Professional SMS, number validation, voice, USSD, mobile payments; deeply integrated into the telecoms world
    • ENTERPRISE PRODUCTS for businesses of any scale and need (mGate, fully-featured web apps, SMS authentication solutions, reseller solutions...)
    • APP ENGAGEMENT PLATFORM based on advanced push notifications
    • APIs and protocols for EASY INTEGRATION: xml, soap/rest, smpp, http, json
    • Full 24/7 TECHNICAL SUPPORT regardless of location
    • QUALITY guaranteed by a strict SLA
  3. Presentation overview
    • Dictionary
    • The real-time use case and the challenges (because there are no problems ;)
    • The platform and how we got here
    • Our path towards real-time data
    • Architecture and component overview
    • Numbers and conclusion
  4. Dictionary
    REAL-TIME noun
    “the actual time during which something takes place <the computer may partly analyze the data in real time (as it comes in) — R. H. March> <chatted online in real time> – real-time adjective”
    http://www.merriam-webster.com/dictionary/real%20time
    BIG DATA noun
    “an accumulation of data that is too large and complex for processing by traditional database management tools”
    http://www.merriam-webster.com/dictionary/big%20data
  5. Dictionary
    INGEST verb
    “to take (something, such as food) into your body : to swallow (something) — sometimes used figuratively She ingested [=absorbed] large amounts of information very quickly.”
    http://www.learnersdictionary.com/definition/ingest
    I'll use this figurative meaning... in the context of data ingestion.
  6. The real-time use case and the challenges
    • Our new web requirement: provide real-time data and graphs of traffic
    • SMS Campaigns Web application: near real-time
    • But we wanted real-time!
  7. The platform and how we got here
    • There was only one node – a monolith
    • One transactional database (OLTP)
    • Traffic increased
    • After a while the database began to be a bottleneck
    • Then we introduced multiple transactional databases
    • Then multiple monolith nodes were introduced – one per database
    • Then load balancers were needed
  8. The platform and how we got here
    • After that, querying became complex:
      – when one or more databases are down for maintenance, data from that DB is missing
      – queries had to span multiple databases and the results then had to be joined
      – aggregate reports became a problem (complexity, availability)
      – aggregation databases were introduced (ETL) that pulled from the transactional databases
    • In the meantime we split our monolithic node into many microservice nodes (IpCore, Billing, Contacts, Campaigns, ...)
    • As traffic increased, non-transactional queries (apps, reports) became a problem: throughput decreased
  9. The platform and how we got here
    • Our Database Team introduced GREEN – our ODS/DWH
      – named after the color of the pencil used to draw on the board ;)
      – Near real-time ETL (for traffic tables with 150+ columns)
      – Centralized reporting
      – Decreased workload on the transactional databases
      – Throughput increase of our core nodes (IpCore)
      – Specialized indexes
      – Specialized aggregations
      – But still... near real-time...
      – 1 to 60 minutes out of sync with the transactional databases (depending on the load)
  10. Our path towards real-time data
    • GREEN ODS/DWH provided an abstract solution for all our traffic data but was not REAL-TIME
    • GREEN consists of big hardware – it scales vertically
    • This approach tries to solve particular REAL-TIME use cases one by one – not a silver bullet!
    • Because REAL-TIME isn't always needed
    • Resources are limited
    • The path towards horizontal scalability
  11. Our path towards real-time data
    Lambda architecture ( http://lambda-architecture.net/ )
    1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
    2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
    3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
    4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
    5. Any incoming query can be answered by merging results from batch views and real-time views.
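Step 5 of the generic lambda architecture can be illustrated with a minimal Python sketch. It is not taken from the talk: the function name `merge_views` and the sample counts are hypothetical, and it assumes the real-time view only covers data that the batch view has not absorbed yet (otherwise messages would be counted twice).

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Combine pre-computed batch totals with recent real-time totals, key by key."""
    merged = Counter(batch_view)
    merged.update(realtime_view)  # Counter.update adds counts instead of overwriting
    return dict(merged)


# Hypothetical per-campaign message counts.
batch_view = {29680: 999_000, 29681: 500}   # pre-computed by the batch layer
realtime_view = {29680: 1_000, 29682: 42}   # recent data only, from the speed layer

totals = merge_views(batch_view, realtime_view)
print(totals)
```

A campaign present in both views gets the sum of both counts, while a campaign seen by only one layer keeps that layer's count.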
  12. Our path towards real-time data
    Know thyself! Adapt lambda architecture to fit your needs!
    [Diagram labels: Messaging Cloud Apps, IpCore (Core Message Processing), Transactional Databases (OLTP), Message Event, Ingest point, GREEN DB ODS/DWH (newly proclaimed BATCH/SERVING LAYER), REAL-TIME LAYER, QUERY LAYER (queries REAL-TIME OR BATCH)]
  13. Architecture and component overview
    [Diagram labels: Messaging Cloud Apps, Message Event, Billing ingest point, IpCore ingest point, REAL-TIME LAYER, Data Ingestion Service, Process Message, Process Delta, Pairing and composing a new message, Kafka cluster, Druid cluster]
  14. Architecture and component overview
    [Diagram labels: REAL-TIME LAYER, Data Ingestion Service, Kafka cluster, Druid cluster]
    {
      "sendDateTime":"2016-02-19T12:07:47Z",
      "campaignId":29680,
      "currencyId":2,
      "currencyHNBCode":"EUR",
      "currencySymbol":"€",
      "countDelta":1,
      "priceDelta":0.02
    }
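The message above carries deltas (`countDelta`, `priceDelta`) rather than absolute totals, which means a consumer can obtain campaign totals simply by summing. The following Python sketch shows that idea under stated assumptions: `apply_deltas` and the output keys `count` and `price` are hypothetical names, and the in-memory dictionary only stands in for whatever store actually does the aggregation.

```python
import json
from collections import defaultdict


def apply_deltas(raw_events):
    """Sum countDelta and priceDelta per campaignId (hypothetical helper)."""
    totals = defaultdict(lambda: {"count": 0, "price": 0.0})
    for raw in raw_events:
        event = json.loads(raw)
        entry = totals[event["campaignId"]]
        entry["count"] += event["countDelta"]
        entry["price"] += event["priceDelta"]
    return dict(totals)


# The sample event from the slide, repeated three times for illustration.
sample = (
    '{"sendDateTime":"2016-02-19T12:07:47Z","campaignId":29680,'
    '"currencyId":2,"currencyHNBCode":"EUR","currencySymbol":"€",'
    '"countDelta":1,"priceDelta":0.02}'
)
totals = apply_deltas([sample] * 3)
print(totals[29680]["count"])
```

Because each event is an increment, events can be aggregated in any order, which fits sum-style aggregations.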
  15. Architecture and component overview
    [Diagram labels: Messaging Cloud Apps, QUERY LAYER, Data Query Service, Is realtime? (TRUE / FALSE), REAL-TIME LAYER, Data Ingestion Service, Kafka cluster, Druid cluster, BATCH LAYER, GREEN DB ODS/DWH]
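The "Is realtime?" decision in the query layer can be sketched as a simple dispatch. This is an illustration only: it assumes the TRUE branch leads to the Druid cluster and the FALSE branch to GREEN, and the names `choose_backend`, `run_query` and the two stub backends are hypothetical.

```python
def choose_backend(is_realtime):
    """Pick the layer that should answer the query (assumed mapping)."""
    return "druid" if is_realtime else "green"


def run_query(campaign_id, is_realtime, backends):
    """Route the query to the chosen backend and return its answer."""
    backend = backends[choose_backend(is_realtime)]
    return backend(campaign_id)


# Stub backends standing in for the real Druid and GREEN clients.
backends = {
    "druid": lambda campaign_id: {"source": "REAL-TIME LAYER", "campaignId": campaign_id},
    "green": lambda campaign_id: {"source": "BATCH LAYER", "campaignId": campaign_id},
}

print(run_query(29680, True, backends))
print(run_query(29680, False, backends))
```

Note that in this adaptation a query goes to one layer or the other, rather than merging results from both as in the generic lambda architecture.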
  16. Architecture and component overview
    [Diagram labels: REAL-TIME LAYER, Druid cluster, QUERY LAYER, Data Query Service]
    Request to Druid
    POST /druid/v2 HTTP/1.1
    Host: druid-broker-node:8080
    Content-Type: application/json

    {
      "queryType": "groupBy",
      "dataSource": "campaign-totals-v2",
      "granularity": "all",
      "intervals": [ "2012-01-01T00:00:00.000/2100-01-01T00:00:00.000" ],
      "dimensions": ["campaignId", "currencyId", "currencySymbol", "currencyHNBCode"],
      "filter": { "type": "selector", "dimension": "campaignId", "value": 29680 },
      "aggregations": [
        { "type": "longSum", "name": "totalCountSum", "fieldName": "totalCount" },
        { "type": "doubleSum", "name": "totalPriceSum", "fieldName": "price" }
      ]
    }
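A client could assemble and send the request above roughly as follows. This is a sketch using only the Python standard library: the query body, host, port and path are copied from the slide, while the function names `build_campaign_totals_query` and `post_query` are hypothetical. The example only builds the query; `post_query` would need a reachable broker.

```python
import json
import urllib.request

# Broker host, port and path as shown on the slide.
BROKER_URL = "http://druid-broker-node:8080/druid/v2"


def build_campaign_totals_query(campaign_id):
    """Build the groupBy query from the slide for one campaign."""
    return {
        "queryType": "groupBy",
        "dataSource": "campaign-totals-v2",
        "granularity": "all",
        "intervals": ["2012-01-01T00:00:00.000/2100-01-01T00:00:00.000"],
        "dimensions": ["campaignId", "currencyId", "currencySymbol", "currencyHNBCode"],
        "filter": {"type": "selector", "dimension": "campaignId", "value": campaign_id},
        "aggregations": [
            {"type": "longSum", "name": "totalCountSum", "fieldName": "totalCount"},
            {"type": "doubleSum", "name": "totalPriceSum", "fieldName": "price"},
        ],
    }


def post_query(query, url=BROKER_URL):
    """POST the query as JSON and decode the JSON response (requires a live broker)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


query = build_campaign_totals_query(29680)
print(json.dumps(query, indent=2))
```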
  17. Architecture and component overview
    [Diagram labels: REAL-TIME LAYER, Druid cluster, QUERY LAYER, Data Query Service]
    Response from Druid
    [
      {
        "version": "v1",
        "timestamp": "2012-01-01T00:00:00.000Z",
        "event": {
          "totalCountSum": 1000000,
          "currencyid": "2",
          "totalPriceSum": 20000,
          "currencysymbol": "€",
          "currencyhnbcode": "EUR",
          "campaignid": "29680"
        }
      }
    ]
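The response shown above can be turned into per-campaign figures with a few lines of Python. This is a sketch, not the Data Query Service's actual code: `summarize` and its output keys are hypothetical. It relies on two details visible in the sample response: the dimension keys come back lowercased (`campaignid`) and their values are strings.

```python
import json

# The sample response from the slide.
RESPONSE = """
[ { "version": "v1",
    "timestamp": "2012-01-01T00:00:00.000Z",
    "event": { "totalCountSum": 1000000, "currencyid": "2",
               "totalPriceSum": 20000, "currencysymbol": "€",
               "currencyhnbcode": "EUR", "campaignid": "29680" } } ]
"""


def summarize(rows):
    """Extract totals per result row and derive the average price per message."""
    summary = []
    for row in rows:
        event = row["event"]
        count = event["totalCountSum"]
        price = event["totalPriceSum"]
        summary.append({
            "campaignId": int(event["campaignid"]),  # dimension values are strings
            "currency": event["currencyhnbcode"],
            "count": count,
            "price": price,
            "avgPrice": price / count if count else None,
        })
    return summary


summary = summarize(json.loads(RESPONSE))
print(summary[0])
```

For the sample data the average works out to 20000 / 1000000 = 0.02 per message, which matches the `priceDelta` of the single event shown earlier.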
  18. Architecture and component overview
    KAFKA - https://kafka.apache.org/
    • Kafka maintains feeds of messages in categories called topics
    • A distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.
    FEATURES
    • two messaging models incorporated in an abstraction called consumer group (group id): queue and publish-subscribe
      – queue - a pool of consumers may read from a server and each message goes to one of them
      – publish-subscribe - the message is broadcast to all consumers
    • constant performance with respect to data size
    • replay – all messages are stored and can be accessed with a sequential id number called the offset
    [Diagram labels: REAL-TIME LAYER, Data Ingestion Service, Kafka cluster, Druid cluster]
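The consumer-group behaviour described above can be demonstrated with a toy in-memory model. This is plain Python and deliberately not the Kafka client API: `ToyTopic` and `deliver` are hypothetical names, the topic has a single log, and messages are spread over a group's consumers round-robin by offset, whereas real Kafka assigns whole partitions to consumers.

```python
class ToyTopic:
    """In-memory stand-in for a topic: an append-only log addressed by offset."""

    def __init__(self):
        self.log = []

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1  # the offset of the stored message

    def read_from(self, offset):
        """Replay: return (offset, message) pairs starting at the given offset."""
        return list(enumerate(self.log))[offset:]


def deliver(topic, groups):
    """Deliver every message to each group, and within a group to exactly one consumer."""
    received = {name: [] for members in groups.values() for name in members}
    for members in groups.values():
        for offset, message in topic.read_from(0):
            consumer = members[offset % len(members)]
            received[consumer].append(message)
    return received


topic = ToyTopic()
for text in ["m0", "m1", "m2", "m3"]:
    topic.append(text)

# Two consumers share one group (queue); a second group sees everything (publish-subscribe).
groups = {"billing": ["b1", "b2"], "reports": ["r1"]}
received = deliver(topic, groups)
replayed = topic.read_from(2)
print(received)
print(replayed)
```

Within the `billing` group each message reaches only one of `b1` and `b2`, while `r1`, alone in its own group, receives all four messages. Reading from offset 2 replays the last two messages without affecting anyone else.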
  19. Architecture and component overview
    DRUID - http://druid.io/
    Druid is a fast column-oriented distributed data store.
    • Real-time Streams: Druid supports streaming data ingestion and offers insights on events immediately after they occur. Retain events indefinitely and unify real-time and historical views.
    • Sub-Second Queries: Druid supports fast aggregations and sub-second OLAP queries.
    • Scalable to Petabytes: Existing Druid clusters have scaled to petabytes of data and trillions of events, ingesting millions of events every second. Druid is extremely cost effective, even at scale.
    • Deploy Anywhere: Druid runs on commodity hardware. Deploy it in the cloud or on-premise. Integrate with existing data systems such as Hadoop, Spark, Kafka, Storm, and Samza.
    [Diagram labels: REAL-TIME LAYER, Data Ingestion Service, Kafka cluster, Druid cluster]
  20. Numbers and conclusion
    Data pipeline                                     Max. throughput (msg/s)
    Ingest points → Data Ingestion Service            7700
    Billing ingest point → Data Ingestion Service     5500
    IpCore ingest point → Data Ingestion Service      2200
    Data Ingestion Service → Kafka                    2130
    Druid firehose pull and aggregate from Kafka      29000
    Real-time! <2 sec delay
  21. Numbers and conclusion
    PROBLEMS / CHALLENGES ;)
    • Added complexity to the flow
      – Maintenance of “ingest point” code
      – Maintenance of Data Ingestion Service
      – Operational knowledge of Kafka / Druid
    • Scaling Druid – problems with “Druid realtime nodes” and Kafka topics with multiple partitions
    • Druid: exactly-once semantics are not guaranteed with real-time ingestion in Druid
      – but we didn't have problems with our configuration
      – definitive solution: Druid batch ingestion using Tranquility