Real-time Big Data ingestion and querying of aggregated data @ Infobip DevDays 2016 (Druid, Apache Kafka)

www.infobip.com REAL-TIME BIG DATA INGESTION AND QUERYING OF AGGREGATED DATA
Davor Poldrugo software engineer

Davor Poldrugo @ Infobip Software engineer with interest in backend
development, high availability and distributed systems. Team BC https://about.me/davor.poldrugo

Presentation overview
• Dictionary
• The real-time use case and the challenges (because there are no problems ;)
• The state of the affairs
• Our path towards real-time data
• Architecture and component overview
• Druid design
• Numbers and conclusion

Dictionary REAL-TIME noun “the actual time during which something takes
place <the computer may partly analyze the data in real time (as it comes in) — R. H. March> <chatted online in real time> – real-time adjective” http://www.merriam-webster.com/dictionary/real%20time BIG DATA noun “an accumulation of data that is too large and complex for processing by traditional database management tools” http://www.merriam-webster.com/dictionary/big%20data

Dictionary INGEST verb “to take (something, such as food) into
your body : to swallow (something) — sometimes used figuratively She ingested [=absorbed] large amounts of information very quickly.” http://www.learnersdictionary.com/definition/ingest I'll use this figurative meaning... in context of data ingestion.

The real-time use case and the challenges
• Our new web requirement: provide real-time data and graphs of traffic
• SMS Campaigns Web application (near real-time)
• But we wanted real-time!

The real-time use case and the challenges
Payload:
{ "accountId": 1, "campaignId": 29680, "text": "Single campaign message", "…": "..." }

The state of the affairs
[Diagram: Upload Tool, Customer Portal (CUP), IpCore Node 1 to IpCore Node X, IpCore DB 1 to IpCore DB X, and GREEN, connected by PUSH and PULL arrows]

The state of the affairs
[Diagram: Upload Tool, Customer Portal (CUP), IpCore Node X, GREEN, Customer Portal (CUP)]
• t_green_delay ~ 1-60 min
• query latency < 100 ms

Our path towards real-time data
• GREEN ODS/DWH provides a solution for all our traffic data but is not in REAL-TIME
• GREEN consists of big hardware – scales vertically
• This approach tries to solve a particular REAL-TIME use case – one by one – not a silver bullet!
• Because REAL-TIME isn't always needed
• Resources are limited
• The path towards horizontal scalability

Our path towards real-time data
Lambda architecture ( http://lambda-architecture.net/ )
1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
5. Any incoming query can be answered by merging results from batch views and real-time views.
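Step 5 of the list above can be sketched as a merge of two pre-aggregated views. This is a minimal Python illustration under stated assumptions, not the implementation from the talk: the view shape (a dict from campaign id to a count), the sample numbers, and the name `merge_views` are all hypothetical.

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Combine totals from a batch view with totals from a real-time view.

    Assumption: the two views cover disjoint time ranges, so adding the
    counts per key does not double count any event.
    """
    merged = Counter(batch_view)
    merged.update(realtime_view)  # Counter.update adds counts key by key
    return dict(merged)


# Hypothetical totals per campaign id.
batch_view = {29680: 999_000, 29681: 120}   # older data, from the batch layer
realtime_view = {29680: 1_000, 29682: 7}    # recent data, from the speed layer
totals = merge_views(batch_view, realtime_view)
```

The merge is only valid while the two layers do not overlap in time; once the batch layer catches up, the corresponding real-time results have to be dropped.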

Our path towards real-time data
Know thyself! Adapt lambda architecture to fit your needs!
[Diagram components: Messaging Cloud, IpCore (Core Message Processing) nodes, Transactional Databases (OLTP), Payloads, ingest points, GREEN DB ODS DWH (newly proclaimed BATCH/SERVING LAYER), REAL-TIME LAYER, QUERY LAYER (queries REAL-TIME OR BATCH), Messaging Cloud App, Customer Portal (CUP) ...]

Architecture and component overview
[Diagram components: Messaging Cloud, Payloads, Billing ingest point, IpCore ingest point, and the REAL-TIME LAYER with the Data Ingestion Service (Process Message; Process Delta; Pairing and composing a new message), Kafka cluster, Druid cluster ...]

Architecture and component overview
[Diagram: REAL-TIME LAYER with Data Ingestion Service, Kafka cluster, Druid cluster]
Example message in the real-time layer:
{
  "sendDateTime": "2016-02-19T12:07:47Z",
  "campaignId": 29680,
  "currencyId": 2,
  "currencyHNBCode": "EUR",
  "currencySymbol": "€",
  "countDelta": 1,
  "priceDelta": 0.02
}

Architecture and component overview
[Diagram components: Messaging Cloud Apps, QUERY LAYER (Data Query Service) with the decision "Is realtime?" and its TRUE and FALSE branches, REAL-TIME LAYER (Data Ingestion Service, Kafka cluster, Druid cluster), BATCH LAYER (GREEN DB ODS DWH)]
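The "Is realtime?" decision in the query layer can be sketched as a small router. This is a hedged Python illustration, not the actual Data Query Service: it assumes that TRUE routes to the real-time layer (Druid) and FALSE to the batch layer (GREEN), and the function names and stub backends are hypothetical.

```python
from typing import Callable, Dict

Totals = Dict[str, float]


def make_query_service(
    query_realtime: Callable[[int], Totals],
    query_batch: Callable[[int], Totals],
) -> Callable[[int, bool], Totals]:
    """Return a function that picks one backend per request.

    Unlike the textbook lambda architecture, nothing is merged here:
    a query goes to the real-time layer OR to the batch layer.
    """

    def campaign_totals(campaign_id: int, is_realtime: bool) -> Totals:
        backend = query_realtime if is_realtime else query_batch
        return backend(campaign_id)

    return campaign_totals


# Stub backends standing in for Druid and GREEN (hypothetical data).
def druid_stub(campaign_id: int) -> Totals:
    return {"source": "druid", "campaignId": campaign_id}


def green_stub(campaign_id: int) -> Totals:
    return {"source": "green", "campaignId": campaign_id}


campaign_totals = make_query_service(druid_stub, green_stub)
realtime_answer = campaign_totals(29680, True)
batch_answer = campaign_totals(29680, False)
```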

Architecture and component overview
[Diagram: QUERY LAYER (Data Query Service) and REAL-TIME LAYER (Druid cluster)]
Request to Druid:
POST /druid/v2 HTTP/1.1
Host: druid-broker-node:8080
Content-Type: application/json
{
  "queryType": "groupBy",
  "dataSource": "campaign-totals-v2",
  "granularity": "all",
  "intervals": [ "2012-01-01T00:00:00.000/2100-01-01T00:00:00.000" ],
  "dimensions": ["campaignId", "currencyId", "currencySymbol", "currencyHNBCode"],
  "filter": { "type": "selector", "dimension": "campaignId", "value": 29680 },
  "aggregations": [
    { "type": "longSum", "name": "totalCountSum", "fieldName": "totalCount" },
    { "type": "doubleSum", "name": "totalPriceSum", "fieldName": "price" }
  ]
}
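The request above could be assembled from a client roughly as follows. This is a sketch using only the Python standard library, not the Data Query Service code: the function name `build_campaign_totals_request` is hypothetical, and the broker URL is simply composed from the Host header and path shown on the slide. The request is built but not sent, because sending needs a reachable broker.

```python
import json
import urllib.request

# Composed from the slide's "Host" header and request path.
BROKER_URL = "http://druid-broker-node:8080/druid/v2"


def build_campaign_totals_request(campaign_id, broker_url=BROKER_URL):
    """Build (but do not send) the groupBy request for one campaign."""
    query = {
        "queryType": "groupBy",
        "dataSource": "campaign-totals-v2",
        "granularity": "all",
        "intervals": ["2012-01-01T00:00:00.000/2100-01-01T00:00:00.000"],
        "dimensions": ["campaignId", "currencyId", "currencySymbol", "currencyHNBCode"],
        "filter": {"type": "selector", "dimension": "campaignId", "value": campaign_id},
        "aggregations": [
            {"type": "longSum", "name": "totalCountSum", "fieldName": "totalCount"},
            {"type": "doubleSum", "name": "totalPriceSum", "fieldName": "price"},
        ],
    }
    return urllib.request.Request(
        broker_url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_campaign_totals_request(29680)
# Sending it would be: urllib.request.urlopen(request).read()
```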

Architecture and component overview
[Diagram: QUERY LAYER (Data Query Service) and REAL-TIME LAYER (Druid cluster)]
Response from Druid:
[
  {
    "version": "v1",
    "timestamp": "2012-01-01T00:00:00.000Z",
    "event": {
      "totalCountSum": 1000000,
      "currencyid": "2",
      "totalPriceSum": 20000,
      "currencysymbol": "€",
      "currencyhnbcode": "EUR",
      "campaignid": "29680"
    }
  }
]
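A client has to unpack the `event` object of each result row. The sketch below parses the response shown above; it is an illustration only, the function name and the output field names are hypothetical, and it relies on the dimension keys being lower-cased and string-valued exactly as they appear in this sample response.

```python
import json

# The response from the slide, reproduced as a string.
SAMPLE_RESPONSE = """
[ { "version": "v1", "timestamp": "2012-01-01T00:00:00.000Z",
    "event": { "totalCountSum": 1000000, "currencyid": "2",
               "totalPriceSum": 20000, "currencysymbol": "€",
               "currencyhnbcode": "EUR", "campaignid": "29680" } } ]
"""


def parse_campaign_totals(response_text):
    """Turn groupBy result rows into plain per-campaign totals."""
    totals = []
    for row in json.loads(response_text):
        event = row["event"]
        totals.append(
            {
                # Dimension values arrive as strings in this sample.
                "campaign_id": int(event["campaignid"]),
                "currency": event["currencyhnbcode"],
                "count": event["totalCountSum"],
                "price": float(event["totalPriceSum"]),
            }
        )
    return totals


totals = parse_campaign_totals(SAMPLE_RESPONSE)
```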

Kafka https://kafka.apache.org/
• Kafka maintains feeds of messages in categories called topics
• A distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.
FEATURES
• two messaging models incorporated in an abstraction called consumer group (group id): queue and publish-subscribe
– queue: a pool of consumers may read from a server and each message goes to one of them
– publish-subscribe: the message is broadcast to all consumers
• constant performance with respect to data size
• replay: all messages are stored and can be accessed with a sequential id number called the offset
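The two messaging models can be illustrated with a toy simulation. This is not Kafka client code: real Kafka assigns topic partitions to the consumers of a group, whereas this sketch simplifies that to per-message round-robin. The group and consumer names are hypothetical.

```python
from collections import defaultdict
from itertools import cycle


def deliver(messages, groups):
    """Simulate consumer-group delivery.

    groups maps a group id to a list of consumer names.
    Every group sees every message (publish-subscribe across groups),
    but inside a group each message goes to exactly one consumer (queue).
    """
    received = defaultdict(list)
    pickers = {group_id: cycle(consumers) for group_id, consumers in groups.items()}
    for message in messages:
        for group_id, picker in pickers.items():
            received[(group_id, next(picker))].append(message)
    return dict(received)


groups = {"billing": ["b1", "b2"], "audit": ["a1"]}
result = deliver(["m0", "m1", "m2", "m3"], groups)
```

With one consumer per group the model behaves like publish-subscribe; with all consumers in a single group it behaves like a queue.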

DRUID http://druid.io/
Druid is a fast, column-oriented, distributed, shared-nothing data store used for analytics (aggregation of data).
• Real-time Streams: Druid supports streaming data ingestion and offers insights on events immediately after they occur. Retain events indefinitely and unify real-time and historical views.
• Sub-Second Queries: Druid supports fast aggregations and sub-second OLAP queries.
• Scalable to Petabytes: Existing Druid clusters have scaled to petabytes of data and trillions of events, ingesting millions of events every second. Druid is extremely cost effective, even at scale.
• Deploy Anywhere: Druid runs on commodity hardware. Deploy it in the cloud or on-premise. Integrate with existing data systems such as Hadoop, Spark, Kafka, Storm, and Samza.

DRUID http://druid.io/
• Dimensions (things to filter on): publisher, advertiser, gender, country
• Metrics (things to aggregate over): click, price

DRUID – The Data and roll-up http://druid.io/
Raw data:
timestamp             publisher          advertiser  gender  country  click  price
2011-01-01T01:01:35Z  bieberfever.com    google.com  Male    USA      0      0.65
2011-01-01T01:03:63Z  bieberfever.com    google.com  Male    USA      0      0.62
2011-01-01T01:04:51Z  bieberfever.com    google.com  Male    USA      1      0.45
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Female  UK       0      0.87
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       0      0.99
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       1      1.53
Roll-up:
SELECT timestamp, publisher, advertiser, gender, country, SUM(click) as clicks, SUM(price) as revenue
GROUP BY granularity(timestamp), publisher, advertiser, gender, country
Rolled-up data:
timestamp             publisher          advertiser  gender  country  clicks  revenue
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male    USA      25      15.70
2011-01-01T01:00:00Z  bieberfever.com    google.com  Male    USA      42      29.18
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Male    UK       17      17.31
2011-01-01T02:00:00Z  bieberfever.com    google.com  Male    UK       170     34.01
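The roll-up query above can be mimicked in a few lines of Python to show what "GROUP BY granularity(timestamp)" does. This is a sketch assuming hourly granularity, not Druid's ingestion code. It uses the six raw rows from the slide; the rolled-up table on the slide evidently summarizes a larger data set, so the totals computed here differ from it.

```python
from collections import defaultdict

# Raw rows copied from the slide:
# (timestamp, publisher, advertiser, gender, country, click, price)
ROWS = [
    ("2011-01-01T01:01:35Z", "bieberfever.com", "google.com", "Male", "USA", 0, 0.65),
    ("2011-01-01T01:03:63Z", "bieberfever.com", "google.com", "Male", "USA", 0, 0.62),
    ("2011-01-01T01:04:51Z", "bieberfever.com", "google.com", "Male", "USA", 1, 0.45),
    ("2011-01-01T01:00:00Z", "ultratrimfast.com", "google.com", "Female", "UK", 0, 0.87),
    ("2011-01-01T02:00:00Z", "ultratrimfast.com", "google.com", "Female", "UK", 0, 0.99),
    ("2011-01-01T02:00:00Z", "ultratrimfast.com", "google.com", "Female", "UK", 1, 1.53),
]


def roll_up(rows):
    """Sum the metrics per (hour, dimensions) combination."""
    totals = defaultdict(lambda: [0, 0.0])
    for timestamp, *dimensions, click, price in rows:
        # Truncate to the hour by slicing the text, which also tolerates
        # the slide's second row (its seconds field reads 63).
        hour = timestamp[:13] + ":00:00Z"
        key = (hour, *dimensions)
        totals[key][0] += click
        totals[key][1] += price
    return {key: (clicks, round(revenue, 2)) for key, (clicks, revenue) in totals.items()}


rolled = roll_up(ROWS)
```

Six raw rows collapse into three stored rows here; the saving grows with the number of events that share an hour and a dimension combination.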

DRUID – Sharding the data - segments http://druid.io/
• Druid shards are called segments and Druid always first shards data by time. In our compacted data set, we can create two segments, one for each hour of data.
Segment sampleData_2011-01-01T01:00:00:00Z_2011-01-01T02:00:00:00Z_v1_0 contains
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male  USA  25   15.70
2011-01-01T01:00:00Z  bieberfever.com    google.com  Male  USA  42   29.18
Segment sampleData_2011-01-01T02:00:00:00Z_2011-01-01T03:00:00:00Z_v1_0 contains
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Male  UK   17   17.31
2011-01-01T02:00:00Z  bieberfever.com    google.com  Male  UK   170  34.01
• Segments are self-contained containers for the time interval of data they hold. Segments contain data stored in compressed column orientations, along with the indexes for those columns. Druid queries only understand how to scan segments.
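Sharding by time can be sketched by grouping the rolled-up rows under an hourly segment identifier. This is an illustration only, not Druid's segment format: the identifier follows the slide's pattern of data source, interval start, interval end, version, and partition number, but the timestamps are rendered as plain ISO 8601 here, and the helper names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

# Rolled-up rows copied from the slide.
ROLLED_UP = [
    ("2011-01-01T01:00:00Z", "ultratrimfast.com", "google.com", "Male", "USA", 25, 15.70),
    ("2011-01-01T01:00:00Z", "bieberfever.com", "google.com", "Male", "USA", 42, 29.18),
    ("2011-01-01T02:00:00Z", "ultratrimfast.com", "google.com", "Male", "UK", 17, 17.31),
    ("2011-01-01T02:00:00Z", "bieberfever.com", "google.com", "Male", "UK", 170, 34.01),
]


def segment_id(data_source, timestamp, version="v1", partition=0):
    """Name the hourly segment that a timestamp falls into."""
    start = datetime.strptime(timestamp, TIME_FORMAT).replace(minute=0, second=0)
    end = start + timedelta(hours=1)
    return "_".join(
        [data_source, start.strftime(TIME_FORMAT), end.strftime(TIME_FORMAT), version, str(partition)]
    )


def shard_by_hour(data_source, rows):
    """Group rows into segments keyed by their hourly segment id."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_id(data_source, row[0])].append(row)
    return dict(segments)


segments = shard_by_hour("sampleData", ROLLED_UP)
```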

DRUID – Segments http://druid.io/

DRUID – Design http://druid.io/

DRUID – Segments http://druid.io/

DRUID – Tiers and loading rules http://druid.io/
Druid cluster
• HOT TIER: fast machines, recent data segments
• COLD TIER: slow machines, older data segments
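The hot/cold split can be sketched as an ordered list of rules where the first matching rule decides the tier. The sketch is loosely modeled on Druid's `loadByPeriod` and `loadForever` rule types but is a simplification, not Druid's rule format or matching logic: periods are Python `timedelta` objects, and the tier names and the 30-day window are assumptions.

```python
from datetime import datetime, timedelta

# Simplified rules, evaluated top to bottom; the first match wins.
RULES = [
    {"type": "loadByPeriod", "period": timedelta(days=30), "tier": "hot"},
    {"type": "loadForever", "tier": "cold"},
]


def tier_for(segment_end, now, rules=RULES):
    """Return the tier whose rule first matches the segment, or None."""
    for rule in rules:
        if rule["type"] == "loadForever":
            return rule["tier"]
        # Simplification: a segment is "recent" if it ends inside the period.
        if rule["type"] == "loadByPeriod" and segment_end >= now - rule["period"]:
            return rule["tier"]
    return None


now = datetime(2016, 3, 1)
recent_tier = tier_for(datetime(2016, 2, 19, 13), now)   # ended 11 days ago
older_tier = tier_for(datetime(2015, 6, 1), now)         # ended months ago
```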

The new state of the affairs
[Diagram: Upload Tool, Customer Portal (CUP), IpCore Node X, New lambda architecture wannabe]
• t_delay < 2 sec
• query latency < 500 ms *

Numbers and conclusion
Data pipeline                                    Max. throughput (msg/s)
Ingest points → Data Ingestion Service           7700
Billing ingest point → Data Ingestion Service    5500
IpCore ingest point → Data Ingestion Service     2200
Data Ingestion Service → Kafka                   2130
Druid firehose pull and aggregate from Kafka     29000
Real-time! <2 sec delay

Numbers and conclusion
PROBLEMS / CHALLENGES ;)
• Added complexity to the flow
– Maintenance of “ingest point” code
– Maintenance of Data Ingestion Service
– Operational knowledge of Kafka / Druid
• Scaling Druid: problems with “Druid realtime nodes” and Kafka topics with multiple partitions
• Druid: exactly-once semantics are not guaranteed with real-time ingestion in Druid
– but we didn't have problems with our configuration
– definitive solution: Druid batch ingestion using Tranquility

www.infobip.com Q/A
Davor Poldrugo software engineer
REAL-TIME BIG DATA INGESTION AND QUERYING OF AGGREGATED DATA
