
Building a streaming analytics stack with Kafka and Druid

Imply
January 22, 2019


The maturation of open source technologies has made it easier than ever for companies to derive insights from vast quantities of data. In this talk, we will cover how data analytics stacks have evolved from data warehouses to data lakes to modern stream-oriented analytics stacks. We will also discuss building such a stack using Apache Kafka and Apache Druid.


Transcript

  1. It slices, it dices!! Building a streaming analytics stack with Kafka and Druid. Gian Merlino, gian@imply.io
  2. Who am I? Gian Merlino: committer & PMC member on Apache Druid, cofounder at Imply, 10 years working on scalable systems.
  3. Agenda • From warehouses to rivers • The problem • Under the hood • The mysterious future • Do try this at home!
  4. Rolling down the river

  5. Data warehouses: tightly coupled architecture with limited flexibility. (Confidential. Do not redistribute.) [Diagram: data sources → ETL → data warehouse (store and compute) → analytics: reporting, data mining, querying]
  6. Data lakes: modern data architectures are more application-centric. [Diagram: data sources → ETL → data lake storage → MapReduce, Spark → apps: SQL, ML/AI, TSDB]
  7. Data rivers: streaming architectures are true-to-life and enable faster decision cycles. [Diagram: data sources → stream hub (like Kafka) → stream processors / ETL → real-time analytics, ML/AI, apps, archive]
  8. The problem

  9. The problem

  10. The problem

  11. The problem • Slice-and-dice for big data streams • Interactive exploration • Look under the hood of reports and dashboards • And we want our data fresh, too
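The "slice-and-dice" on slide 11 can be sketched in plain Python: filter on one dimension, group by another, and aggregate a metric. This is the shape of query Druid answers at scale over streaming data; the event fields and values here are hypothetical.

```python
from collections import defaultdict

# Hypothetical clickstream events; in Druid these would be rows in a datasource.
events = [
    {"country": "US", "browser": "Chrome", "clicks": 3},
    {"country": "US", "browser": "Firefox", "clicks": 1},
    {"country": "DE", "browser": "Chrome", "clicks": 2},
    {"country": "US", "browser": "Chrome", "clicks": 5},
]

def slice_and_dice(rows, filter_key, filter_value, group_key, metric):
    """Filter on one dimension, group by another, sum a metric."""
    totals = defaultdict(int)
    for row in rows:
        if row[filter_key] == filter_value:
            totals[row[group_key]] += row[metric]
    return dict(totals)

print(slice_and_dice(events, "country", "US", "browser", "clicks"))
# {'Chrome': 8, 'Firefox': 1}
```

The hard part, which the following slide takes up, is doing this interactively when the rows number in the billions.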
  12. Challenges • Scale: when data is large, we need a lot of servers • Speed: aiming for sub-second response times • Complexity: too much fine grain to precompute • High dimensionality: 10s or 100s of dimensions • Concurrency: many users and tenants • Freshness: load from streams
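The "too much fine grain to precompute" challenge is just arithmetic: the number of pre-aggregated cells grows multiplicatively with dimension cardinalities, and there is one cube per dimension subset. A rough illustration (the cardinalities are made-up placeholders):

```python
from math import prod
from itertools import combinations

# Hypothetical per-dimension cardinalities for a clickstream datasource.
cardinalities = {"country": 200, "browser": 50, "os": 20, "page": 10_000}

# Precomputing every groupBy combination means one cube per dimension subset.
total_cells = 0
for r in range(1, len(cardinalities) + 1):
    for dims in combinations(cardinalities.values(), r):
        total_cells += prod(dims)

print(f"{total_cells:,} possible cells for just 4 dimensions")
# 2,152,925,270 possible cells for just 4 dimensions
```

With 10s or 100s of dimensions the count is astronomically larger, which is why Druid computes aggregations at query time over indexed columns rather than precomputing every combination.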
  13. Motivation • Sub-second responses allow a dialogue with data • Rapid iteration on questions • Remove barriers to understanding
  14. High performance analytics data store for event-driven data

  15. What is Druid? • “high performance”: low query latency, high ingest rates • “analytics”: counting, ranking, groupBy, time trends • “data store”: the cluster stores a copy of your data • “event-driven data”: fact data like clickstreams, network flows, user behavior, digital marketing, server metrics, IoT
  16. Key features • Column oriented • High concurrency • Scalable to 100s of servers, millions of messages/sec • Indexes on all dimensions by default • Query through SQL • Rapid queries on flat tables
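The "query through SQL" feature on slide 16 refers to Druid SQL, which brokers accept as a JSON-wrapped query over HTTP. A minimal sketch, assuming a hypothetical `clickstream` datasource and the default broker address; since it needs a live cluster, the actual request is left commented out:

```python
import json
from urllib import request

# Assumed broker address; the Druid SQL endpoint is /druid/v2/sql/.
BROKER = "http://localhost:8082/druid/v2/sql/"

def druid_sql_payload(sql):
    """Druid SQL is POSTed as JSON of the form {"query": "..."}."""
    return json.dumps({"query": sql}).encode("utf-8")

sql = """
SELECT browser, COUNT(*) AS events
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY browser
ORDER BY events DESC
LIMIT 10
"""
payload = druid_sql_payload(sql)

# Against a running cluster:
# req = request.Request(BROKER, data=payload,
#                       headers={"Content-Type": "application/json"})
# print(request.urlopen(req).read().decode())
```

`__time` is Druid's built-in timestamp column, so the query above is the "fresh data" case: a top-N over only the last hour of the stream.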
  17. Use cases • Clickstreams, user behavior • Digital advertising • Application performance management • Network flows • IoT
  18. Powered by Druid. Source: http://druid.io/druid-powered.html

  19. Powered by Druid. From Yahoo: “The performance is great ... some of the tables that we have internally in Druid have billions and billions of events in them, and we’re scanning them in under a second.” Source: https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html
  20. Integration patterns

  21. Deployment patterns: stream hub (Kafka) carrying event streams (millions of events/sec), with enrichment via Kafka Streams or KSQL • Modern data architecture • Centered around the stream hub • Real-time reporting & visibility • Real-time investigation • Troubleshooting / diagnostics • Observability • Fuse real-time & historical data
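In the Kafka-centered pattern above, Druid consumes the stream via its Kafka indexing service: a supervisor spec names the topic and the datasource schema, and is submitted to the Overlord. A rough sketch of the spec's shape as a Python dict (topic, datasource, dimensions, and addresses are placeholders; the authoritative schema is in the Druid Kafka ingestion docs):

```python
import json

# Sketch of a Druid Kafka supervisor spec; all names here are placeholders.
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "clickstream",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["country", "browser", "page"]},
            "granularitySpec": {"segmentGranularity": "HOUR"},
        },
        "ioConfig": {
            "topic": "clickstream-events",
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
        },
    },
}

# The spec is POSTed as JSON to the Overlord's supervisor endpoint,
# e.g. http://localhost:8090/druid/indexer/v1/supervisor
print(json.dumps(supervisor_spec, indent=2))
```

Once submitted, the supervisor manages Kafka consumer tasks that index events as they arrive, which is what makes the "real-time investigation" bullets above possible.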
  22. Deployment patterns: data lake (Hadoop, S3) with file dumps (hourly, daily), enrichment via Spark or Hive • (Slightly less) modern data architecture • Centered around the data lake • Low-latency, interactive, ad-hoc reporting • High concurrency
  23.

  24. Download • Druid community site (current): http://druid.io/ • Druid community site (new): https://druid.apache.org/ • Imply distribution: https://imply.io/get-started
  25. Contribute: https://github.com/apache/druid

  26. Stay in touch • Join the community! http://druid.io/community • Free training hosted by Imply! https://imply.io/druid-days • Follow the Druid project on Twitter! @druidio