Slide 1

Slide 1 text

It slices, it dices!! Building a streaming analytics stack with Kafka and Druid
Gian Merlino

Slide 2

Slide 2 text

Who am I?
Gian Merlino
Committer & PMC member on Druid
Cofounder at Imply
10 years working on scalable systems

Slide 3

Slide 3 text

Agenda
● From warehouses to rivers
● The problem
● Under the hood
● The mysterious future
● Do try this at home!

Slide 4

Slide 4 text

Rolling down the river

Slide 5

Slide 5 text

Data warehouses
Tightly coupled architecture with limited flexibility.
Confidential. Do not redistribute.
[Diagram labels: Data sources; ETL; Data warehouse; Processing; Store and Compute; Analytics; Reporting; Data mining; Querying]

Slide 6

Slide 6 text

Data lakes
Modern data architectures are more application-centric.
[Diagram labels: Data sources; MapReduce, Spark; Apps; ETL; SQL; ML/AI; TSDB; Data lake; Storage]

Slide 7

Slide 7 text

Data rivers
Streaming architectures are true-to-life and enable faster decision cycles.
[Diagram labels: Data sources; Stream processors; Stream hub (like Kafka); Real-time analytics; ML/AI; Archive; ETL; Apps]

Slide 8

Slide 8 text

The problem

Slide 9

Slide 9 text

The problem

Slide 10

Slide 10 text

The problem

Slide 11

Slide 11 text

The problem
● Slice-and-dice for big data streams
● Interactive exploration
● Look under the hood of reports and dashboards
● And we want our data fresh, too

Slide 12

Slide 12 text

Challenges
● Scale: when data is large, we need a lot of servers
● Speed: aiming for sub-second response time
● Complexity: too much fine grain to precompute
● High dimensionality: 10s or 100s of dimensions
● Concurrency: many users and tenants
● Freshness: load from streams
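A rough illustration of why "too much fine grain to precompute" and "high dimensionality" go together. This is not from the slides: it only counts the subsets of dimensions a user could group by, and ignores filters and the number of distinct values per dimension, both of which make the real number larger. The function name is hypothetical.

```python
def grouping_count(num_dimensions):
    """Number of distinct dimension subsets a query could GROUP BY.

    Each dimension is either in the grouping or not, so there are
    2 ** num_dimensions subsets (including the empty, grand-total one).
    """
    return 2 ** num_dimensions


if __name__ == "__main__":
    for d in (10, 30, 100):
        print(d, "dimensions:", grouping_count(d), "possible groupings")
```

Ten dimensions already allow 1024 groupings, and the count doubles with every dimension added, which is why answering arbitrary slice-and-dice questions from precomputed rollups alone stops being practical.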

Slide 13

Slide 13 text

Motivation
● Sub-second responses allow dialogue with data
● Rapid iteration on questions
● Remove barriers to understanding

Slide 14

Slide 14 text

high performance analytics data store for event-driven data

Slide 15

Slide 15 text

What is Druid?
● “high performance”: low query latency, high ingest rates
● “analytics”: counting, ranking, groupBy, time trend
● “data store”: the cluster stores a copy of your data
● “event-driven data”: fact data like clickstream, network flows, user behavior, digital marketing, server metrics, IoT
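To make the four "analytics" operations concrete, here is a minimal in-memory sketch in plain Python. It is not Druid code and says nothing about how Druid executes these queries; the event fields (ts, page, country) and function names are made up for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical clickstream events; field names are illustrative only.
EVENTS = [
    {"ts": "2019-05-01T10:05:00", "page": "/home", "country": "US"},
    {"ts": "2019-05-01T10:40:00", "page": "/docs", "country": "DE"},
    {"ts": "2019-05-01T11:15:00", "page": "/home", "country": "US"},
    {"ts": "2019-05-01T11:20:00", "page": "/home", "country": "FR"},
    {"ts": "2019-05-01T11:55:00", "page": "/docs", "country": "US"},
]


def count_events(events):
    """Counting: how many events are there?"""
    return len(events)


def top_n(events, dimension, n):
    """Ranking: the n most frequent values of one dimension."""
    return Counter(e[dimension] for e in events).most_common(n)


def group_by(events, dimensions):
    """groupBy: event count per combination of dimension values."""
    return dict(Counter(tuple(e[d] for d in dimensions) for e in events))


def hourly_trend(events):
    """Time trend: event count per hour bucket, in time order."""
    buckets = defaultdict(int)
    for e in events:
        hour = datetime.fromisoformat(e["ts"]).replace(minute=0, second=0)
        buckets[hour.isoformat()] += 1
    return dict(sorted(buckets.items()))


if __name__ == "__main__":
    print(count_events(EVENTS))
    print(top_n(EVENTS, "page", 1))
    print(group_by(EVENTS, ["page", "country"]))
    print(hourly_trend(EVENTS))
```

All four are aggregations over flat fact rows, filtered and bucketed by time and by dimension values.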

Slide 16

Slide 16 text

Key features
● Column oriented
● High concurrency
● Scalable to 100s of servers, millions of messages/sec
● Indexes on all dimensions by default
● Query through SQL
● Rapid queries on flat tables
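As an example of "Query through SQL", the sketch below builds an HTTP request for Druid's SQL endpoint using only the Python standard library. Assumptions to check against your own setup: the host and port (localhost:8888 here), the datasource name "clickstream", and the column name "page" are all hypothetical. The /druid/v2/sql/ path and the {"query": ...} JSON body follow the Druid SQL HTTP API as I understand it.

```python
import json
import urllib.request

# Assumption: a Druid router or broker reachable at this address.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql/"

# Hypothetical datasource ("clickstream") and dimension ("page").
TOP_PAGES_SQL = """
SELECT page, COUNT(*) AS views
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY page
ORDER BY views DESC
LIMIT 10
"""


def build_sql_request(sql, url=DRUID_SQL_URL):
    """Wrap a SQL string in a JSON POST request for the SQL endpoint."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRUID_SQL_URL):
    """Send the query and decode the JSON response (needs a live cluster)."""
    with urllib.request.urlopen(build_sql_request(sql, url)) as resp:
        return json.loads(resp.read())
```

Calling run_query(TOP_PAGES_SQL) against a running cluster that has such a datasource should return the ten most viewed pages in the last hour; build_sql_request can be inspected without any cluster.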

Slide 17

Slide 17 text

Use cases
● Clickstreams, user behavior
● Digital advertising
● Application performance management
● Network flows
● IoT

Slide 18

Slide 18 text

Powered by Druid
Source: http://druid.io/druid-powered.html

Slide 19

Slide 19 text

Powered by Druid
From Yahoo:
“The performance is great ... some of the tables that we have internally in Druid have billions and billions of events in them, and we’re scanning them in under a second.”
Source: https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html

Slide 20

Slide 20 text

Integration patterns

Slide 21

Slide 21 text

Deployment patterns
● Modern data architecture
● Centered around stream hub
● Real-time reporting & visibility
● Real-time investigation
● Troubleshooting / diagnostics
● Observability
● Fuse realtime & historical data
[Diagram labels: Event streams (millions of events/sec); Stream hub (Kafka); Enrichment (Kafka Streams, KSQL)]
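For the stream-hub pattern, Druid reads from Kafka through an ingestion "supervisor" described by a JSON spec. The sketch below builds such a spec as a Python dict. Treat it as an assumption-laden outline, not a reference: the field layout follows my understanding of the Kafka supervisor spec in recent Druid releases (older releases used a different layout), and the datasource, topic, broker address, and column names are hypothetical. Check the Druid Kafka ingestion documentation for your version before submitting anything.

```python
import json


def kafka_supervisor_spec(datasource, topic, bootstrap_servers, dimensions):
    """Build a minimal Kafka supervisor spec (assumed recent-Druid layout)."""
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Assumes each event carries an ISO-8601 "timestamp" field.
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": list(dimensions)},
                "granularitySpec": {
                    "segmentGranularity": "HOUR",
                    "queryGranularity": "NONE",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
                "useEarliestOffset": True,
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


if __name__ == "__main__":
    # Hypothetical names; replace with your own datasource, topic, and brokers.
    spec = kafka_supervisor_spec(
        "clickstream", "clicks", "localhost:9092", ["page", "country"]
    )
    print(json.dumps(spec, indent=2))
```

The printed JSON is what would be POSTed to the supervisor endpoint (to my knowledge /druid/indexer/v1/supervisor); after that the cluster keeps consuming from the topic on its own.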

Slide 22

Slide 22 text

Deployment patterns
● (Slightly less) modern data architecture
● Centered around data lake
● Low latency, interactive, ad-hoc reporting
● High concurrency
[Diagram labels: File dumps (hourly, daily); Data lake (Hadoop, S3); Enrichment (Spark, Hive)]

Slide 23

Slide 23 text


Slide 24

Slide 24 text

Download
Druid community site (current): http://druid.io/
Druid community site (new): https://druid.apache.org/
Imply distribution: https://imply.io/get-started

Slide 25

Slide 25 text

Contribute
https://github.com/apache/druid

Slide 26

Slide 26 text

Stay in touch
Follow the Druid project on Twitter! @druidio
Join the community! http://druid.io/community
Free training hosted by Imply! https://imply.io/druid-days