
Building a streaming analytics stack with Kafka and Druid

Imply
January 22, 2019


The maturation of open source technologies has made it easier than ever for companies to derive insights from vast quantities of data. In this talk, we will cover how data analytics stacks have evolved from data warehouses to data lakes to modern stream-oriented analytics stacks. We will also discuss building such a stack using Apache Kafka and Apache Druid.


Transcript

  1. It slices, it dices!! Building a streaming analytics stack with Kafka and Druid. Gian Merlino, gian@imply.io
  2. Who am I? Gian Merlino: committer & PMC member on Apache Druid, cofounder at Imply, 10 years working on scalable systems.
  3. Agenda • From warehouses to rivers • The problem • Under the hood • The mysterious future • Do try this at home!
  4. Rolling down the river

  5. Data warehouses: tightly coupled architecture with limited flexibility. (Confidential. Do not redistribute.) [Diagram: data sources → ETL → data warehouse (store and compute) → analytics: reporting, data mining, querying]
  6. Data lakes: modern data architectures are more application-centric. [Diagram: data sources → ETL → data lake storage → MapReduce, Spark → apps: SQL, ML/AI, TSDB]
  7. Data rivers: streaming architectures are true-to-life and enable faster decision cycles. [Diagram: data sources → stream hub (like Kafka) → stream processors / ETL → real-time analytics, ML/AI, apps, archive]
  8. The problem

  9. The problem

  10. The problem

  11. The problem • Slice-and-dice for big data streams • Interactive exploration • Look under the hood of reports and dashboards • And we want our data fresh, too
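The "slice-and-dice" on slide 11 can be sketched in plain Python: filter on one dimension, group by another, and aggregate a metric. This is the shape of query Druid answers at scale over streaming data; the event fields and values here are hypothetical.

```python
from collections import defaultdict

# Hypothetical clickstream events; in Druid these would be rows in a datasource.
events = [
    {"country": "US", "browser": "Chrome", "clicks": 3},
    {"country": "US", "browser": "Firefox", "clicks": 1},
    {"country": "DE", "browser": "Chrome", "clicks": 2},
    {"country": "US", "browser": "Chrome", "clicks": 5},
]

def slice_and_dice(rows, filter_key, filter_value, group_key, metric):
    """Filter on one dimension, group by another, sum a metric."""
    totals = defaultdict(int)
    for row in rows:
        if row[filter_key] == filter_value:
            totals[row[group_key]] += row[metric]
    return dict(totals)

print(slice_and_dice(events, "country", "US", "browser", "clicks"))
# {'Chrome': 8, 'Firefox': 1}
```

The hard part, which the following slide takes up, is doing this interactively when the rows number in the billions.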
  12. Challenges • Scale: when data is large, we need a lot of servers • Speed: aiming for sub-second response times • Complexity: too much fine grain to precompute • High dimensionality: 10s or 100s of dimensions • Concurrency: many users and tenants • Freshness: load from streams
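The "too much fine grain to precompute" challenge is just arithmetic: the number of pre-aggregated cells grows multiplicatively with dimension cardinalities, and there is one cube per dimension subset. A rough illustration (the cardinalities are made-up placeholders):

```python
from math import prod
from itertools import combinations

# Hypothetical per-dimension cardinalities for a clickstream datasource.
cardinalities = {"country": 200, "browser": 50, "os": 20, "page": 10_000}

# Precomputing every groupBy combination means one cube per dimension subset.
total_cells = 0
for r in range(1, len(cardinalities) + 1):
    for dims in combinations(cardinalities.values(), r):
        total_cells += prod(dims)

print(f"{total_cells:,} possible cells for just 4 dimensions")
# 2,152,925,270 possible cells for just 4 dimensions
```

With 10s or 100s of dimensions the count is astronomically larger, which is why Druid computes aggregations at query time over indexed columns rather than precomputing every combination.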
  13. Motivation • Sub-second responses allow a dialogue with data • Rapid iteration on questions • Remove barriers to understanding
  14. High performance analytics data store for event-driven data

  15. What is Druid? • “high performance”: low query latency, high ingest rates • “analytics”: counting, ranking, groupBy, time trends • “data store”: the cluster stores a copy of your data • “event-driven data”: fact data like clickstreams, network flows, user behavior, digital marketing, server metrics, IoT
  16. Key features • Column oriented • High concurrency • Scalable to 100s of servers, millions of messages/sec • Indexes on all dimensions by default • Query through SQL • Rapid queries on flat tables
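The "query through SQL" feature on slide 16 refers to Druid SQL, which brokers accept as a JSON-wrapped query over HTTP. A minimal sketch, assuming a hypothetical `clickstream` datasource and the default broker address; since it needs a live cluster, the actual request is left commented out:

```python
import json
from urllib import request

# Assumed broker address; the Druid SQL endpoint is /druid/v2/sql/.
BROKER = "http://localhost:8082/druid/v2/sql/"

def druid_sql_payload(sql):
    """Druid SQL is POSTed as JSON of the form {"query": "..."}."""
    return json.dumps({"query": sql}).encode("utf-8")

sql = """
SELECT browser, COUNT(*) AS events
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY browser
ORDER BY events DESC
LIMIT 10
"""
payload = druid_sql_payload(sql)

# Against a running cluster:
# req = request.Request(BROKER, data=payload,
#                       headers={"Content-Type": "application/json"})
# print(request.urlopen(req).read().decode())
```

`__time` is Druid's built-in timestamp column, so the query above is the "fresh data" case: a top-N over only the last hour of the stream.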
  17. Use cases • Clickstreams, user behavior • Digital advertising • Application performance management • Network flows • IoT
  18. Powered by Druid. Source: http://druid.io/druid-powered.html

  19. Powered by Druid. From Yahoo: “The performance is great ... some of the tables that we have internally in Druid have billions and billions of events in them, and we’re scanning them in under a second.” Source: https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html
  20. Integration patterns

  21. Deployment patterns: stream hub (Kafka) carrying event streams (millions of events/sec), with enrichment via Kafka Streams or KSQL • Modern data architecture • Centered around the stream hub • Real-time reporting & visibility • Real-time investigation • Troubleshooting / diagnostics • Observability • Fuse real-time & historical data
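In the Kafka-centered pattern above, Druid consumes the stream via its Kafka indexing service: a supervisor spec names the topic and the datasource schema, and is submitted to the Overlord. A rough sketch of the spec's shape as a Python dict (topic, datasource, dimensions, and addresses are placeholders; the authoritative schema is in the Druid Kafka ingestion docs):

```python
import json

# Sketch of a Druid Kafka supervisor spec; all names here are placeholders.
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "clickstream",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["country", "browser", "page"]},
            "granularitySpec": {"segmentGranularity": "HOUR"},
        },
        "ioConfig": {
            "topic": "clickstream-events",
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
        },
    },
}

# The spec is POSTed as JSON to the Overlord's supervisor endpoint,
# e.g. http://localhost:8090/druid/indexer/v1/supervisor
print(json.dumps(supervisor_spec, indent=2))
```

Once submitted, the supervisor manages Kafka consumer tasks that index events as they arrive, which is what makes the "real-time investigation" bullets above possible.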
  22. Deployment patterns: data lake (Hadoop, S3) with file dumps (hourly, daily), enrichment via Spark or Hive • (Slightly less) modern data architecture • Centered around the data lake • Low-latency, interactive, ad-hoc reporting • High concurrency
  23.

  24. Download • Druid community site (current): http://druid.io/ • Druid community site (new): https://druid.apache.org/ • Imply distribution: https://imply.io/get-started
  25. Contribute: https://github.com/apache/druid

  26. Stay in touch • Join the community! http://druid.io/community • Free training hosted by Imply! https://imply.io/druid-days • Follow the Druid project on Twitter! @druidio