
Building a streaming analytics stack with Kafka and Druid

January 22, 2019

The maturation of open source technologies has made it easier than ever for companies to derive insights from vast quantities of data. In this talk, we cover how analytics stacks have evolved from data warehouses, to data lakes, to more modern stream-oriented stacks, and how to build such a stack using Apache Kafka and Apache Druid.


  1. Who am I?
     • Gian Merlino
     • Committer & PMC member on Apache Druid
     • Cofounder at Imply
     • 10 years working on scalable systems
  2. Agenda
     • From warehouses to rivers
     • The problem
     • Under the hood
     • The mysterious future
     • Do try this at home!
  3. Data warehouses (Confidential. Do not redistribute.)
     • Tightly coupled architecture with limited flexibility.
     • [Diagram: data sources → ETL → data warehouse (store and compute) → analytics, reporting, data mining, querying]
  4. Data lakes
     • Modern data architectures are more application-centric.
     • [Diagram: data sources → ETL → data lake (storage) → MapReduce/Spark, apps, SQL, ML/AI, TSDB]
  5. Data rivers
     • Streaming architectures are true-to-life and enable faster decision cycles.
     • [Diagram: data sources → stream hub (like Kafka) → stream processors, real-time analytics, ML/AI, apps; ETL → archive]
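To make the stream-hub idea concrete, here is a minimal sketch of publishing events into Kafka from Python. Everything beyond "events go into a topic" is an assumption: the `clickstream` topic name, the event schema, the kafka-python package, and the `KAFKA_BOOTSTRAP` environment variable are illustrative, not from the deck.

```python
import json
import os
import time

def build_event(user_id, url):
    """Build a hypothetical clickstream event as JSON bytes.

    The field names here are illustrative; a real schema would be
    defined by the producing application.
    """
    event = {
        "timestamp": int(time.time() * 1000),  # event time, epoch millis
        "user_id": user_id,
        "url": url,
    }
    return json.dumps(event).encode("utf-8")

def send_event(event_bytes, bootstrap, topic="clickstream"):
    """Publish one event to the stream hub.

    Requires a reachable Kafka broker and the kafka-python package
    (pip install kafka-python), so the import is kept local.
    """
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers=bootstrap)
    producer.send(topic, event_bytes)
    producer.flush()  # block until the broker has the message

# Only talk to Kafka when a broker address is actually configured,
# e.g. KAFKA_BOOTSTRAP=localhost:9092
if os.environ.get("KAFKA_BOOTSTRAP"):
    send_event(build_event("user-1", "/home"), os.environ["KAFKA_BOOTSTRAP"])
```

Downstream consumers (stream processors, real-time analytics, the archive) all read the same topic, which is what makes the hub the center of the architecture.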
  6. The problem
     • Slice-and-dice for big data streams
     • Interactive exploration
     • Look under the hood of reports and dashboards
     • And we want our data fresh, too
  7. Challenges
     • Scale: when data is large, we need a lot of servers
     • Speed: aiming for sub-second response time
     • Complexity: too much fine grain to precompute
     • High dimensionality: 10s or 100s of dimensions
     • Concurrency: many users and tenants
     • Freshness: load from streams
  8. Motivation
     • Sub-second responses allow dialogue with data
     • Rapid iteration on questions
     • Remove barriers to understanding
  9. What is Druid?
     • “High performance”: low query latency, high ingest rates
     • “Analytics”: counting, ranking, groupBy, time trends
     • “Data store”: the cluster stores a copy of your data
     • “Event-driven data”: fact data like clickstreams, network flows, user behavior, digital marketing, server metrics, IoT
  10. Key features
     • Column oriented
     • High concurrency
     • Scalable to 100s of servers, millions of messages/sec
     • Indexes on all dimensions by default
     • Query through SQL
     • Rapid queries on flat tables
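On "query through SQL": Druid exposes its SQL dialect over HTTP, where a query is POSTed as JSON to a broker's `/druid/v2/sql` endpoint. The sketch below uses only the Python standard library; the `clickstream` datasource, its columns, and the `DRUID_URL` environment variable are illustrative assumptions, not part of the deck.

```python
import json
import os
import urllib.request

# A sample top-N style query against a hypothetical "clickstream"
# datasource; table and column names are made up for illustration.
SQL = """
SELECT url, COUNT(*) AS hits
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY url
ORDER BY hits DESC
LIMIT 10
"""

def build_request(broker_url, sql):
    """Druid SQL queries are JSON objects POSTed to /druid/v2/sql."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        broker_url.rstrip("/") + "/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# Only issue the query when a broker address is configured,
# e.g. DRUID_URL=http://localhost:8888
if os.environ.get("DRUID_URL"):
    req = build_request(os.environ["DRUID_URL"], SQL)
    with urllib.request.urlopen(req) as resp:
        # Default response is a JSON array of row objects.
        for row in json.loads(resp.read()):
            print(row)
```

Because results come back as plain JSON over HTTP, this fits dashboards and ad-hoc tools without a Druid-specific client library.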
  11. Use cases
     • Clickstreams, user behavior
     • Digital advertising
     • Application performance management
     • Network flows
     • IoT
  12. Powered by Druid
     From Yahoo: “The performance is great ... some of the tables that we have internally in Druid have billions and billions of events in them, and we’re scanning them in under a second.”
     Source: https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html
  13. Deployment patterns
     • Event streams (millions of events/sec) → stream hub (Kafka) → enrichment (Kafka Streams, KSQL)
     • Modern data architecture, centered around a stream hub
     • Real-time reporting & visibility
     • Real-time investigation
     • Troubleshooting / diagnostics
     • Observability
     • Fuse real-time & historical data
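Connecting the two halves of this pattern, Druid can consume a Kafka topic continuously through its Kafka indexing service, driven by a supervisor spec submitted to the Overlord API (`/druid/indexer/v1/supervisor`). The fragment below is a rough sketch of such a spec under assumptions matching the earlier examples: the `clickstream` datasource, topic, columns, and broker address are illustrative, and exact field names vary by Druid version.

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "clickstream",
      "timestampSpec": { "column": "timestamp", "format": "millis" },
      "dimensionsSpec": { "dimensions": ["user_id", "url"] },
      "granularitySpec": {
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE"
      }
    },
    "ioConfig": {
      "topic": "clickstream",
      "inputFormat": { "type": "json" },
      "consumerProperties": { "bootstrap.servers": "localhost:9092" },
      "taskCount": 1,
      "useEarliestOffset": true
    },
    "tuningConfig": { "type": "kafka" }
  }
}
```

Once the supervisor is running, Druid manages the consuming tasks itself, so fresh events become queryable within seconds of landing on the topic.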
  14. Deployment patterns
     • File dumps (hourly, daily) → data lake (Hadoop, S3) → enrichment (Spark, Hive)
     • (Slightly less) modern data architecture, centered around a data lake
     • Low-latency, interactive, ad-hoc reporting
     • High concurrency
  15. [Image-only slide]

  16. Download
     • Druid community site (current): http://druid.io/
     • Druid community site (new): https://druid.apache.org/
     • Imply distribution: https://imply.io/get-started
  17. Stay in touch
     • Join the community! http://druid.io/community
     • Free training hosted by Imply! https://imply.io/druid-days
     • Follow the Druid project on Twitter! @druidio