The rise of operational analytics

Imply
August 09, 2018

Transcript

  1. It slices, it dices, it… drills?! The rise of operational analytic data stores. Gian Merlino (gian@imply.io)
  2. Who am I? Gian Merlino. Committer & PMC member on Apache Druid. Cofounder at Imply. 10 years working on scalable systems.
  3. Agenda
     • The problem
     • Operational analytics
     • Under the hood
     • The mysterious future
     • Do try this at home!
  4. The problem

  5. The problem

  6. The problem

  7. The problem
     • Slice-and-dice for big data
     • Interactive exploration
     • Look under the hood of reports and dashboards
     • And we want our data fresh, too
  8. Challenges
     • Scale: when data is large, we need a lot of servers
     • Speed: aiming for sub-second response time
     • Complexity: too much fine grain to precompute
     • High dimensionality: 10s or 100s of dimensions
     • Concurrency: many users and tenants
     • Freshness: load from streams
  9. Motivation
     • Sub-second responses allow a dialogue with the data
     • Rapid iteration on questions
     • Remove barriers to understanding
  10. Operational analytics

  11. New class of data store
      • “Operational analytics” or “big OLAP” data stores
      • Examples:
        ◦ Apache Druid [incubating] (open source community)
        ◦ Scuba (from Facebook)
        ◦ Pinot (from LinkedIn)
        ◦ Doris, formerly Palo (from Baidu)
        ◦ ClickHouse (from Yandex)
  12. New class of data store
      • Column oriented
      • High concurrency
      • Scalable to 100s of servers, millions of messages/sec
      • Partition key for query pruning
      • May or may not have secondary indexes
      • Query through SQL
      • Rapid queries on denormalized data
  13. Use cases
      • Clickstreams, user behavior
      • Application performance management
      • Network flows
      • IoT
      • Digital marketing
      • OLAP / business intelligence
  14. Druid: high performance analytics data store for event-driven data
  15. What is Druid?
      • “high performance”: low query latency, high ingest rates
      • “analytics”: counting, ranking, groupBy, time trends
      • “data store”: the cluster stores a copy of your data
      • “event-driven data”: fact data like clickstream, network flows, user behavior, digital marketing, server metrics, IoT
  16. Powered by Druid. Source: http://druid.io/druid-powered.html

  17. Powered by Druid. From Yahoo: “The performance is great ... some of the tables that we have internally in Druid have billions and billions of events in them, and we’re scanning them in under a second.” Source: https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html
  18. Key features
      • Low latency ingestion from Kafka
      • Bulk load from Hadoop
      • Can pre-aggregate data during ingestion
      • “Schema light”
      • Ad-hoc queries
      • Exact and approximate algorithms
      • Can keep a lot of history (years are ok)
  19. Druid makes interactive data exploration fast and flexible, and powers analytic applications.
  20. Under the hood

  21. Raw data

      timestamp             Action  Protocol  Flows
      2011-01-01T00:01:35Z  ACCEPT  TCP       10
      2011-01-01T00:03:03Z  ACCEPT  TCP       1
      2011-01-01T00:04:51Z  REJECT  UDP       10
      2011-01-01T00:05:33Z  REJECT  UDP       10
      2011-01-01T00:05:53Z  REJECT  TCP       1
      2011-01-01T00:06:17Z  REJECT  TCP       10
      2011-01-01T00:23:15Z  ACCEPT  TCP       1
      2011-01-01T00:38:51Z  REJECT  UDP       10
      2011-01-01T00:49:33Z  REJECT  TCP       10
      2011-01-01T00:49:53Z  REJECT  TCP       1
  22. Rollup

      Before (raw data):
      timestamp             Action  Protocol  Flows
      2011-01-01T00:01:35Z  ACCEPT  TCP       10
      2011-01-01T00:03:03Z  ACCEPT  TCP       1
      2011-01-01T00:04:51Z  REJECT  UDP       10
      2011-01-01T00:05:33Z  REJECT  UDP       10
      2011-01-01T00:05:53Z  REJECT  TCP       1
      2011-01-01T00:06:17Z  REJECT  TCP       10
      2011-01-01T00:23:15Z  ACCEPT  TCP       1
      2011-01-01T00:38:51Z  REJECT  UDP       10
      2011-01-01T00:49:33Z  REJECT  TCP       10
      2011-01-01T00:49:53Z  REJECT  TCP       1

      After (rolled up to the hour):
      timestamp             Action  Protocol  Flows
      2011-01-01T00:00:00Z  ACCEPT  TCP       12
      2011-01-01T00:00:00Z  REJECT  TCP       22
      2011-01-01T00:00:00Z  REJECT  UDP       30
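The rollup on this slide can be sketched in a few lines of Python: truncate each timestamp to hour granularity, then sum the Flows metric per (hour, Action, Protocol) group. This is an illustrative sketch of the idea, not Druid's actual ingestion code.

```python
from collections import defaultdict

# Raw flow events from the slide: (timestamp, action, protocol, flows)
raw = [
    ("2011-01-01T00:01:35Z", "ACCEPT", "TCP", 10),
    ("2011-01-01T00:03:03Z", "ACCEPT", "TCP", 1),
    ("2011-01-01T00:04:51Z", "REJECT", "UDP", 10),
    ("2011-01-01T00:05:33Z", "REJECT", "UDP", 10),
    ("2011-01-01T00:05:53Z", "REJECT", "TCP", 1),
    ("2011-01-01T00:06:17Z", "REJECT", "TCP", 10),
    ("2011-01-01T00:23:15Z", "ACCEPT", "TCP", 1),
    ("2011-01-01T00:38:51Z", "REJECT", "UDP", 10),
    ("2011-01-01T00:49:33Z", "REJECT", "TCP", 10),
    ("2011-01-01T00:49:53Z", "REJECT", "TCP", 1),
]

def rollup(rows):
    """Aggregate rows by (hour, action, protocol), summing the Flows metric."""
    totals = defaultdict(int)
    for ts, action, protocol, flows in rows:
        hour = ts[:13] + ":00:00Z"  # truncate the timestamp to the hour
        totals[(hour, action, protocol)] += flows
    return dict(totals)

print(rollup(raw))
# ACCEPT/TCP -> 12, REJECT/TCP -> 22, REJECT/UDP -> 30, matching the slide
```

Ten raw rows collapse into three, which is where rollup's storage savings come from.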
  23. Sharding/partitioning data

      1st hour segment:
      timestamp             Action  Protocol  Flows
      2011-01-01T00:00:00Z  ACCEPT  TCP       12
      2011-01-01T00:00:00Z  REJECT  TCP       22
      ...

      2nd hour segment:
      2011-01-01T01:00:00Z  ACCEPT  TCP       12
      2011-01-01T01:00:00Z  REJECT  TCP       22
      ...

      3rd hour segment:
      2011-01-01T02:00:00Z  ACCEPT  TCP       12
      2011-01-01T02:00:00Z  REJECT  TCP       22
      ...
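Because each segment covers a known time interval, a query with a time filter only has to touch the segments that overlap it (the "query pruning" from slide 12). A hypothetical sketch of that pruning, using ISO-8601 hour strings as partition keys:

```python
# Hypothetical sketch: segments keyed by the hour they cover. A query's time
# filter prunes every segment whose hour falls outside the filter interval.
segments = {
    "2011-01-01T00": "1st hour segment",
    "2011-01-01T01": "2nd hour segment",
    "2011-01-01T02": "3rd hour segment",
}

def prune(segments, start_hour, end_hour):
    """Keep only segments whose hour key falls in [start_hour, end_hour)."""
    return [name for hour, name in sorted(segments.items())
            if start_hour <= hour < end_hour]

# A query over the second and third hours never scans the first segment.
print(prune(segments, "2011-01-01T01", "2011-01-01T03"))
# ['2nd hour segment', '3rd hour segment']
```

ISO-8601 timestamps sort lexicographically, so plain string comparison is enough here.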
  24. Segments
      • Fundamental storage unit in Druid
      • Immutable once created
      • No contention between reads and writes
      • One thread scans one segment
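The "one thread scans one segment" model can be illustrated with a thread pool: each worker scans one immutable segment independently, and the per-segment partial results are merged at the end. This is a toy sketch of the concurrency pattern, not Druid's query engine.

```python
from concurrent.futures import ThreadPoolExecutor

# Immutable segments: each is a list of (action, protocol, flows) rows.
segments = [
    [("ACCEPT", "TCP", 12), ("REJECT", "TCP", 22)],
    [("ACCEPT", "TCP", 12), ("REJECT", "TCP", 22)],
    [("ACCEPT", "TCP", 12), ("REJECT", "UDP", 30)],
]

def scan_segment(segment):
    """One thread scans one segment: sum Flows over the ACCEPT rows."""
    return sum(flows for action, _, flows in segment if action == "ACCEPT")

# Scan segments in parallel, then merge the per-segment partial sums.
with ThreadPoolExecutor() as pool:
    total = sum(pool.map(scan_segment, segments))

print(total)  # 36
```

Because segments are immutable, the scanning threads need no locks, which is what "no contention between reads and writes" buys.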
  25. Columnar storage: compression

      timestamp             Action  Protocol  Flows
      2011-01-01T00:00:00Z  ACCEPT  TCP       10
      2011-01-01T00:03:03Z  ACCEPT  TCP       1
      2011-01-01T00:04:51Z  REJECT  UDP       10
      2011-01-01T00:05:33Z  REJECT  UDP       10
      2011-01-01T00:05:53Z  REJECT  TCP       1
      2011-01-01T00:06:17Z  REJECT  TCP       10

      Create IDs: Accept → 0, Reject → 1; TCP → 0, UDP → 1
      Store: Action → [0 0 1 1 1 1], Protocol → [0 0 1 1 0 0]
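This dictionary encoding is easy to sketch: assign each distinct string an integer ID in order of first appearance, then store the column as the ID sequence. An illustrative sketch, not Druid's actual on-disk format:

```python
def dict_encode(values):
    """Assign each distinct string an integer ID and encode the column."""
    ids = {}       # string -> integer ID, in order of first appearance
    encoded = []   # the column, stored as small integers
    for v in values:
        if v not in ids:
            ids[v] = len(ids)
        encoded.append(ids[v])
    return ids, encoded

actions = ["ACCEPT", "ACCEPT", "REJECT", "REJECT", "REJECT", "REJECT"]
protocols = ["TCP", "TCP", "UDP", "UDP", "TCP", "TCP"]

print(dict_encode(actions))    # ({'ACCEPT': 0, 'REJECT': 1}, [0, 0, 1, 1, 1, 1])
print(dict_encode(protocols))  # ({'TCP': 0, 'UDP': 1}, [0, 0, 1, 1, 0, 0])
```

With only a handful of distinct values, each cell shrinks from a string to a small integer that compresses very well.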
  26. Columnar storage: fast search and filter

      timestamp             Action  Protocol  Flows
      2011-01-01T00:00:00Z  ACCEPT  TCP       10
      2011-01-01T00:03:03Z  ACCEPT  TCP       1
      2011-01-01T00:04:51Z  REJECT  UDP       10
      2011-01-01T00:05:33Z  REJECT  UDP       10
      2011-01-01T00:05:53Z  REJECT  TCP       1
      2011-01-01T00:06:17Z  REJECT  TCP       10

      ACCEPT → rows [0, 1] → bitmap [110000]
      REJECT → rows [2, 3, 4, 5] → bitmap [001111]
      ACCEPT OR REJECT → [111111]
      Compression!
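The bitmap index above can be built from the dictionary-encoded column: one bitmap per distinct value, with bit i set when row i holds that value. OR-ing filters then becomes a bitwise OR. A toy sketch using plain lists of bits (real systems use compressed bitmaps such as Roaring or CONCISE):

```python
def bitmap_index(encoded_column):
    """One bitmap per value ID: bit i is set if row i holds that value."""
    n = len(encoded_column)
    bitmaps = {}
    for row, value_id in enumerate(encoded_column):
        bitmaps.setdefault(value_id, [0] * n)[row] = 1
    return bitmaps

action_column = [0, 0, 1, 1, 1, 1]  # 0 = ACCEPT, 1 = REJECT, as on slide 25
bitmaps = bitmap_index(action_column)

print(bitmaps[0])  # [1, 1, 0, 0, 0, 0]  rows matching ACCEPT
print(bitmaps[1])  # [0, 0, 1, 1, 1, 1]  rows matching REJECT

# "ACCEPT OR REJECT" is a bitwise OR of the two bitmaps: every row matches.
union = [a | b for a, b in zip(bitmaps[0], bitmaps[1])]
print(union)       # [1, 1, 1, 1, 1, 1]
```

Evaluating a filter this way touches only the small bitmaps, never the raw rows, which is what makes search and filter fast.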
  27. Approximate algorithms
      Approximations save storage, memory, and time!
      • Count distinct
      • Ranking (top N)
      • Histograms and quantiles
      • Set operations
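To show the flavor of approximate count-distinct, here is a toy K-Minimum-Values (KMV) sketch: hash every value into [0, 1), keep only the k smallest hashes, and estimate the cardinality from how tightly packed they are. Druid's real sketches (HyperLogLog, theta sketches) are more sophisticated; this is just the underlying idea.

```python
import hashlib

def kmv_estimate(values, k=256):
    """Toy K-Minimum-Values sketch: estimate count-distinct from the k-th
    smallest hash, with hashes mapped uniformly into [0, 1)."""
    hashes = sorted({int(hashlib.sha1(str(v).encode()).hexdigest(), 16) / 2**160
                     for v in values})
    if len(hashes) < k:
        return len(hashes)  # fewer than k distinct hashes: count is exact
    # If n distinct values hash uniformly into [0, 1), the k-th smallest
    # hash lands near k/n, so (k - 1) / hashes[k - 1] estimates n.
    return int((k - 1) / hashes[k - 1])

exact = 10_000
estimate = kmv_estimate(range(exact))
print(exact, estimate)  # the estimate lands within a few percent of 10,000
```

The sketch needs only k hashes of memory no matter how many values stream through, which is the storage and memory saving the slide refers to.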
  28. Architecture
      Streams / Files → Indexers → Segments → Historicals → Brokers → Queries
  29. The mysterious future

  30. Druid roadmap
      • Parallel loading of data files without Hadoop
      • Automatic compaction
      • Smaller, faster compression (FastPFOR, etc.)
      • Subtotals, SQL “grouping sets”
      • SQL standard null handling
      • Vectorized query engine
      • Garbage-free expression engine
      • … your item here!!
  31. Try this at home

  32. Download
      Druid community site (current): http://druid.io/
      Druid community site (new): https://druid.apache.org/
      Imply distribution: https://imply.io/get-started
  33. Contribute: https://github.com/apache/incubator-druid

  34. Stay in touch: @druidio, http://druid.io/community