The rise of operational analytics

Imply
August 09, 2018

Transcript

  1. Who am I?
    Gian Merlino
    Committer & PMC member on
    Cofounder at
    10 years working on scalable systems
  2. Agenda
    • The problem
    • Operational analytics
    • Under the hood
    • The mysterious future
    • Do try this at home!
  3. The problem
    • Slice-and-dice for big data
    • Interactive exploration
    • Look under the hood of reports and dashboards
    • And we want our data fresh, too
  4. Challenges
    • Scale: when data is large, we need a lot of servers
    • Speed: aiming for sub-second response time
    • Complexity: too much fine grain to precompute
    • High dimensionality: 10s or 100s of dimensions
    • Concurrency: many users and tenants
    • Freshness: load from streams
  5. Motivation
    • Sub-second responses allow dialogue with data
    • Rapid iteration on questions
    • Remove barriers to understanding
  6. New class of data store
    • “Operational analytics” or “big OLAP” data stores
    • Examples
      ◦ Apache Druid [incubating] (open source community)
      ◦ Scuba (from Facebook)
      ◦ Pinot (from LinkedIn)
      ◦ Doris, formerly Palo (from Baidu)
      ◦ ClickHouse (from Yandex)
  7. New class of data store
    • Column oriented
    • High concurrency
    • Scalable to 100s of servers, millions of messages/sec
    • Partition key for query pruning
    • May or may not have secondary indexes
    • Query through SQL
    • Rapid queries on denormalized data
  8. Use cases
    • Clickstreams, user behavior
    • Application performance management
    • Network flows
    • IoT
    • Digital marketing
    • OLAP / business intelligence
  9. What is Druid?
    • “high performance”: low query latency, high ingest rates
    • “analytics”: counting, ranking, groupBy, time trend
    • “data store”: the cluster stores a copy of your data
    • “event-driven data”: fact data like clickstream, network flows, user behavior, digital marketing, server metrics, IoT
  10. Powered by Druid
    From Yahoo:
    “The performance is great ... some of the tables that we have internally in Druid have billions and billions of events in them, and we’re scanning them in under a second.”
    Source: https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html
  11. Key features
    • Low latency ingestion from Kafka
    • Bulk load from Hadoop
    • Can pre-aggregate data during ingestion
    • “Schema light”
    • Ad-hoc queries
    • Exact and approximate algorithms
    • Can keep a lot of history (years are ok)
  12. Raw data

    timestamp             Action  Protocol  Flows
    2011-01-01T00:01:35Z  ACCEPT  TCP       10
    2011-01-01T00:03:03Z  ACCEPT  TCP       1
    2011-01-01T00:04:51Z  REJECT  UDP       10
    2011-01-01T00:05:33Z  REJECT  UDP       10
    2011-01-01T00:05:53Z  REJECT  TCP       1
    2011-01-01T00:06:17Z  REJECT  TCP       10
    2011-01-01T00:23:15Z  ACCEPT  TCP       1
    2011-01-01T00:38:51Z  REJECT  UDP       10
    2011-01-01T00:49:33Z  REJECT  TCP       10
    2011-01-01T00:49:53Z  REJECT  TCP       1
  13. Rollup

    Rolled-up data:

    timestamp             Action  Protocol  Flows
    2011-01-01T00:00:00Z  ACCEPT  TCP       12
    2011-01-01T00:00:00Z  REJECT  TCP       22
    2011-01-01T00:00:00Z  REJECT  UDP       30

    Raw data:

    timestamp             Action  Protocol  Flows
    2011-01-01T00:01:35Z  ACCEPT  TCP       10
    2011-01-01T00:03:03Z  ACCEPT  TCP       1
    2011-01-01T00:04:51Z  REJECT  UDP       10
    2011-01-01T00:05:33Z  REJECT  UDP       10
    2011-01-01T00:05:53Z  REJECT  TCP       1
    2011-01-01T00:06:17Z  REJECT  TCP       10
    2011-01-01T00:23:15Z  ACCEPT  TCP       1
    2011-01-01T00:38:51Z  REJECT  UDP       10
    2011-01-01T00:49:33Z  REJECT  TCP       10
    2011-01-01T00:49:53Z  REJECT  TCP       1
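
The rollup on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not Druid's ingestion code: it assumes rollup means truncating each timestamp to the hour and summing Flows per (timestamp, Action, Protocol) group, which matches the numbers shown. The names `RAW`, `truncate_to_hour`, and `rollup` are illustrative only.

```python
from collections import defaultdict
from datetime import datetime

# The ten raw rows from the slide: (timestamp, action, protocol, flows).
RAW = [
    ("2011-01-01T00:01:35Z", "ACCEPT", "TCP", 10),
    ("2011-01-01T00:03:03Z", "ACCEPT", "TCP", 1),
    ("2011-01-01T00:04:51Z", "REJECT", "UDP", 10),
    ("2011-01-01T00:05:33Z", "REJECT", "UDP", 10),
    ("2011-01-01T00:05:53Z", "REJECT", "TCP", 1),
    ("2011-01-01T00:06:17Z", "REJECT", "TCP", 10),
    ("2011-01-01T00:23:15Z", "ACCEPT", "TCP", 1),
    ("2011-01-01T00:38:51Z", "REJECT", "UDP", 10),
    ("2011-01-01T00:49:33Z", "REJECT", "TCP", 10),
    ("2011-01-01T00:49:53Z", "REJECT", "TCP", 1),
]

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def truncate_to_hour(ts: str) -> str:
    """Drop minutes and seconds (assumed hourly rollup granularity)."""
    parsed = datetime.strptime(ts, TS_FORMAT)
    return parsed.replace(minute=0, second=0).strftime(TS_FORMAT)


def rollup(rows):
    """Sum flows for every (truncated timestamp, action, protocol) group."""
    totals = defaultdict(int)
    for ts, action, protocol, flows in rows:
        totals[(truncate_to_hour(ts), action, protocol)] += flows
    return [(*key, flows) for key, flows in sorted(totals.items())]


ROLLED_UP = rollup(RAW)
for row in ROLLED_UP:
    print(*row)
```

Ten raw rows collapse to three, at the cost of no longer being able to query individual events within the hour.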
  14. Sharding/partitioning data

    timestamp             Action  Protocol  Flows

    1st hour segment:
    2011-01-01T00:00:00Z  ACCEPT  TCP       12
    2011-01-01T00:00:00Z  REJECT  TCP       22
    ...

    2nd hour segment:
    2011-01-01T01:00:00Z  ACCEPT  TCP       12
    2011-01-01T01:00:00Z  REJECT  TCP       22
    ...

    3rd hour segment:
    2011-01-01T02:00:00Z  ACCEPT  TCP       12
    2011-01-01T02:00:00Z  REJECT  TCP       22
    ...
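
A rough sketch of time-based sharding, and of why it helps with query pruning, might look like the following. It assumes one segment per hour, keyed by the hour's start timestamp, and query intervals whose bounds are aligned to the hour. Only the rows visible on the slide are included, and `segment_key`, `build_segments`, and `prune` are hypothetical helper names rather than Druid APIs.

```python
from collections import defaultdict

# Only the rows visible on the slide; the "..." rows are not reproduced.
ROWS = [
    ("2011-01-01T00:00:00Z", "ACCEPT", "TCP", 12),
    ("2011-01-01T00:00:00Z", "REJECT", "TCP", 22),
    ("2011-01-01T01:00:00Z", "ACCEPT", "TCP", 12),
    ("2011-01-01T01:00:00Z", "REJECT", "TCP", 22),
    ("2011-01-01T02:00:00Z", "ACCEPT", "TCP", 12),
    ("2011-01-01T02:00:00Z", "REJECT", "TCP", 22),
]


def segment_key(ts: str) -> str:
    """Map an ISO-8601 UTC timestamp to the start of its hour."""
    return ts[:13] + ":00:00Z"


def build_segments(rows):
    """Group rows into one list per hour (one 'segment' per hour)."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_key(row[0])].append(row)
    return dict(segments)


def prune(segments, start: str, end: str):
    """Return keys of segments inside [start, end); others are never scanned.

    Assumes hour-aligned bounds, so comparing the ISO strings is enough.
    """
    return sorted(key for key in segments if start <= key < end)


SEGMENTS = build_segments(ROWS)
print(prune(SEGMENTS, "2011-01-01T01:00:00Z", "2011-01-01T02:00:00Z"))
```

A query restricted to the second hour touches one segment out of three; the other two are skipped without reading any of their rows.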
  15. Segments
    • Fundamental storage unit in Druid
    • Immutable once created
    • No contention between reads and writes
    • One thread scans one segment
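
To illustrate why immutable segments make "one thread scans one segment" straightforward, here is a small sketch using a thread pool: each worker computes a partial sum over one segment, and the partials are merged at the end. It is an illustration under stated assumptions, not Druid's query engine. The segment contents are the rows visible on the sharding slide, stored as tuples so they cannot be modified, and `scan_segment` and `total_flows` are made-up names.

```python
from concurrent.futures import ThreadPoolExecutor

# Hourly segments as immutable tuples of (action, protocol, flows) rows.
SEGMENTS = {
    "2011-01-01T00:00:00Z": (("ACCEPT", "TCP", 12), ("REJECT", "TCP", 22)),
    "2011-01-01T01:00:00Z": (("ACCEPT", "TCP", 12), ("REJECT", "TCP", 22)),
    "2011-01-01T02:00:00Z": (("ACCEPT", "TCP", 12), ("REJECT", "TCP", 22)),
}


def scan_segment(rows, action):
    """Partial aggregate for one segment: sum of flows for one action."""
    return sum(flows for row_action, _protocol, flows in rows if row_action == action)


def total_flows(segments, action):
    """Scan every segment on its own worker, then merge the partial sums.

    Segments never change after creation, so the workers need no locks.
    """
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda rows: scan_segment(rows, action), segments.values())
        return sum(partials)


print(total_flows(SEGMENTS, "REJECT"))
```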
  16. Columnar storage - compression

    Create IDs
    • Accept → 0, Reject → 1
    • TCP → 0, UDP → 1

    Store
    • Action → [0 0 1 1 1 1]
    • Protocol → [0 0 1 1 0 0]

    timestamp             Action  Protocol  Flows
    2011-01-01T00:00:00Z  ACCEPT  TCP       10
    2011-01-01T00:03:03Z  ACCEPT  TCP       1
    2011-01-01T00:04:51Z  REJECT  UDP       10
    2011-01-01T00:05:33Z  REJECT  UDP       10
    2011-01-01T00:05:53Z  REJECT  TCP       1
    2011-01-01T00:06:17Z  REJECT  TCP       10
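
The "create IDs, then store" steps amount to dictionary encoding. The sketch below assigns IDs in order of first appearance, which happens to reproduce the slide's numbering for this data; a real column store may assign IDs differently (for example, from a sorted dictionary). The function name `dictionary_encode` is illustrative.

```python
def dictionary_encode(values):
    """Replace each string with a small integer ID.

    Returns (dictionary, encoded column). IDs are assigned in order of
    first appearance, which is an assumption made for this sketch.
    """
    ids = {}
    encoded = []
    for value in values:
        if value not in ids:
            ids[value] = len(ids)
        encoded.append(ids[value])
    return ids, encoded


# The Action and Protocol columns from the six rows on the slide.
ACTION = ["ACCEPT", "ACCEPT", "REJECT", "REJECT", "REJECT", "REJECT"]
PROTOCOL = ["TCP", "TCP", "UDP", "UDP", "TCP", "TCP"]

action_ids, action_column = dictionary_encode(ACTION)
protocol_ids, protocol_column = dictionary_encode(PROTOCOL)

print(action_ids, action_column)
print(protocol_ids, protocol_column)
```

Each string is stored once in the dictionary, and the column itself becomes a list of small integers that compresses well.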
  17. Columnar storage - fast search and filter

    ACCEPT → [0, 1] → [110000]
    REJECT → [2, 3, 4, 5] → [001111]
    ACCEPT OR REJECT → [111111]
    Compression!

    timestamp             Action  Protocol  Flows
    2011-01-01T00:00:00Z  ACCEPT  TCP       10
    2011-01-01T00:03:03Z  ACCEPT  TCP       1
    2011-01-01T00:04:51Z  REJECT  UDP       10
    2011-01-01T00:05:33Z  REJECT  UDP       10
    2011-01-01T00:05:53Z  REJECT  TCP       1
    2011-01-01T00:06:17Z  REJECT  TCP       10
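
The bitmaps on this slide form an inverted index: one bitmap per distinct value, with bit i set when row i holds that value. A minimal sketch follows, using plain Python lists rather than the compressed bitmaps a production system would use. The AND example at the end is an addition for illustration and does not appear on the slide; all function names are hypothetical.

```python
def build_bitmaps(values):
    """Build one bitmap per distinct value; bit i is 1 if row i matches."""
    bitmaps = {}
    for row, value in enumerate(values):
        bitmaps.setdefault(value, [0] * len(values))[row] = 1
    return bitmaps


def bitmap_or(left, right):
    return [a | b for a, b in zip(left, right)]


def bitmap_and(left, right):
    return [a & b for a, b in zip(left, right)]


def as_string(bitmap):
    return "".join(str(bit) for bit in bitmap)


def matching_rows(bitmap):
    return [row for row, bit in enumerate(bitmap) if bit]


# The Action and Protocol columns from the six rows on the slide.
ACTION = ["ACCEPT", "ACCEPT", "REJECT", "REJECT", "REJECT", "REJECT"]
PROTOCOL = ["TCP", "TCP", "UDP", "UDP", "TCP", "TCP"]

action_index = build_bitmaps(ACTION)
protocol_index = build_bitmaps(PROTOCOL)

print(as_string(action_index["ACCEPT"]))
print(as_string(action_index["REJECT"]))
print(as_string(bitmap_or(action_index["ACCEPT"], action_index["REJECT"])))

# Not on the slide: filters on different columns combine with AND.
reject_tcp = bitmap_and(action_index["REJECT"], protocol_index["TCP"])
print(matching_rows(reject_tcp))
```

Filters become bitwise operations over the index, so only the rows whose bit survives need to be read from the other columns.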
  18. Approximate algorithms
    Approximations save storage, memory, and time!
    • Count distinct
    • Ranking (top N)
    • Histograms and quantiles
    • Set operations
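
As one concrete illustration of approximate count distinct, the sketch below uses a K-minimum-values (KMV) estimator: hash every value to a number in [0, 1), remember only the k smallest distinct hashes, and estimate the cardinality as (k - 1) divided by the k-th smallest hash. This technique was chosen for brevity and is not taken from the slides; it should not be read as the algorithm Druid uses. The names `hash01` and `approx_count_distinct` and the default k = 1024 are assumptions.

```python
import hashlib
import heapq


def hash01(value: str) -> float:
    """Hash a string to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_count_distinct(values, k=1024):
    """Estimate the number of distinct values while storing only k hashes."""
    heap = []        # max-heap (negated) holding the k smallest hashes
    members = set()  # the same hashes, for duplicate detection
    for value in values:
        h = hash01(str(value))
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash beats the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct values: exact
    return (k - 1) / -heap[0]


# 200,000 events from 50,000 distinct (made-up) user IDs.
events = [f"user-{i % 50000}" for i in range(200000)]
estimate = approx_count_distinct(events)
print(round(estimate))
```

The sketch keeps at most k hashes no matter how many events arrive, with a relative error of roughly 1/sqrt(k). That is the storage, memory, and time saving the slide refers to.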
  19. Druid roadmap
    • Parallel loading of data files without Hadoop
    • Automatic compaction
    • Smaller, faster compression (FastPFOR, etc.)
    • Subtotals, SQL “grouping sets”
    • SQL standard null handling
    • Vectorized query engine
    • Garbage-free expression engine
    • … your item here!!
  20. Download
    Druid community site (current): http://druid.io/
    Druid community site (new): https://druid.apache.org/
    Imply distribution: https://imply.io/get-started