The rise of operational analytics

Imply
August 09, 2018

Transcript

  1. It slices, it dices, it… drills?! The rise of operational analytic data stores. Gian Merlino (gian@imply.io)
  2. Who am I? Gian Merlino. Committer & PMC member on Apache Druid. Cofounder at Imply. 10 years working on scalable systems.
  3. Agenda
     • The problem
     • Operational analytics
     • Under the hood
     • The mysterious future
     • Do try this at home!
  4. The problem

  5. The problem

  6. The problem

  7. The problem
     • Slice-and-dice for big data
     • Interactive exploration
     • Look under the hood of reports and dashboards
     • And we want our data fresh, too
  8. Challenges
     • Scale: when data is large, we need a lot of servers
     • Speed: aiming for sub-second response time
     • Complexity: too much fine grain to precompute
     • High dimensionality: 10s or 100s of dimensions
     • Concurrency: many users and tenants
     • Freshness: load from streams
  9. Motivation
     • Sub-second responses allow a dialogue with the data
     • Rapid iteration on questions
     • Remove barriers to understanding
  10. Operational analytics

  11. New class of data store
      • “Operational analytics” or “big OLAP” data stores
      • Examples:
        ◦ Apache Druid [incubating] (open source community)
        ◦ Scuba (from Facebook)
        ◦ Pinot (from LinkedIn)
        ◦ Doris, formerly Palo (from Baidu)
        ◦ ClickHouse (from Yandex)
  12. New class of data store
      • Column oriented
      • High concurrency
      • Scalable to 100s of servers, millions of messages/sec
      • Partition key for query pruning
      • May or may not have secondary indexes
      • Query through SQL
      • Rapid queries on denormalized data
  13. Use cases
      • Clickstreams, user behavior
      • Application performance management
      • Network flows
      • IoT
      • Digital marketing
      • OLAP / business intelligence
  14. Druid: high performance analytics data store for event-driven data
  15. What is Druid?
      • “high performance”: low query latency, high ingest rates
      • “analytics”: counting, ranking, groupBy, time trends
      • “data store”: the cluster stores a copy of your data
      • “event-driven data”: fact data like clickstream, network flows, user behavior, digital marketing, server metrics, IoT
  16. Powered by Druid. Source: http://druid.io/druid-powered.html

  17. Powered by Druid. From Yahoo: “The performance is great ... some of the tables that we have internally in Druid have billions and billions of events in them, and we’re scanning them in under a second.” Source: https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html
  18. Key features
      • Low latency ingestion from Kafka
      • Bulk load from Hadoop
      • Can pre-aggregate data during ingestion
      • “Schema light”
      • Ad-hoc queries
      • Exact and approximate algorithms
      • Can keep a lot of history (years are ok)
  19. Druid makes interactive data exploration fast and flexible, and powers analytic applications.
  20. Under the hood

  21. Raw data

      timestamp             Action  Protocol  Flows
      2011-01-01T00:01:35Z  ACCEPT  TCP       10
      2011-01-01T00:03:03Z  ACCEPT  TCP       1
      2011-01-01T00:04:51Z  REJECT  UDP       10
      2011-01-01T00:05:33Z  REJECT  UDP       10
      2011-01-01T00:05:53Z  REJECT  TCP       1
      2011-01-01T00:06:17Z  REJECT  TCP       10
      2011-01-01T00:23:15Z  ACCEPT  TCP       1
      2011-01-01T00:38:51Z  REJECT  UDP       10
      2011-01-01T00:49:33Z  REJECT  TCP       10
      2011-01-01T00:49:53Z  REJECT  TCP       1
  22. Rollup

      Before (raw data):
      timestamp             Action  Protocol  Flows
      2011-01-01T00:01:35Z  ACCEPT  TCP       10
      2011-01-01T00:03:03Z  ACCEPT  TCP       1
      2011-01-01T00:04:51Z  REJECT  UDP       10
      2011-01-01T00:05:33Z  REJECT  UDP       10
      2011-01-01T00:05:53Z  REJECT  TCP       1
      2011-01-01T00:06:17Z  REJECT  TCP       10
      2011-01-01T00:23:15Z  ACCEPT  TCP       1
      2011-01-01T00:38:51Z  REJECT  UDP       10
      2011-01-01T00:49:33Z  REJECT  TCP       10
      2011-01-01T00:49:53Z  REJECT  TCP       1

      After (rolled up to the hour):
      timestamp             Action  Protocol  Flows
      2011-01-01T00:00:00Z  ACCEPT  TCP       12
      2011-01-01T00:00:00Z  REJECT  TCP       22
      2011-01-01T00:00:00Z  REJECT  UDP       30
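The rollup on this slide can be sketched in a few lines of Python: truncate each timestamp to hour granularity, then sum the Flows metric per (hour, Action, Protocol) group. This is an illustrative sketch of the idea, not Druid's actual ingestion code.

```python
from collections import defaultdict

# Raw flow events from the slide: (timestamp, action, protocol, flows)
raw = [
    ("2011-01-01T00:01:35Z", "ACCEPT", "TCP", 10),
    ("2011-01-01T00:03:03Z", "ACCEPT", "TCP", 1),
    ("2011-01-01T00:04:51Z", "REJECT", "UDP", 10),
    ("2011-01-01T00:05:33Z", "REJECT", "UDP", 10),
    ("2011-01-01T00:05:53Z", "REJECT", "TCP", 1),
    ("2011-01-01T00:06:17Z", "REJECT", "TCP", 10),
    ("2011-01-01T00:23:15Z", "ACCEPT", "TCP", 1),
    ("2011-01-01T00:38:51Z", "REJECT", "UDP", 10),
    ("2011-01-01T00:49:33Z", "REJECT", "TCP", 10),
    ("2011-01-01T00:49:53Z", "REJECT", "TCP", 1),
]

def rollup(rows):
    """Aggregate rows by (hour, action, protocol), summing the Flows metric."""
    totals = defaultdict(int)
    for ts, action, protocol, flows in rows:
        hour = ts[:13] + ":00:00Z"  # truncate the timestamp to the hour
        totals[(hour, action, protocol)] += flows
    return dict(totals)

print(rollup(raw))
# ACCEPT/TCP -> 12, REJECT/TCP -> 22, REJECT/UDP -> 30, matching the slide
```

Ten raw rows collapse into three, which is where rollup's storage savings come from.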
  23. Sharding/partitioning data

      1st hour segment:
      timestamp             Action  Protocol  Flows
      2011-01-01T00:00:00Z  ACCEPT  TCP       12
      2011-01-01T00:00:00Z  REJECT  TCP       22
      ...

      2nd hour segment:
      2011-01-01T01:00:00Z  ACCEPT  TCP       12
      2011-01-01T01:00:00Z  REJECT  TCP       22
      ...

      3rd hour segment:
      2011-01-01T02:00:00Z  ACCEPT  TCP       12
      2011-01-01T02:00:00Z  REJECT  TCP       22
      ...
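Because each segment covers a known time interval, a query with a time filter only has to touch the segments that overlap it (the "query pruning" from slide 12). A hypothetical sketch of that pruning, using ISO-8601 hour strings as partition keys:

```python
# Hypothetical sketch: segments keyed by the hour they cover. A query's time
# filter prunes every segment whose hour falls outside the filter interval.
segments = {
    "2011-01-01T00": "1st hour segment",
    "2011-01-01T01": "2nd hour segment",
    "2011-01-01T02": "3rd hour segment",
}

def prune(segments, start_hour, end_hour):
    """Keep only segments whose hour key falls in [start_hour, end_hour)."""
    return [name for hour, name in sorted(segments.items())
            if start_hour <= hour < end_hour]

# A query over the second and third hours never scans the first segment.
print(prune(segments, "2011-01-01T01", "2011-01-01T03"))
# ['2nd hour segment', '3rd hour segment']
```

ISO-8601 timestamps sort lexicographically, so plain string comparison is enough here.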
  24. Segments
      • Fundamental storage unit in Druid
      • Immutable once created
      • No contention between reads and writes
      • One thread scans one segment
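The "one thread scans one segment" model can be illustrated with a thread pool: each worker scans one immutable segment independently, and the per-segment partial results are merged at the end. This is a toy sketch of the concurrency pattern, not Druid's query engine.

```python
from concurrent.futures import ThreadPoolExecutor

# Immutable segments: each is a list of (action, protocol, flows) rows.
segments = [
    [("ACCEPT", "TCP", 12), ("REJECT", "TCP", 22)],
    [("ACCEPT", "TCP", 12), ("REJECT", "TCP", 22)],
    [("ACCEPT", "TCP", 12), ("REJECT", "UDP", 30)],
]

def scan_segment(segment):
    """One thread scans one segment: sum Flows over the ACCEPT rows."""
    return sum(flows for action, _, flows in segment if action == "ACCEPT")

# Scan segments in parallel, then merge the per-segment partial sums.
with ThreadPoolExecutor() as pool:
    total = sum(pool.map(scan_segment, segments))

print(total)  # 36
```

Because segments are immutable, the scanning threads need no locks, which is what "no contention between reads and writes" buys.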
  25. Columnar storage: compression

      timestamp             Action  Protocol  Flows
      2011-01-01T00:00:00Z  ACCEPT  TCP       10
      2011-01-01T00:03:03Z  ACCEPT  TCP       1
      2011-01-01T00:04:51Z  REJECT  UDP       10
      2011-01-01T00:05:33Z  REJECT  UDP       10
      2011-01-01T00:05:53Z  REJECT  TCP       1
      2011-01-01T00:06:17Z  REJECT  TCP       10

      Create IDs: Accept → 0, Reject → 1; TCP → 0, UDP → 1
      Store: Action → [0 0 1 1 1 1], Protocol → [0 0 1 1 0 0]
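This dictionary encoding is easy to sketch: assign each distinct string an integer ID in order of first appearance, then store the column as the ID sequence. An illustrative sketch, not Druid's actual on-disk format:

```python
def dict_encode(values):
    """Assign each distinct string an integer ID and encode the column."""
    ids = {}       # string -> integer ID, in order of first appearance
    encoded = []   # the column, stored as small integers
    for v in values:
        if v not in ids:
            ids[v] = len(ids)
        encoded.append(ids[v])
    return ids, encoded

actions = ["ACCEPT", "ACCEPT", "REJECT", "REJECT", "REJECT", "REJECT"]
protocols = ["TCP", "TCP", "UDP", "UDP", "TCP", "TCP"]

print(dict_encode(actions))    # ({'ACCEPT': 0, 'REJECT': 1}, [0, 0, 1, 1, 1, 1])
print(dict_encode(protocols))  # ({'TCP': 0, 'UDP': 1}, [0, 0, 1, 1, 0, 0])
```

With only a handful of distinct values, each cell shrinks from a string to a small integer that compresses very well.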
  26. Columnar storage: fast search and filter

      timestamp             Action  Protocol  Flows
      2011-01-01T00:00:00Z  ACCEPT  TCP       10
      2011-01-01T00:03:03Z  ACCEPT  TCP       1
      2011-01-01T00:04:51Z  REJECT  UDP       10
      2011-01-01T00:05:33Z  REJECT  UDP       10
      2011-01-01T00:05:53Z  REJECT  TCP       1
      2011-01-01T00:06:17Z  REJECT  TCP       10

      ACCEPT → rows [0, 1] → bitmap [110000]
      REJECT → rows [2, 3, 4, 5] → bitmap [001111]
      ACCEPT OR REJECT → [111111]
      Compression!
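The bitmap index above can be built from the dictionary-encoded column: one bitmap per distinct value, with bit i set when row i holds that value. OR-ing filters then becomes a bitwise OR. A toy sketch using plain lists of bits (real systems use compressed bitmaps such as Roaring or CONCISE):

```python
def bitmap_index(encoded_column):
    """One bitmap per value ID: bit i is set if row i holds that value."""
    n = len(encoded_column)
    bitmaps = {}
    for row, value_id in enumerate(encoded_column):
        bitmaps.setdefault(value_id, [0] * n)[row] = 1
    return bitmaps

action_column = [0, 0, 1, 1, 1, 1]  # 0 = ACCEPT, 1 = REJECT, as on slide 25
bitmaps = bitmap_index(action_column)

print(bitmaps[0])  # [1, 1, 0, 0, 0, 0]  rows matching ACCEPT
print(bitmaps[1])  # [0, 0, 1, 1, 1, 1]  rows matching REJECT

# "ACCEPT OR REJECT" is a bitwise OR of the two bitmaps: every row matches.
union = [a | b for a, b in zip(bitmaps[0], bitmaps[1])]
print(union)       # [1, 1, 1, 1, 1, 1]
```

Evaluating a filter this way touches only the small bitmaps, never the raw rows, which is what makes search and filter fast.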
  27. Approximate algorithms
      Approximations save storage, memory, and time!
      • Count distinct
      • Ranking (top N)
      • Histograms and quantiles
      • Set operations
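To show the flavor of approximate count-distinct, here is a toy K-Minimum-Values (KMV) sketch: hash every value into [0, 1), keep only the k smallest hashes, and estimate the cardinality from how tightly packed they are. Druid's real sketches (HyperLogLog, theta sketches) are more sophisticated; this is just the underlying idea.

```python
import hashlib

def kmv_estimate(values, k=256):
    """Toy K-Minimum-Values sketch: estimate count-distinct from the k-th
    smallest hash, with hashes mapped uniformly into [0, 1)."""
    hashes = sorted({int(hashlib.sha1(str(v).encode()).hexdigest(), 16) / 2**160
                     for v in values})
    if len(hashes) < k:
        return len(hashes)  # fewer than k distinct hashes: count is exact
    # If n distinct values hash uniformly into [0, 1), the k-th smallest
    # hash lands near k/n, so (k - 1) / hashes[k - 1] estimates n.
    return int((k - 1) / hashes[k - 1])

exact = 10_000
estimate = kmv_estimate(range(exact))
print(exact, estimate)  # the estimate lands within a few percent of 10,000
```

The sketch needs only k hashes of memory no matter how many values stream through, which is the storage and memory saving the slide refers to.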
  28. Architecture
      Streams / Files → Indexers → Segments → Historicals → Brokers → Queries
  29. The mysterious future

  30. Druid roadmap
      • Parallel loading of data files without Hadoop
      • Automatic compaction
      • Smaller, faster compression (FastPFOR, etc.)
      • Subtotals, SQL “grouping sets”
      • SQL standard null handling
      • Vectorized query engine
      • Garbage-free expression engine
      • … your item here!!
  31. Try this at home

  32. Download
      Druid community site (current): http://druid.io/
      Druid community site (new): https://druid.apache.org/
      Imply distribution: https://imply.io/get-started
  33. Contribute: https://github.com/apache/incubator-druid

  34. Stay in touch: @druidio, http://druid.io/community