Druid: Power Applications to Analyze Sensor Data

Slide 1

Slide 1 text

Druid: Power Applications to Analyze Sensor Data
Fangjin Yang, Cofounder @ Imply

Slide 2

Slide 2 text

Overview
- History & Motivation
- Demo
- Alternative Architectures
- Druid Architecture
- Druid & the Data Space

Slide 3

Slide 3 text

History & Motivation
First lines of Druid started in 2011
Initial problem: visually explore data
- Online advertising data
- SQL requires expertise
- Writing queries is time consuming
- Dynamic visualizations, not static reports

Slide 4

Slide 4 text

History & Motivation
Druid went open source in late 2012
- GPL license initially
- Part-time development until early 2014
- Apache v2 licensed in early 2015
Requirements?
- “Interactive” (sub-second queries)
- “Real-time” (low latency data ingestion)
- Scalable (trillions of events/day, petabytes of data)
- Multi-tenant (thousands of concurrent users)

Slide 5

Slide 5 text

Demo In case the internet didn’t work, pretend you saw something cool

Slide 6

Slide 6 text

Druid Today
Used in production at many different companies, big and small
Applications have been created for:
- Ad-tech
- Network traffic
- Website traffic
- Cloud security
- Operations
- Activity streams
- Finance

Slide 7

Slide 7 text

Powering a Data Application
Many possible types of applications, let’s focus on BI
Business intelligence/OLAP queries
- Time, dimensions, measures
- Filtering, grouping, and aggregating data
- Not dumping entire data set
- Not examining single events
- Result set < input set (aggregations)
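The query shape described on this slide (filter, group, aggregate, with a result set smaller than the input) can be sketched in a few lines. This is only an illustration in plain Python, not how Druid executes queries; the event rows, the `country` and `clicks` fields, and the `olap_query` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical event rows: one time column, two dimensions, one measure.
events = [
    {"time": "2011-01-01T00", "page": "Justin Bieber", "country": "US", "clicks": 3},
    {"time": "2011-01-01T00", "page": "Justin Bieber", "country": "US", "clicks": 2},
    {"time": "2011-01-01T01", "page": "Ke$ha", "country": "US", "clicks": 5},
    {"time": "2011-01-01T01", "page": "Ke$ha", "country": "CA", "clicks": 1},
    {"time": "2011-01-01T02", "page": "Justin Bieber", "country": "CA", "clicks": 4},
]

def olap_query(rows, where, group_by, measure):
    """Filter rows on exact dimension matches, group them, and sum one measure."""
    totals = defaultdict(int)
    for row in rows:
        if all(row[dim] == value for dim, value in where.items()):
            key = tuple(row[dim] for dim in group_by)
            totals[key] += row[measure]
    return dict(totals)

# "Clicks per page for US traffic": two result rows from five input rows.
result = olap_query(events, where={"country": "US"}, group_by=["page"], measure="clicks")
```

The result has one row per group rather than one row per event, which is the "result set < input set" property the slide calls out.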

Slide 8

Slide 8 text

Solution Space
- Relational databases (MySQL, Postgres)
- Key/value stores (HBase, Cassandra)
- Column stores

Slide 9

Slide 9 text

Relational Database
Traditional Data Warehouse
- Row store
- Star schema
- Aggregate tables & query caches
Fast becoming outdated
Slow!

Slide 10

Slide 10 text

Key/Value Stores
Pre-computation
- Pre-compute every possible query
- Pre-compute a set of queries
- Exponential scaling costs
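One way to read "exponential scaling costs": if every possible grouping of dimensions must be pre-computed, the number of groupings doubles with each added dimension. The sketch below only counts dimension subsets (it ignores filter values, which would make things worse); the dimension names and the `dimension_subsets` helper are hypothetical.

```python
from itertools import combinations

def dimension_subsets(dimensions):
    """Return every subset of dimensions that a query could group by."""
    return [
        subset
        for size in range(len(dimensions) + 1)
        for subset in combinations(dimensions, size)
    ]

# Three dimensions already give 2**3 = 8 groupings to pre-compute.
three_dims = dimension_subsets(["page", "language", "country"])

# The count doubles with each extra dimension: 2**n for n dimensions.
counts = {n: 2 ** n for n in (3, 10, 20)}
```

At 20 dimensions that is already more than a million groupings, before counting the distinct values inside each one.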

Slide 11

Slide 11 text

Key/Value Stores
Range scans
- Primary key: dimensions/attributes
- Value: measures/metrics (things to aggregate)
- Still too slow!

Slide 12

Slide 12 text

Key/Value Stores

Slide 13

Slide 13 text

Column stores
- Load/scan exactly what you need for a query
- Different compression algorithms for different columns
- Different indexes for different columns

Slide 14

Slide 14 text

Druid

Slide 15

Slide 15 text

Druid
- Ideal for powering user-facing analytic applications
- Supports lots of concurrent reads
- Custom column format optimized for event data and BI queries
- Supports extremely fast filters
- Streaming data ingestion

Slide 16

Slide 16 text

Storage Format

Slide 17

Slide 17 text

Raw data

Slide 18

Slide 18 text

Summarization

Slide 19

Slide 19 text

Summarization

Slide 20

Slide 20 text

Immutable Segments
- Fundamental storage unit in Druid
- No contention between reads and writes
- One thread scans one segment
- Multiple threads can access same underlying data

Slide 21

Slide 21 text

Druid Multi-tenancy
[Diagram labels: Druid Historical; Queries; Processing Order: Segment_query_1, Segment_query_2, Segment_query_1, Segment_query_3, Segment_query_2, Segment_query_1, Segment_query_4]

Slide 22

Slide 22 text

Columnar Storage
Create IDs
- Justin Bieber → 0, Ke$ha → 1
Store
- page → [0 0 0 1 1 1]
- language → [0 0 0 0 0 0]
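The ID assignment on this slide is dictionary encoding. A minimal sketch follows, assuming IDs are handed out in order of first appearance (the slide does not say how Druid orders them) and assuming a single language value `"en"`, which the slide does not name; `dictionary_encode` is a hypothetical helper.

```python
def dictionary_encode(values):
    """Replace each string with a small integer ID; return (dictionary, encoded column)."""
    ids = {}
    encoded = []
    for value in values:
        if value not in ids:
            ids[value] = len(ids)  # assumption: next free ID, in first-seen order
        encoded.append(ids[value])
    return ids, encoded

# Six rows, matching the slide: three Justin Bieber rows, then three Ke$ha rows.
page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
language = ["en"] * 6  # assumed value; the slide only shows the encoded zeros

page_ids, page_column = dictionary_encode(page)
language_ids, language_column = dictionary_encode(language)
```

The encoded columns hold small integers instead of repeated strings, which is what makes the later per-column compression and indexing cheap.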

Slide 23

Slide 23 text

Columnar Storage
- Justin Bieber → [0, 1, 2] → [111000]
- Ke$ha → [3, 4, 5] → [000111]
- Justin Bieber OR Ke$ha → [111111]
Compression!
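The row lists and bit strings on this slide form a bitmap index: one bit per row for each distinct value, so an OR filter becomes a bitwise OR. The sketch below uses uncompressed Python lists for clarity, whereas a real system would store compressed bitmaps; the helper names are hypothetical.

```python
def build_bitmaps(column):
    """Map each distinct value to a 0/1 list with one bit per row."""
    bitmaps = {}
    for row, value in enumerate(column):
        bits = bitmaps.setdefault(value, [0] * len(column))
        bits[row] = 1
    return bitmaps

def bitmap_or(a, b):
    """Rows matching either filter: element-wise OR of two bitmaps."""
    return [x | y for x, y in zip(a, b)]

def to_bit_string(bits):
    """Render a bitmap the way the slide writes it, e.g. 111000."""
    return "".join(str(bit) for bit in bits)

page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
bitmaps = build_bitmaps(page)
either = bitmap_or(bitmaps["Justin Bieber"], bitmaps["Ke$ha"])
```

Filtering never touches the row data itself: the matching row numbers fall out of the combined bitmap.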

Slide 24

Slide 24 text

Custom Columns
Create custom logic
Create approximate sketches
- HyperLogLog
- Approximate histograms
- Theta sketches
Approximate algorithms are very powerful for fast queries
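To give a feel for why approximate sketches are fast, here is a k-minimum-values (KMV) distinct-count estimator. It is a simpler technique swapped in for illustration, not HyperLogLog and not Druid's implementation: it keeps only the `k` smallest hash values seen and estimates the distinct count from how tightly they are packed. The names `hash01` and `kmv_estimate` and the `user-…` IDs are hypothetical.

```python
import hashlib

def hash01(value):
    """Hash a value to a float in [0, 1) using the first 8 bytes of SHA-1."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

def kmv_estimate(values, k=256):
    """Estimate the number of distinct values while storing at most k hashes."""
    smallest = set()
    for value in values:
        smallest.add(hash01(value))
        if len(smallest) > k:
            smallest.remove(max(smallest))  # keep only the k smallest hashes
    if len(smallest) < k:
        return float(len(smallest))  # fewer than k distinct values: count is exact
    # If the k-th smallest hash is h, roughly (k - 1) / h distinct values exist.
    return (k - 1) / max(smallest)

# 30,000 events from 10,000 distinct (hypothetical) user IDs.
estimate = kmv_estimate((f"user-{i % 10000}" for i in range(30000)), k=512)
small = kmv_estimate(["a", "b", "a", "c"], k=512)
```

Memory stays bounded by `k` no matter how many events are scanned, at the cost of a relative error on the order of `1 / sqrt(k)`.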

Slide 25

Slide 25 text

Architecture

Slide 26

Slide 26 text

Architecture (Batch Ingestion)

Slide 27

Slide 27 text

Architecture (Batch Ingestion)

Slide 28

Slide 28 text

Real-time Nodes
- Write-optimized data structure: hash map in heap
- Convert write optimized -> read optimized
- Read-optimized data structure: Druid segments
- Query data immediately
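The write-optimized to read-optimized conversion can be sketched as an in-memory hash map that is periodically frozen into sorted, immutable columns. This is a toy model under stated assumptions, not Druid's actual code: the `RealtimeBuffer` class, its `(timestamp, page)` key layout, and the column names are all hypothetical.

```python
class RealtimeBuffer:
    """Toy write-optimized buffer: a hash map keyed by (timestamp, page)."""

    def __init__(self):
        self._rows = {}

    def add(self, timestamp, page, count=1):
        """Ingest one event; a dictionary update keeps writes cheap."""
        key = (timestamp, page)
        self._rows[key] = self._rows.get(key, 0) + count

    def count_for(self, page):
        """Query data immediately, before it has been persisted."""
        return sum(c for (_, p), c in self._rows.items() if p == page)

    def persist(self):
        """Freeze the buffer into a read-optimized form: sorted, immutable columns."""
        keys = sorted(self._rows)
        segment = {
            "time": tuple(key[0] for key in keys),
            "page": tuple(key[1] for key in keys),
            "count": tuple(self._rows[key] for key in keys),
        }
        self._rows = {}  # start a fresh buffer for new writes
        return segment

buffer = RealtimeBuffer()
buffer.add("2011-01-01T00", "Ke$ha", 2)
buffer.add("2011-01-01T00", "Justin Bieber", 1)
buffer.add("2011-01-01T00", "Ke$ha", 3)
queried_before_persist = buffer.count_for("Ke$ha")
segment = buffer.persist()
```

Because the persisted segment is made of tuples and never modified, readers can scan it without coordinating with the writer, which matches the "no contention between reads and writes" point made for immutable segments.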

Slide 29

Slide 29 text

Architecture (Streaming Ingestion)

Slide 30

Slide 30 text

Architecture (Lambda)

Slide 31

Slide 31 text

Querying
Query libraries:
- JSON over HTTP
- SQL
- R
- Python
- Ruby
Open source UIs
- Pivot (exploratory analytics)
- Grafana (dev ops)
- Panoramix (reporting)
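For the "JSON over HTTP" option, a query is a JSON document sent to the cluster. The sketch below only builds the request body for a native timeseries query and does not send it. The data source `sensor_events`, the `sensor_type` dimension, the `reading` metric, and the interval are hypothetical, and the endpoint in the comment is an assumption about a typical deployment.

```python
import json

# Hypothetical data source, dimension, and metric names throughout.
query = {
    "queryType": "timeseries",
    "dataSource": "sensor_events",
    "granularity": "hour",
    "intervals": ["2016-01-01T00:00:00Z/2016-01-02T00:00:00Z"],
    "filter": {"type": "selector", "dimension": "sensor_type", "value": "temperature"},
    "aggregations": [
        {"type": "count", "name": "rows"},
        {"type": "doubleSum", "name": "total_reading", "fieldName": "reading"},
    ],
}

# The body would be POSTed with Content-Type: application/json to a broker,
# assumed here to listen at http://<broker-host>:<port>/druid/v2/
body = json.dumps(query, indent=2)
```

The same time, filter, and aggregation structure is what the higher-level client libraries and UIs listed on the slide generate on the user's behalf.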

Slide 32

Slide 32 text

Druid in Production

Slide 33

Slide 33 text

Ingestion
- >3M events / second sustained (200B+ events/day)
- 10k – 100k events / second / core
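A quick back-of-the-envelope check of these figures, assuming exactly 3M events per second for a full day and reading the per-core range as 10k to 100k events per second (both are interpretations of the slide, not additional measurements):

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_second = 3_000_000  # the slide says ">3M", so this is a lower bound
events_per_day = events_per_second * SECONDS_PER_DAY

# Cores needed at each end of the assumed 10k-100k events/second/core range.
cores_at_10k = events_per_second // 10_000
cores_at_100k = events_per_second // 100_000
```

Three million events per second works out to about 259 billion events per day, consistent with the "200B+" figure, and implies somewhere between 30 and 300 ingestion cores under these assumptions.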

Slide 34

Slide 34 text

Volume
Largest known cluster
- >500 TB of segments (>50 trillion raw events, >50 PB raw data)
Extremely cost effective at scale
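The three numbers on this slide imply some rough ratios. All three are lower bounds (">"), so the ratios below are only indicative; decimal units (1 TB = 10**12 bytes, 1 PB = 10**15 bytes) are an assumption.

```python
raw_events = 50 * 10 ** 12      # >50 trillion raw events
raw_bytes = 50 * 10 ** 15       # >50 PB raw data
segment_bytes = 500 * 10 ** 12  # >500 TB of segments

raw_bytes_per_event = raw_bytes // raw_events          # average raw event size
size_reduction = raw_bytes // segment_bytes            # raw data vs. stored segments
segment_bytes_per_event = segment_bytes // raw_events  # stored bytes per raw event
```

Taken at face value, that is roughly 1 KB per raw event shrinking to about 10 bytes of segment storage per event, a reduction of around 100x.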

Slide 35

Slide 35 text

Queries
- 500ms average query latency
- 90% < 1s, 95% < 2s, 99% < 10s

Slide 36

Slide 36 text

Multi-tenancy
- Several hundred queries / second
- Variety of group by & top-K queries

Slide 37

Slide 37 text

Druid & the Data Space

Slide 38

Slide 38 text

The Data Space
- Big data is hard!
- Many specialized solutions for different problems
- Standards are slowly emerging

Slide 39

Slide 39 text

A Real-time Analytics Stack
[Diagram labels: Druid, Stream Processor, Batch Processor, Message bus, Events, Apps]

Slide 40

Slide 40 text

Integration
Druid is complementary to many solutions
- SQL-on-Hadoop (Hive, Impala, Spark SQL, Drill, Presto)
- Stream processors (Storm, Spark Streaming, Flink, Samza)
- Batch processors (Spark, Hadoop, Flink)
- Message buses (Kafka, RabbitMQ)

Slide 41

Slide 41 text

Druid Community
Growing community
- 140+ contributors from many different companies
We love contributions!
We’re actively seeking committers right now!

Slide 42

Slide 42 text

Takeaway
- Druid is pretty good for powering applications
- Druid is pretty good at fast OLAP queries
- Druid is pretty good at streaming ingestion
- Druid works well with existing data infrastructure systems

Slide 43

Slide 43 text

Thanks!
- @implydata, @druidio, @fangjin
- imply.io, druid.io