Druid: A High Performance, Column-oriented, Distributed Data Store

Imply
October 07, 2016


Transcript

  1. History & Motivation

    First lines of Druid written in 2011
    Initial use case: power ad-tech analytics product
    Requirements:
    - Scalable (trillions of events/day, petabytes of data)
    - Multi-tenant (thousands of concurrent users)
    - Interactive (low latency queries)
    - “Real-time” (low latency data ingestion)
  2. History & Motivation

    Druid went open source in late 2012
    - GPL license initially
    - Part-time development until early 2014
    Community growth
    - Apache v2 licensed in early 2015
    - 150+ contributors from 100+ organizations
    In production at many different companies across many verticals
    - Ad-tech, network traffic, security, finance, gaming, operations, activity streams, etc.
  3. Use Cases

    Powering user-facing analytic applications
    Unify historical and real-time events
    Business intelligence/OLAP queries (slice and dice and drill into data)
    Behavioral analysis (measuring distinct counts, retention analysis, funnel analysis, A/B testing)
    Exploratory analytics/root cause analysis
  4. Business Intelligence Queries

    Event data
    - Time, dimensions (attributes), measures
    Business intelligence/OLAP queries
    - “How much revenue did product X generate last quarter in SF?”
    - “How many of my users that visited last week returned this week?”
    - Not dumping entire data set
    - Not examining single events
    - Filtering, grouping, and aggregating data
    - Result set < input set (aggregations)
  5. Relational Database

    Traditional Data Warehouse
    - Row oriented
    - Star schema
    - Aggregate tables & query caches
    Fast becoming outdated
    Slow!
  6. Key/Value Stores

    Range scans
    - Primary key: dimensions/attributes
    - Value: measures/metrics (things to aggregate)
    - Still too slow!
  7. Column stores

    Load/scan exactly what you need for a query
    Different compression algorithms for different columns
    - Encoding for string columns
    - Compression for measure columns
    Different indexes for different columns
  8. Druid

    Custom column format optimized for event data and BI queries
    Supports lots of concurrent reads
    Streaming data ingestion
    Supports extremely fast filters
    Ideal for powering user-facing analytic applications
  9. Raw data

    timestamp             domain      gender  clicked
    2011-01-01T00:01:35Z  bieber.com  Female  1
    2011-01-01T00:03:03Z  bieber.com  Female  0
    2011-01-01T00:04:51Z  ultra.com   Male    1
    2011-01-01T00:05:33Z  ultra.com   Male    1
    2011-01-01T00:05:53Z  ultra.com   Female  0
    2011-01-01T00:06:17Z  ultra.com   Female  1
    2011-01-01T00:23:15Z  bieber.com  Female  0
    2011-01-01T00:38:51Z  ultra.com   Male    1
    2011-01-01T00:49:33Z  bieber.com  Female  1
    2011-01-01T00:49:53Z  ultra.com   Female  0
  10. Summarization

    timestamp             domain      gender  clicked
    2011-01-01T00:00:00Z  bieber.com  Female  1
    2011-01-01T00:00:00Z  ultra.com   Female  2
    2011-01-01T00:00:00Z  ultra.com   Male    3

    timestamp             domain      gender  clicked
    2011-01-01T00:01:35Z  bieber.com  Female  1
    2011-01-01T00:03:03Z  bieber.com  Female  0
    2011-01-01T00:04:51Z  ultra.com   Male    1
    2011-01-01T00:05:33Z  ultra.com   Male    1
    2011-01-01T00:05:53Z  ultra.com   Female  0
    2011-01-01T00:06:17Z  ultra.com   Female  1
    2011-01-01T00:23:15Z  bieber.com  Female  0
    2011-01-01T00:38:51Z  ultra.com   Male    1
    2011-01-01T00:49:33Z  bieber.com  Female  1
    2011-01-01T00:49:53Z  ultra.com   Female  0
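The summarization step on this slide can be sketched as a roll-up: truncate each timestamp to the hour, group by the remaining dimensions, and sum the measure. This is a minimal illustration in plain Python, not Druid's ingestion code; the function names `truncate_to_hour` and `rollup` are made up for the example. Note that summing the raw rows exactly as transcribed gives 2 for bieber.com/Female and 1 for ultra.com/Female, which differs from the slide's summary table in those two rows; the sketch follows the raw rows.

```python
from collections import defaultdict

# Raw events as listed on the slide: (timestamp, domain, gender, clicked).
RAW_EVENTS = [
    ("2011-01-01T00:01:35Z", "bieber.com", "Female", 1),
    ("2011-01-01T00:03:03Z", "bieber.com", "Female", 0),
    ("2011-01-01T00:04:51Z", "ultra.com", "Male", 1),
    ("2011-01-01T00:05:33Z", "ultra.com", "Male", 1),
    ("2011-01-01T00:05:53Z", "ultra.com", "Female", 0),
    ("2011-01-01T00:06:17Z", "ultra.com", "Female", 1),
    ("2011-01-01T00:23:15Z", "bieber.com", "Female", 0),
    ("2011-01-01T00:38:51Z", "ultra.com", "Male", 1),
    ("2011-01-01T00:49:33Z", "bieber.com", "Female", 1),
    ("2011-01-01T00:49:53Z", "ultra.com", "Female", 0),
]


def truncate_to_hour(ts: str) -> str:
    """Floor a timestamp like 2011-01-01T00:01:35Z to the start of its hour."""
    return ts[:13] + ":00:00Z"


def rollup(events):
    """Group by (hour, domain, gender) and sum the clicked measure."""
    totals = defaultdict(int)
    for ts, domain, gender, clicked in events:
        totals[(truncate_to_hour(ts), domain, gender)] += clicked
    return dict(sorted(totals.items()))


summary = rollup(RAW_EVENTS)
for (hour, domain, gender), clicked in summary.items():
    print(hour, domain, gender, clicked)
```

Ten raw rows collapse into three stored rows, which is where the storage and scan savings come from when queries only need aggregates.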
  11. Columnar Storage

    Create IDs
    • Justin Bieber → 0, Ke$ha → 1
    Store
    • page → [0 0 0 1 1 1]
    • language → [0 0 0 0 0 0]
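The ID assignment on this slide is dictionary encoding: each distinct string gets a small integer, and the column stores only the integers. The sketch below is a simplified illustration. The helper `dictionary_encode` is hypothetical, IDs are assigned in order of first appearance, and the language value `"en"` is an assumption, since the slide shows only that every row shares one language.

```python
def dictionary_encode(values):
    """Map each distinct value to an integer ID and encode the column with it."""
    ids = {}
    encoded = []
    for value in values:
        if value not in ids:
            ids[value] = len(ids)  # next unused ID
        encoded.append(ids[value])
    return ids, encoded


page_column = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
language_column = ["en"] * 6  # assumed value; the slide only shows ID 0 for every row

page_ids, page_encoded = dictionary_encode(page_column)
language_ids, language_encoded = dictionary_encode(language_column)

print(page_ids)          # dictionary: string -> ID
print(page_encoded)      # encoded page column
print(language_encoded)  # encoded language column
```

Long, repetitive strings are stored once in the dictionary, and the integer column that remains is small and compresses well.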
  12. Columnar Storage

    Justin Bieber → [0, 1, 2] → [111000]
    Ke$ha → [3, 4, 5] → [000111]
    Justin Bieber OR Ke$ha → [111111]
    Compression!
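The row lists and bit strings on this slide describe a bitmap (inverted) index: one bitmap per distinct value, with bit i set when row i holds that value. A filter such as "Justin Bieber OR Ke$ha" then becomes a bitwise OR. The sketch below uses plain Python integers as uncompressed bitmaps; the helper names are hypothetical, and a production system would use a compressed bitmap format instead.

```python
def build_bitmaps(column):
    """Return {value: int}, where bit `row` of the int is set if column[row] == value."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps


def to_bit_string(bitmap, n_rows):
    """Render a bitmap with row 0 as the leftmost character, as on the slide."""
    return "".join("1" if (bitmap >> row) & 1 else "0" for row in range(n_rows))


def matching_rows(bitmap, n_rows):
    """List the row numbers whose bit is set."""
    return [row for row in range(n_rows) if (bitmap >> row) & 1]


page_column = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
bitmaps = build_bitmaps(page_column)
either = bitmaps["Justin Bieber"] | bitmaps["Ke$ha"]  # OR filter

print(to_bit_string(bitmaps["Justin Bieber"], 6))
print(to_bit_string(bitmaps["Ke$ha"], 6))
print(to_bit_string(either, 6))
```

Boolean filters reduce to bitwise operations over the bitmaps, so only rows whose bits survive the filter need to be read from the measure columns.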
  13. Plugin Architecture

    Write your own plugins for different computations and components
    Often used for approximate algorithms
    - Count distinct (HyperLogLog)
    - Approximate Histograms
    - Funnel/behavioral analysis (theta sketches)
    Approximate algorithms are very powerful for fast queries
  14. Approximate Algorithms

    timestamp             domain      gender  clicked
    2011-01-01T00:00:00Z  bieber.com  Female  1
    2011-01-01T00:00:00Z  ultra.com   Female  2
    2011-01-01T00:00:00Z  ultra.com   Male    3

    timestamp             domain      gender  clicked
    2011-01-01T00:01:35Z  bieber.com  Female  1
    2011-01-01T00:03:03Z  bieber.com  Female  0
    2011-01-01T00:04:51Z  ultra.com   Male    1
    2011-01-01T00:05:33Z  ultra.com   Male    1
    2011-01-01T00:05:53Z  ultra.com   Female  0
    2011-01-01T00:06:17Z  ultra.com   Female  1
    2011-01-01T00:23:15Z  bieber.com  Female  0
    2011-01-01T00:38:51Z  ultra.com   Male    1
    2011-01-01T00:49:33Z  bieber.com  Female  1
    2011-01-01T00:49:53Z  ultra.com   Female  0
  15. Approximate Algorithms

    timestamp             domain      user        gender  clicked
    2011-01-01T00:01:35Z  bieber.com  4312345532  Female  1
    2011-01-01T00:03:03Z  bieber.com  3484920241  Female  0
    2011-01-01T00:04:51Z  ultra.com   9530174728  Male    1
    2011-01-01T00:05:33Z  ultra.com   4098310573  Male    1
    2011-01-01T00:05:53Z  ultra.com   5832058870  Female  0
    2011-01-01T00:06:17Z  ultra.com   5789283478  Female  1
    2011-01-01T00:23:15Z  bieber.com  4730093842  Female  0
    2011-01-01T00:38:51Z  ultra.com   9530174728  Male    1
    2011-01-01T00:49:33Z  bieber.com  4930097162  Female  1
    2011-01-01T00:49:53Z  ultra.com   3081837193  Female  0
  16. Approximate Algorithms

    timestamp             domain      user        gender  clicked
    2011-01-01T00:01:35Z  bieber.com  4312345532  Female  1
    2011-01-01T00:03:03Z  bieber.com  3484920241  Female  0
    2011-01-01T00:04:51Z  ultra.com   9530174728  Male    1
    2011-01-01T00:05:33Z  ultra.com   4098310573  Male    1
    2011-01-01T00:05:53Z  ultra.com   5832058870  Female  0
    2011-01-01T00:06:17Z  ultra.com   5789283478  Female  1
    2011-01-01T00:23:15Z  bieber.com  4730093842  Female  0
    2011-01-01T00:38:51Z  ultra.com   9530174728  Male    1
    2011-01-01T00:49:33Z  bieber.com  4930097162  Female  1
    2011-01-01T00:49:53Z  ultra.com   3081837193  Female  0

    timestamp             domain      gender  clicked  users
    2011-01-01T00:00:00Z  bieber.com  Female  1        {sketch_data_structure}
    2011-01-01T00:00:00Z  ultra.com   Female  2        {sketch_data_structure}
    2011-01-01T00:00:00Z  ultra.com   Male    3        {sketch_data_structure}
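Why the rolled-up rows carry a sketch rather than a count can be shown with a stand-in: distinct-user counts cannot be added across rows, but a mergeable structure can be unioned at query time. The sketch below deliberately swaps in an exact Python `set` for the `{sketch_data_structure}` column; a real deployment would use a fixed-size approximate sketch such as HyperLogLog or a theta sketch. The function names are hypothetical.

```python
from collections import defaultdict

# Events as listed on the slide: (timestamp, domain, user, gender, clicked).
EVENTS = [
    ("2011-01-01T00:01:35Z", "bieber.com", 4312345532, "Female", 1),
    ("2011-01-01T00:03:03Z", "bieber.com", 3484920241, "Female", 0),
    ("2011-01-01T00:04:51Z", "ultra.com", 9530174728, "Male", 1),
    ("2011-01-01T00:05:33Z", "ultra.com", 4098310573, "Male", 1),
    ("2011-01-01T00:05:53Z", "ultra.com", 5832058870, "Female", 0),
    ("2011-01-01T00:06:17Z", "ultra.com", 5789283478, "Female", 1),
    ("2011-01-01T00:23:15Z", "bieber.com", 4730093842, "Female", 0),
    ("2011-01-01T00:38:51Z", "ultra.com", 9530174728, "Male", 1),
    ("2011-01-01T00:49:33Z", "bieber.com", 4930097162, "Female", 1),
    ("2011-01-01T00:49:53Z", "ultra.com", 3081837193, "Female", 0),
]


def rollup_with_users(events):
    """Roll up to (hour, domain, gender), keeping a mergeable user set per row."""
    rows = defaultdict(lambda: {"clicked": 0, "users": set()})
    for ts, domain, user, gender, clicked in events:
        key = (ts[:13] + ":00:00Z", domain, gender)
        rows[key]["clicked"] += clicked
        rows[key]["users"].add(user)  # exact set standing in for a sketch
    return dict(rows)


def distinct_users(rows, domain):
    """Union the per-row user sets for one domain and count the result."""
    merged = set()
    for (_, row_domain, _), row in rows.items():
        if row_domain == domain:
            merged |= row["users"]
    return len(merged)


rolled = rollup_with_users(EVENTS)
print(distinct_users(rolled, "ultra.com"))
print(distinct_users(rolled, "bieber.com"))
```

User 9530174728 appears twice in the ultra.com/Male rows, so counting rows would overstate the audience; merging the per-row structures avoids that. An approximate sketch trades a small error for bounded memory per row.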
  17. Real-time Nodes

    Write-optimized data structure: hash map in heap
    Convert write optimized -> read optimized
    Read-optimized data structure: Druid segments
    Query data immediately
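The write-optimized to read-optimized hand-off on this slide can be illustrated with a toy buffer: events land in an in-memory hash map, are periodically converted to an immutable column-oriented snapshot, and queries consult both. This is a loose sketch under stated assumptions, not Druid's real-time node implementation; the class `RealtimeBuffer` and its methods are hypothetical, and the "segment" here is just a dict of tuples.

```python
class RealtimeBuffer:
    """Toy model: a mutable hash map for writes, immutable columnar snapshots for reads."""

    def __init__(self):
        self._heap_index = {}  # write-optimized: (hour, domain, gender) -> clicked
        self._segments = []    # read-optimized: one dict of column tuples per persist

    def ingest(self, hour, domain, gender, clicked):
        """Aggregate an event into the in-memory hash map."""
        key = (hour, domain, gender)
        self._heap_index[key] = self._heap_index.get(key, 0) + clicked

    def persist(self):
        """Convert the hash map into a sorted, immutable, column-oriented snapshot."""
        keys = sorted(self._heap_index)
        self._segments.append({
            "hour": tuple(k[0] for k in keys),
            "domain": tuple(k[1] for k in keys),
            "gender": tuple(k[2] for k in keys),
            "clicked": tuple(self._heap_index[k] for k in keys),
        })
        self._heap_index = {}

    def total_clicks(self, domain):
        """Answer a query from both the in-memory map and the persisted snapshots."""
        total = sum(c for (_, d, _), c in self._heap_index.items() if d == domain)
        for segment in self._segments:
            total += sum(
                c for d, c in zip(segment["domain"], segment["clicked"]) if d == domain
            )
        return total


buffer = RealtimeBuffer()
buffer.ingest("2011-01-01T00:00:00Z", "ultra.com", "Male", 1)
buffer.ingest("2011-01-01T00:00:00Z", "ultra.com", "Male", 1)
before_persist = buffer.total_clicks("ultra.com")  # visible before any persist
buffer.persist()
buffer.ingest("2011-01-01T00:00:00Z", "ultra.com", "Female", 1)
after_persist = buffer.total_clicks("ultra.com")   # spans snapshot and hash map
print(before_persist, after_persist)
```

Because queries read the hash map as well as the snapshots, newly ingested events are visible without waiting for the conversion step.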
  18. Querying

    Query libraries:
    - JSON over HTTP
    - SQL
    - R
    - Python
    - Ruby
    Open source UIs
    - Pivot
    - Grafana
    - Caravel
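As an illustration of the JSON-over-HTTP option, the sketch below builds a native timeseries query and wraps it in an HTTP POST request without sending it. The data source name `clicks`, the broker address `localhost:8082`, and the choice of filter and aggregator are assumptions made for the example, not details from the talk; check them against the query documentation of the Druid version in use.

```python
import json
import urllib.request

# Hypothetical broker endpoint; adjust host and port for a real cluster.
BROKER_URL = "http://localhost:8082/druid/v2/"

# Hourly click totals for one domain over one day.
query = {
    "queryType": "timeseries",
    "dataSource": "clicks",  # hypothetical data source name
    "granularity": "hour",
    "intervals": ["2011-01-01T00:00:00Z/2011-01-02T00:00:00Z"],
    "filter": {"type": "selector", "dimension": "domain", "value": "bieber.com"},
    "aggregations": [
        {"type": "longSum", "name": "clicked", "fieldName": "clicked"}
    ],
}

request = urllib.request.Request(
    BROKER_URL,
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending is left commented out so the sketch runs without a cluster:
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
print(json.dumps(query, indent=2))
```

The filter and the aggregation correspond to the filtering and aggregating steps of the BI queries described earlier in the deck, and the result set is one row per hour rather than one row per event.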
  19. Volume

    Largest known cluster
    - >500 TB of segments (>50 trillion raw events, >50 PB raw data)
    Extremely cost effective at scale
  20. Integration

    Druid is complementary to many solutions
    - SQL-on-Hadoop (Hive, Impala, Spark SQL, Drill, Presto)
    - Stream processors (Storm, Spark Streaming, Flink, Samza)
    - Batch processors (Spark, Hadoop, Flink)
    - Message buses (Kafka, RabbitMQ)
  21. Takeaway

    Druid is pretty good for analytic applications
    Druid is pretty good at fast OLAP queries
    Druid is pretty good at streaming ingestion
    Druid works well with existing data infrastructure systems