Druid: Power Applications to Analyze Sensor Data

Druid
December 05, 2015

Presented at Strata Singapore 2015

Transcript

  1. Druid
    Power Applications to Analyze Sensor Data
    Fangjin Yang
    Cofounder @ Imply

  2. Overview
    History & Motivation
    Demo
    Alternative Architectures
    Druid Architecture
    Druid & the Data Space

  3. History & Motivation
    First lines of Druid started in 2011
    Initial problem: visually explore data
    - online advertising data
    - SQL requires expertise
    - Writing queries is time consuming
    - Dynamic visualizations not static reports

  4. History & Motivation
    Druid went open source in late 2012
    - GPL license initially
    - Part-time development until early 2014
    - Apache v2 licensed in early 2015
    Requirements?
    - “Interactive” (sub-second queries)
    - “Real-time” (low latency data ingestion)
    - Scalable (trillions of events/day, petabytes of data)
    - Multi-tenant (thousands of concurrent users)

  5. Demo
    In case the internet didn’t work,
    pretend you saw something cool

  6. Druid Today
    Used in production at many different companies big and small
    Applications have been created for:
    - Ad-tech
    - Network traffic
    - Website traffic
    - Cloud security
    - Operations
    - Activity streams
    - Finance

  7. Powering a Data Application
    Many possible types of applications, let’s focus on BI
    Business intelligence/OLAP queries
    - Time, dimensions, measures
    - Filtering, grouping, and aggregating data
    - Not dumping entire data set
    - Not examining single events
    - Result set < input set (aggregations)
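
The query shape on this slide (filter, group, aggregate) can be sketched in a few lines of Python. This is only an illustration of the pattern, not Druid's implementation; the event fields (`country`, `device`, `clicks`) and the `olap_query` helper are hypothetical names chosen for the example.

```python
from collections import defaultdict

# Hypothetical events: a timestamp, two dimensions, and one measure.
events = [
    {"time": "2015-12-01T00", "country": "SG", "device": "mobile", "clicks": 3},
    {"time": "2015-12-01T00", "country": "SG", "device": "desktop", "clicks": 1},
    {"time": "2015-12-01T01", "country": "US", "device": "mobile", "clicks": 5},
    {"time": "2015-12-01T01", "country": "SG", "device": "mobile", "clicks": 2},
]


def olap_query(rows, where, group_by, measure):
    """Filter rows, group them by one dimension, and sum one measure."""
    totals = defaultdict(int)
    for row in rows:
        # Keep only rows that match every filter condition.
        if all(row[dim] == value for dim, value in where.items()):
            totals[row[group_by]] += row[measure]
    return dict(totals)


result = olap_query(events, where={"device": "mobile"},
                    group_by="country", measure="clicks")
print(result)  # {'SG': 5, 'US': 5}
```

Four input events collapse into two result rows, which is the "result set < input set" property the slide describes.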

  8. Solution Space
    Relational databases (MySQL, Postgres)
    Key/value stores (HBase, Cassandra)
    Column stores

  9. Relational Database
    Traditional Data Warehouse
    - Row store
    - Star schema
    - Aggregate tables & query caches
    Fast becoming outdated
    Slow!

  10. Key/Value Stores
    Pre-computation
    - Pre-compute every possible query
    - Pre-compute a set of queries
    - Exponential scaling costs
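
To see where the exponential cost comes from, one can count the distinct group-by combinations that would need a pre-computed table. The sketch below assumes each query groups by some subset of the dimensions; the dimension names and the `grouping_sets` helper are made up for illustration.

```python
from itertools import combinations


def grouping_sets(dimensions):
    """Return every subset of dimensions that a query could group by."""
    return [
        combo
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


dims = ["country", "device", "browser", "publisher"]
print(len(grouping_sets(dims)))  # 16, i.e. 2 ** 4

# The count doubles with every added dimension.
for n in (10, 20, 30):
    print(n, 2 ** n)
```

This counts only which dimensions are grouped; each table also holds one row per combination of dimension values, so the real storage cost grows faster still.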

  11. Key/Value Stores
    Range scans
    - Primary key: dimensions/attributes
    - Value: measures/metrics (things to aggregate)
    - Still too slow!
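
A rough sketch of the range-scan approach, assuming a composite key of (hour, country, device) kept in sorted order, as an ordered key/value store would keep it. The data and the `range_scan_sum` helper are hypothetical. It shows one plausible reason the approach stays slow: a filter on a non-leading part of the key still has to read every row in the time range.

```python
import bisect

# Rows sorted by composite key (hour, country, device); the value is the measure.
rows = sorted([
    (("2015-12-01T00", "SG", "mobile"), 3),
    (("2015-12-01T00", "US", "desktop"), 4),
    (("2015-12-01T01", "SG", "mobile"), 2),
    (("2015-12-01T01", "US", "mobile"), 5),
    (("2015-12-01T02", "SG", "desktop"), 1),
])
keys = [key for key, _ in rows]


def range_scan_sum(start_hour, end_hour, country):
    """Sum the measure for one country over the hours [start_hour, end_hour).

    Returns (total, rows_scanned).
    """
    # Binary search finds the key range quickly...
    lo = bisect.bisect_left(keys, (start_hour,))
    hi = bisect.bisect_left(keys, (end_hour,))
    scanned = rows[lo:hi]
    # ...but the country filter must still inspect every row in that range.
    total = sum(value for (_, c, _), value in scanned if c == country)
    return total, len(scanned)


print(range_scan_sum("2015-12-01T00", "2015-12-01T02", "SG"))  # (5, 4)
```

Only two of the four scanned rows match the filter; with many dimensions and wide time ranges, most of the scan is wasted work.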

  12. Key/Value Stores

  13. Column stores
    Load/scan exactly what you need for a query
    Different compression algorithms for different columns
    Different indexes for different columns

  14. Druid
    Ideal for powering user-facing analytic applications
    Supports lots of concurrent reads
    Custom column format optimized for event data and BI queries
    Supports extremely fast filters
    Streaming data ingestion

  15. Storage Format

  16. Summarization

  17. Summarization

  18. Immutable Segments
    Fundamental storage unit in Druid
    No contention between reads and writes
    One thread scans one segment
    Multiple threads can access same underlying data
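
The "one thread scans one segment" idea can be sketched with a thread pool over immutable data. This is a structural illustration only, not Druid's code (Druid runs on the JVM); the segment contents and the `scan_segment` and `query` helpers are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical immutable segments: tuples of (page, count) rows that are
# never modified after creation, so readers need no locks.
segments = [
    (("Justin Bieber", 10), ("Ke$ha", 4)),
    (("Justin Bieber", 7), ("Ke$ha", 1)),
    (("Ke$ha", 9),),
]


def scan_segment(segment, page):
    """Scan a single segment and return its partial sum for one page."""
    return sum(count for p, count in segment if p == page)


def query(page):
    """Scan each segment on its own worker thread and merge the partial sums."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = pool.map(lambda seg: scan_segment(seg, page), segments)
        return sum(partials)


print(query("Ke$ha"))          # 14
print(query("Justin Bieber"))  # 17
```

Because the segments never change, several queries can run `query` at the same time against the same underlying data without coordinating with each other.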

  19. Druid Multi-tenancy
    [Diagram labels: Druid Historical; Queries; Processing Order: Segment_query_1, Segment_query_2, Segment_query_1, Segment_query_3, Segment_query_2, Segment_query_1, Segment_query_4]

  20. Columnar Storage
    Create IDs
    ● Justin Bieber -> 0, Ke$ha -> 1
    Store
    ● page → [0 0 0 1 1 1]
    ● language → [0 0 0 0 0 0]
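
The ID assignment on this slide is dictionary encoding. A minimal sketch follows, assuming IDs are handed out in order of first appearance; the `dictionary_encode` helper and the language value `"en"` are assumptions, since the slide shows only the encoded column.

```python
def dictionary_encode(values):
    """Replace each distinct value with a small integer ID.

    Returns (ids, encoded): the value -> ID mapping and the encoded column.
    """
    ids = {}
    encoded = []
    for value in values:
        if value not in ids:
            ids[value] = len(ids)  # next unused ID
        encoded.append(ids[value])
    return ids, encoded


page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
language = ["en"] * 6  # assumed value; every row shares one language

page_ids, page_column = dictionary_encode(page)
_, language_column = dictionary_encode(language)

print(page_ids)         # {'Justin Bieber': 0, 'Ke$ha': 1}
print(page_column)      # [0, 0, 0, 1, 1, 1]
print(language_column)  # [0, 0, 0, 0, 0, 0]
```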

  21. Columnar Storage
    Justin Bieber → [0, 1, 2] → [111000]
    Ke$ha → [3, 4, 5] → [000111]
    Justin Bieber OR Ke$ha → [111111]
    Compression!
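
The bitmaps on this slide can be reproduced with a small sketch: one bitmap per distinct value, with a 1 at each row where the value occurs, so an OR filter becomes a bitwise OR. The helper names are hypothetical, and the run-length encoding at the end is a simple stand-in to show why such bitmaps compress well, not the compression scheme Druid uses.

```python
from itertools import groupby


def build_bitmaps(column):
    """Map each distinct value to a list of 0/1 flags, one flag per row."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps.setdefault(value, [0] * len(column))[row] = 1
    return bitmaps


def bitmap_or(a, b):
    """Rows that match either filter."""
    return [x | y for x, y in zip(a, b)]


def to_text(bitmap):
    return "".join(str(bit) for bit in bitmap)


def run_length_encode(bitmap):
    """Collapse runs of equal bits into (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bitmap)]


page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
bitmaps = build_bitmaps(page)

print(to_text(bitmaps["Justin Bieber"]))  # 111000
print(to_text(bitmaps["Ke$ha"]))          # 000111
print(to_text(bitmap_or(bitmaps["Justin Bieber"], bitmaps["Ke$ha"])))  # 111111
print(run_length_encode(bitmaps["Justin Bieber"]))  # [(1, 3), (0, 3)]
```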

  22. Custom Columns
    Create custom logic
    Create approximate sketches
    - HyperLogLog
    - Approximate Histograms
    - Theta sketches
    Approximate algorithms are very powerful for fast queries
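
To give a feel for how an approximate sketch trades accuracy for speed and memory, here is a k-minimum-values (KMV) distinct counter. KMV is a simple relative of the sketches listed above and is swapped in here because it fits in a few lines; it is not the HyperLogLog or theta sketch code that Druid ships. The class name and the choice of SHA-1 as the hash are assumptions.

```python
import hashlib
import heapq


class KMVSketch:
    """Estimate a distinct count from the k smallest hash values seen."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash is on top
        self._members = set()  # hashes currently kept, to ignore duplicates

    @staticmethod
    def _hash(item):
        """Hash an item to a float in [0, 1)."""
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2 ** 64

    def add(self, item):
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # exact while under capacity
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


sketch = KMVSketch(k=256)
for user_id in range(10_000):
    sketch.add(user_id)
    sketch.add(user_id)  # duplicates do not change the estimate

print(round(sketch.estimate()))
```

The sketch keeps 256 numbers no matter how many items it sees, and its typical error at this size is a few percent.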

  23. Architecture

  24. Architecture (Batch Ingestion)

  25. Architecture (Batch Ingestion)

  26. Real-time Nodes
    Write-optimized data structure: hash map in heap
    Convert write-optimized → read-optimized
    Read-optimized data structure: Druid segments
    Query data immediately
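
The write-optimized to read-optimized conversion might be sketched as follows: events are aggregated into an in-memory hash map that is queryable at once, then periodically frozen into sorted, immutable columns. This is a simplified illustration; the `RealtimeBuffer` class, its method names, and the (hour, page) key are assumptions, not Druid's API.

```python
from collections import defaultdict


class RealtimeBuffer:
    """Write-optimized buffer: a hash map keyed by (hour, page)."""

    def __init__(self):
        self._rows = defaultdict(int)

    def ingest(self, hour, page, count=1):
        # One hash-map update per event keeps writes cheap.
        self._rows[(hour, page)] += count

    def query_total(self, page):
        # Data can be queried as soon as it has been ingested.
        return sum(c for (_, p), c in self._rows.items() if p == page)

    def persist(self):
        """Freeze the buffer into sorted, immutable columns and reset it."""
        ordered = sorted(self._rows.items())
        segment = {
            "hour": tuple(key[0] for key, _ in ordered),
            "page": tuple(key[1] for key, _ in ordered),
            "count": tuple(value for _, value in ordered),
        }
        self._rows.clear()
        return segment


buffer = RealtimeBuffer()
buffer.ingest("2015-12-01T00", "Ke$ha")
buffer.ingest("2015-12-01T00", "Justin Bieber")
buffer.ingest("2015-12-01T00", "Ke$ha")

print(buffer.query_total("Ke$ha"))  # 2
segment = buffer.persist()
print(segment["page"])   # ('Justin Bieber', 'Ke$ha')
print(segment["count"])  # (1, 2)
```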

  27. Architecture (Streaming Ingestion)

  28. Architecture (Lambda)

  29. Querying
    Query libraries:
    - JSON over HTTP
    - SQL
    - R
    - Python
    - Ruby
    Open source UIs
    - Pivot (exploratory analytics)
    - Grafana (dev ops)
    - Panoramix (reporting)
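
As an example of the JSON-over-HTTP interface, the sketch below builds a native `topN` query body. The field names follow Druid's native query format, but the data source (`wikipedia`), the dimension and metric names, the interval, and the broker address are all assumptions for illustration. The code only builds and prints the request body; it does not contact a cluster.

```python
import json

# Assumed broker address; adjust the host and port for a real deployment.
BROKER_URL = "http://localhost:8082/druid/v2/"


def top_pages_query(data_source, interval, limit=5):
    """Build a topN query: the most-edited English pages in an interval."""
    return {
        "queryType": "topN",
        "dataSource": data_source,
        "intervals": [interval],
        "granularity": "all",
        "dimension": "page",   # hypothetical dimension name
        "metric": "edits",     # rank by the aggregate defined below
        "threshold": limit,
        "filter": {"type": "selector", "dimension": "language", "value": "en"},
        "aggregations": [
            {"type": "longSum", "name": "edits", "fieldName": "count"}
        ],
    }


body = json.dumps(top_pages_query("wikipedia", "2015-12-01/2015-12-02"), indent=2)
print(body)
```

Sending `body` in an HTTP POST to the broker with a `Content-Type: application/json` header would run the query; the R, Python, and Ruby libraries listed above wrap this same exchange.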

  30. Druid in Production

  31. Ingestion
    >3M events / second sustained (200B+ events/day)
    10 – 100k events / second / core
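
The figures on this slide can be sanity-checked with a little arithmetic. The sketch reads the per-core range as 10k to 100k events per second, which is an interpretation of the slide, and the core counts it derives are back-of-the-envelope numbers, not figures reported in the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_second = 3_000_000
events_per_day = events_per_second * SECONDS_PER_DAY
print(events_per_day)  # 259200000000, consistent with "200B+ events/day"


def cores_needed(total_rate, per_core_rate):
    """Smallest whole number of cores that can sustain total_rate."""
    return -(-total_rate // per_core_rate)  # ceiling division


# Assumed reading of the slide: 10k to 100k events / second / core.
print(cores_needed(events_per_second, 100_000))  # 30
print(cores_needed(events_per_second, 10_000))   # 300
```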

  32. Volume
    Largest known cluster
    - >500 TB of segments (>50 trillion raw events, >50 PB raw data)
    Extremely cost effective at scale

  33. Queries
    500ms average query latency
    90% < 1s, 95% < 2s, 99% < 10s

  34. Multi-tenancy
    Several hundred queries / second
    Variety of group by & top-K queries

  35. Druid & the Data Space

  36. The Data Space
    Big data is hard! Many specialized solutions for different problems
    Standards are slowly emerging

  37. A Real-time Analytics Stack
    [Diagram components: Events, Message bus, Stream Processor, Batch Processor, Druid, Apps]

  38. Integration
    Druid is complementary to many solutions
    - SQL-on-Hadoop (Hive, Impala, Spark SQL, Drill, Presto)
    - Stream processors (Storm, Spark Streaming, Flink, Samza)
    - Batch processors (Spark, Hadoop, Flink)
    - Message buses (Kafka, RabbitMQ)

  39. Druid Community
    Growing Community
    - 140+ contributors from many different companies
    We love contributions!
    We’re actively seeking committers right now!

  40. Takeaway
    Druid is pretty good for powering applications
    Druid is pretty good at fast OLAP queries
    Druid is pretty good at streaming ingestion
    Druid works well with existing data infrastructure systems

  41. Thanks!
    @implydata
    @druidio
    @fangjin
    imply.io
    druid.io
