Druid: Power Applications to Analyze Sensor Data

Druid
December 05, 2015

Presented at Strata Singapore 2015

Transcript

  1. Druid
    Power Applications to Analyze Sensor Data
    Fangjin Yang
    Cofounder @ Imply

  2. Overview
    History & Motivation
    Demo
    Alternative Architectures
    Druid Architecture
    Druid & the Data Space

  3. History & Motivation
    First lines of Druid started in 2011
    Initial problem: visually explore data
    - online advertising data
    - SQL requires expertise
    - Writing queries is time consuming
    - Dynamic visualizations not static reports

  4. History & Motivation
    Druid went open source in late 2012
    - GPL license initially
    - Part-time development until early 2014
    - Apache v2 licensed in early 2015
    Requirements?
    - “Interactive” (sub-second queries)
    - “Real-time” (low latency data ingestion)
    - Scalable (trillions of events/day, petabytes of data)
    - Multi-tenant (thousands of concurrent users)

  5. Demo
    In case the internet didn’t work,
    pretend you saw something cool

  6. Druid Today
    Used in production at many different companies big and small
    Applications have been created for:
    - Ad-tech
    - Network traffic
    - Website traffic
    - Cloud security
    - Operations
    - Activity streams
    - Finance

  7. Powering a Data Application
    Many possible types of applications, let’s focus on BI
    Business intelligence/OLAP queries
    - Time, dimensions, measures
    - Filtering, grouping, and aggregating data
    - Not dumping entire data set
    - Not examining single events
    - Result set < input set (aggregations)
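
The query shape described on this slide (filter, group, aggregate, with a result smaller than the input) can be sketched in a few lines of Python. The event fields (`hour`, `country`, `device`, `clicks`) and the helper `olap_query` are made up for illustration; this is not Druid's query engine.

```python
from collections import defaultdict

# Hypothetical events: a time column, two dimensions, and one measure.
EVENTS = [
    {"hour": "2015-12-01T00", "country": "SG", "device": "mobile", "clicks": 3},
    {"hour": "2015-12-01T00", "country": "SG", "device": "desktop", "clicks": 1},
    {"hour": "2015-12-01T00", "country": "US", "device": "mobile", "clicks": 5},
    {"hour": "2015-12-01T01", "country": "SG", "device": "mobile", "clicks": 2},
    {"hour": "2015-12-01T01", "country": "US", "device": "mobile", "clicks": 4},
]


def olap_query(events, where, group_by, measure):
    """Filter events, group them by some dimensions, and sum one measure."""
    totals = defaultdict(int)
    for event in events:
        # Filtering: keep only events matching every (dimension, value) pair.
        if all(event[dim] == value for dim, value in where.items()):
            # Grouping: the key is the tuple of requested dimension values.
            key = tuple(event[dim] for dim in group_by)
            # Aggregating: sum the measure per group.
            totals[key] += event[measure]
    return dict(totals)


result = olap_query(
    EVENTS, where={"device": "mobile"}, group_by=["country"], measure="clicks"
)
print(result)
```

Five input events collapse into two result rows, one per country, which is the "result set < input set" property.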

  8. Solution Space
    Relational databases (MySQL, Postgres)
    Key/value stores (HBase, Cassandra)
    Column stores

  9. Relational Database
    Traditional Data Warehouse
    - Row store
    - Star schema
    - Aggregate tables & query caches
    Fast becoming outdated
    Slow!

  10. Key/Value Stores
    Pre-computation
    - Pre-compute every possible query
    - Pre-compute a set of queries
    - Exponential scaling costs
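
A rough way to see the exponential cost: if every group-by over a subset of dimensions is pre-computed, the number of result tables doubles with each added dimension. The dimension names below are hypothetical, and the count ignores filter values, so the real number of pre-computed results would be larger still.

```python
from itertools import combinations


def dimension_subsets(dimensions):
    """Every subset of dimensions that a group-by query could ask for."""
    subsets = []
    for size in range(len(dimensions) + 1):
        subsets.extend(combinations(dimensions, size))
    return subsets


# Hypothetical dimensions of an advertising data set.
dims = ["country", "device", "browser"]
subsets = dimension_subsets(dims)
print(len(subsets))  # one pre-computed table per subset: 2 ** 3

# The count doubles with every extra dimension.
for n in (10, 20, 30):
    print(n, "dimensions ->", 2 ** n, "group-by combinations")
```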

  11. Key/Value Stores
    Range scans
    - Primary key: dimensions/attributes
    - Value: measures/metrics (things to aggregate)
    - Still too slow!
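
The range-scan layout can be sketched with a sorted list standing in for the key/value store. The key layout `(hour, country, device)` and the helper `scan_prefix` are assumptions for illustration, not the API of HBase or Cassandra. The point is that a filter on a non-leading dimension still has to touch every row in the scanned range.

```python
from bisect import bisect_left

# Sorted (key, value) rows: key = (hour, country, device), value = clicks.
ROWS = sorted([
    (("2015-12-01T00", "SG", "desktop"), 1),
    (("2015-12-01T00", "SG", "mobile"), 3),
    (("2015-12-01T00", "US", "mobile"), 5),
    (("2015-12-01T01", "SG", "mobile"), 2),
    (("2015-12-01T01", "US", "mobile"), 4),
])
KEYS = [key for key, _ in ROWS]


def scan_prefix(prefix):
    """Return all rows whose key starts with prefix, using one range scan."""
    start = bisect_left(KEYS, prefix)  # seek to the first candidate key
    rows = []
    for key, value in ROWS[start:]:
        if key[:len(prefix)] != prefix:
            break  # left the key range, stop scanning
        rows.append((key, value))
    return rows


# Filtering on "device" (the last key part) cannot narrow the scan:
# every row of the hour is read, and the filter is applied afterwards.
scanned = scan_prefix(("2015-12-01T00",))
mobile_clicks = sum(value for key, value in scanned if key[2] == "mobile")
print(len(scanned), "rows scanned,", mobile_clicks, "mobile clicks")
```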

  12. Key/Value Stores

  13. Column stores
    Load/scan exactly what you need for a query
    Different compression algorithms for different columns
    Different indexes for different columns

  14. Druid

  15. Druid
    Ideal for powering user-facing analytic applications
    Supports lots of concurrent reads
    Custom column format optimized for event data and BI queries
    Supports extremely fast filters
    Streaming data ingestion

  16. Storage Format

  17. Raw data

  18. Summarization

  19. Summarization

  20. Immutable Segments
    Fundamental storage unit in Druid
    No contention between reads and writes
    One thread scans one segment
    Multiple threads can access same underlying data
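
A minimal sketch of why immutability helps: since segments never change after creation, worker threads can scan them in parallel without locks, one segment per task. The segment contents and the helpers `scan_segment` and `query` are invented for illustration and do not reflect Druid's actual segment format or scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

# Each segment is an immutable tuple of (page, count) rows.
SEGMENTS = (
    (("Justin Bieber", 10), ("Ke$ha", 4)),
    (("Justin Bieber", 7),),
    (("Ke$ha", 2), ("Ke$ha", 5)),
)


def scan_segment(segment, page):
    """Scan a single segment; each call is handled by one thread."""
    return sum(count for row_page, count in segment if row_page == page)


def query(page, workers=2):
    """Fan out one scan per segment, then merge the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda segment: scan_segment(segment, page), SEGMENTS)
        return sum(partials)


total_bieber = query("Justin Bieber")
total_kesha = query("Ke$ha")
print(total_bieber, total_kesha)
```

No locking is needed because nothing mutates the segments; concurrent queries can read the same tuples safely.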

  21. Druid Multi-tenancy
    Diagram: processing order on a Druid Historical node, interleaving segment scans from several queries:
    Segment_query_1, Segment_query_2, Segment_query_1, Segment_query_3,
    Segment_query_2, Segment_query_1, Segment_query_4

  22. Columnar Storage
    Create IDs
    ● Justin Bieber → 0, Ke$ha → 1
    Store
    ● page → [0 0 0 1 1 1]
    ● language → [0 0 0 0 0 0]
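
The ID scheme on this slide is dictionary encoding. A minimal sketch, assuming IDs are handed out in first-seen order (the slide does not say how IDs are assigned) and using `"en"` as a placeholder language value:

```python
def dictionary_encode(values):
    """Map each distinct string to a small integer ID and encode the column."""
    dictionary = {}
    encoded = []
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)  # next unused ID
        encoded.append(dictionary[value])
    return dictionary, encoded


# Six rows: three for each page, all in the same (placeholder) language.
pages = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
languages = ["en"] * 6

page_dict, page_ids = dictionary_encode(pages)
lang_dict, lang_ids = dictionary_encode(languages)

print(page_dict, page_ids)
print(lang_dict, lang_ids)
```

The stored columns are arrays of small integers rather than repeated strings, which is what makes them cheap to scan and compress.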

  23. Columnar Storage
    Justin Bieber → [0, 1, 2] → [111000]
    Ke$ha → [3, 4, 5] → [000111]
    Justin Bieber OR Ke$ha → [111111]
    Compression!
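
The bitmaps on this slide form an inverted index: one bitmap per dimension value, with bit i set when row i contains that value. The sketch below uses plain Python integers as bitmaps and omits compression; the helper names are hypothetical.

```python
def build_bitmaps(column):
    """Build one bitmap per distinct value; bit i is set if row i has the value."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps


def to_bits(bitmap, num_rows):
    """Render a bitmap as a string with row 0 on the left, as on the slide."""
    return "".join("1" if (bitmap >> row) & 1 else "0" for row in range(num_rows))


pages = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
bitmaps = build_bitmaps(pages)

bieber = to_bits(bitmaps["Justin Bieber"], len(pages))
kesha = to_bits(bitmaps["Ke$ha"], len(pages))
# A filter such as "Justin Bieber OR Ke$ha" is a single bitwise OR.
either = to_bits(bitmaps["Justin Bieber"] | bitmaps["Ke$ha"], len(pages))

print(bieber, kesha, either)
```

AND and NOT filters work the same way with `&` and bit inversion, so boolean filters reduce to bitwise operations before any row is read.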

  24. Custom Columns
    Create custom logic
    Create approximate sketches
    - HyperLogLog
    - Approximate Histograms
    - Theta sketches
    Approximate algorithms are very powerful for fast queries
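
To give a feel for how such sketches trade accuracy for bounded memory, here is a small K-minimum-values (KMV) distinct-count estimator. It is a swapped-in teaching example in the same family of ideas as theta sketches, not the HyperLogLog or theta sketch implementation that Druid uses; `hash64` and `kmv_estimate` are hypothetical helpers.

```python
import hashlib
import heapq

HASH_SPACE = 2 ** 64


def hash64(value):
    """Deterministic 64-bit hash of a value's string form."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def kmv_estimate(values, k=1024):
    """Estimate the distinct count while keeping only the k smallest hashes."""
    heap = []        # max-heap (stored negated) of the k smallest hashes
    members = set()  # hashes currently in the heap, used to skip duplicates
    for value in values:
        h = hash64(value)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # h displaces the current largest of the k smallest hashes.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values seen: count is exact
    kth_smallest = -heap[0]
    # If k hashes fall below kth_smallest, the density of hashes in the
    # hash space gives the estimate (k - 1) / (kth_smallest / HASH_SPACE).
    return int((k - 1) * HASH_SPACE / kth_smallest)


# 50,000 events containing 10,000 distinct user IDs.
events = [i % 10_000 for i in range(50_000)]
estimate = kmv_estimate(events)
print(estimate)
```

Memory stays bounded by `k` regardless of input size, and the relative error shrinks roughly as one over the square root of `k`.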

  25. Architecture

  26. Architecture (Batch Ingestion)

  27. Architecture (Batch Ingestion)

  28. Real-time Nodes
    Write-optimized data structure: hash map in heap
    Convert write-optimized → read-optimized
    Read-optimized data structure: Druid segments
    Query data immediately
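
The write path on this slide can be sketched as a small class: events land in an in-memory hash map that is queryable right away, and a periodic persist step turns the map into sorted, immutable columns. The class `RealtimeBuffer`, its methods, and the column layout are assumptions for illustration, not Druid's real-time node code.

```python
class RealtimeBuffer:
    """Toy model of a write-optimized buffer that persists to a segment."""

    def __init__(self):
        # Write-optimized structure: hash map from (hour, page) to a count.
        self._rows = {}

    def ingest(self, hour, page, count=1):
        """Add an event, aggregating it into an existing row when possible."""
        key = (hour, page)
        self._rows[key] = self._rows.get(key, 0) + count

    def query(self, page):
        """Data is queryable immediately, before any persist happens."""
        return sum(c for (_, p), c in self._rows.items() if p == page)

    def persist(self):
        """Convert the hash map into sorted, immutable columns and reset."""
        keys = sorted(self._rows)
        segment = {
            "hour": tuple(key[0] for key in keys),
            "page": tuple(key[1] for key in keys),
            "count": tuple(self._rows[key] for key in keys),
        }
        self._rows = {}
        return segment


buffer = RealtimeBuffer()
buffer.ingest("2015-12-01T00", "Justin Bieber")
buffer.ingest("2015-12-01T00", "Ke$ha")
buffer.ingest("2015-12-01T00", "Justin Bieber")

before_persist = buffer.query("Justin Bieber")
segment = buffer.persist()
print(before_persist, segment)
```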

  29. Architecture (Streaming Ingestion)

  30. Architecture (Lambda)

  31. Querying
    Query libraries:
    - JSON over HTTP
    - SQL
    - R
    - Python
    - Ruby
    Open source UIs
    - Pivot (exploratory analytics)
    - Grafana (dev ops)
    - Panoramix (reporting)
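
As an illustration of the JSON-over-HTTP interface, the sketch below builds a native timeseries query and the HTTP request that would carry it. The broker address, the data source `sensor_events`, the dimension `sensor_id`, and the metric `count` are all assumptions; the request is only constructed, not sent, since sending needs a running Druid broker.

```python
import json
import urllib.request

# Assumed broker address; adjust host and port for a real cluster.
BROKER_URL = "http://localhost:8082/druid/v2/"

# Hourly sum of a metric for one sensor over one day.
query = {
    "queryType": "timeseries",
    "dataSource": "sensor_events",  # hypothetical data source
    "granularity": "hour",
    "intervals": ["2015-12-01/2015-12-02"],
    "filter": {"type": "selector", "dimension": "sensor_id", "value": "s-42"},
    "aggregations": [
        {"type": "longSum", "name": "readings", "fieldName": "count"},
    ],
}

request = urllib.request.Request(
    BROKER_URL,
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending is left commented out because it requires a live broker:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
print(request.get_method(), request.full_url)
```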

  32. Druid in Production

  33. Ingestion
    >3M events / second sustained (200B+ events/day)
    10 – 100k events / second / core
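
A quick check that the two figures on the first line agree, taking the sustained rate as exactly 3M events per second (the slide gives it as a lower bound):

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_second = 3_000_000  # lower bound quoted on the slide
events_per_day = events_per_second * SECONDS_PER_DAY

print(f"{events_per_day / 1e9:.1f}B events/day")
```

The result is above 200 billion, consistent with the "200B+ events/day" figure.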

  34. Volume
    Largest known cluster
    - >500 TB of segments (>50 trillion raw events, >50 PB raw data)
    Extremely cost effective at scale
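
The quoted figures imply a rough size reduction from raw data to segments. The arithmetic below treats each lower bound as an exact value and uses decimal units, so the ratios are only indicative:

```python
TB = 10 ** 12
PB = 10 ** 15

segment_bytes = 500 * TB     # ">500 TB of segments"
raw_bytes = 50 * PB          # ">50 PB raw data"
raw_events = 50 * 10 ** 12   # ">50 trillion raw events"

size_reduction = raw_bytes // segment_bytes            # raw size vs. stored size
raw_bytes_per_event = raw_bytes // raw_events          # average raw event size
segment_bytes_per_event = segment_bytes // raw_events  # stored bytes per raw event

print(size_reduction, raw_bytes_per_event, segment_bytes_per_event)
```

Under these assumptions, a raw event of about 1 KB ends up costing about 10 bytes of segment storage, a reduction of roughly 100x.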

  35. Queries
    500ms average query latency
    90% < 1s, 95% < 2s, 99% < 10s

  36. Multi-tenancy
    Several hundred queries / second
    Variety of group by & top-K queries

  37. Druid & the Data Space

  38. The Data Space
    Big data is hard! Many specialized solutions for different problems
    Standards are slowly emerging

  39. A Real-time Analytics Stack
    Diagram: Events → Message bus → Stream Processor and Batch Processor → Druid → Apps

  40. Integration
    Druid is complementary to many solutions
    - SQL-on-Hadoop (Hive, Impala, Spark SQL, Drill, Presto)
    - Stream processors (Storm, Spark Streaming, Flink, Samza)
    - Batch processors (Spark, Hadoop, Flink)
    - Message buses (Kafka, RabbitMQ)

  41. Druid Community
    Growing Community
    - 140+ contributors from many different companies
    We love contributions!
    We’re actively seeking committers right now!

  42. Takeaway
    Druid is pretty good for powering applications
    Druid is pretty good at fast OLAP queries
    Druid is pretty good at streaming ingestion
    Druid works well with existing data infrastructure systems

  43. Thanks!
    @implydata
    @druidio
    @fangjin
    imply.io
    druid.io
