
Real-time analytics with Druid at Appsflyer

Presentation from the first-ever Druid meetup in Israel
http://meetup.com/Druid-Israel/events/229123558/

AppsFlyer

March 16, 2016


Transcript

  1. Meet Druid!
    Real-time analytics with Druid at
    Appsflyer


  2. Appsflyer Flow
    (Diagram) Publisher → Click → Install → Advertiser


  3. Appsflyer as a Marketing Platform
    Fraud detection
    Statistics
    Attribution
    Lifetime value
    Retargeting
    Prediction
    A/B testing


  4. Appsflyer Technology
    ● ~8B events / day
    ● Hundreds of machines in Amazon
    ● Tens of micro-services
    (Diagram) Micro-services communicating through Apache Kafka; data lands in Amazon S3, MongoDB, Redshift, and Druid


  5. Realtime
    ● New buzzword
    ● Ingestion latency - seconds
    ● Query latency - seconds


  6. Analytics
    ● Roll-up
    ○ Summarizing over a dimension (see the worked example after this list)
    ● Drill-down
    ○ Focusing (zooming in)
    ● Slicing and dicing
    ○ Reducing dimensions (slice)
    ○ Picking values of specific dimensions (dice)
    ● Pivoting
    ○ Rotating multi-dimensional cube
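    For example, roll-up at ingestion time (hypothetical numbers): three raw events
    12:00 US installs=1, 12:05 US installs=2, 12:20 DE installs=4
    summarized at hour granularity become two rows:
    12:00 US installs=3 and 12:00 DE installs=4.
    Drill-down goes the other way, e.g. from the (hour, country) view back down to per-campaign rows.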


  7. Analytics in 3D


  8. We tried...
    ● MongoDB
    ○ Operational issues
    ○ Performance is not great
    ● Redshift
    ○ Concurrency limits
    ● Aurora (MySQL)
    ○ Aggregations are not optimized
    ● MemSQL
    ○ Insufficient performance
    ○ Too pricey
    ● Cassandra
    ○ Not flexible


  9. Druid
    ● Storage optimized for analytics
    ● Lambda architecture inside
    ● JSON-based query language
    ● Developed by an analytics SaaS company
    ● Free and open source
    ● Scalable to petabytes...


  10. Druid Storage
    ● Columnar
    ● Inverted index
    ● Immutable segments


  11. Columnar Storage
    Original data: 100MB
    Queried columns only: 10MB
    Compressed: 3MB


  12. Index
    ● Values are dictionary encoded
    {“USA” -> 1, “Canada” -> 2, “Mexico” -> 3, …}
    ● Bitmap for every dimension value (used by filters)
    “USA” -> [0, 1, 0, 0, 1, 1, 0, 0, 0]
    ● Column values (used by aggregation queries)
    [2, 1, 3, 15, 1, 1, 2, 8, 7]
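    Filters combine these bitmaps with boolean operations; a hypothetical example with a second dimension “os”:
    “USA”     -> [0, 1, 0, 0, 1, 1, 0, 0, 0]
    “Android” -> [1, 1, 0, 0, 0, 1, 0, 1, 0] (assumed values)
    AND       -> [0, 1, 0, 0, 0, 1, 0, 0, 0], so only rows 1 and 5 are scanned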


  13. Data Segments
    ● Per time interval
    ○ Skip segments when querying
    ● Immutable
    ○ Cache friendly
    ○ No locking
    ● Versioned (MVCC)
    ○ No locking
    ○ Read-write concurrency
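    For illustration, a segment identifier encodes all of this, roughly as dataSource_intervalStart_intervalEnd_version (a sketch; the version is typically the creation timestamp):
    inappevents_2016-03-16T00:00:00.000Z_2016-03-17T00:00:00.000Z_2016-03-17T02:00:00.000Z
    Reindexing the same interval publishes a higher version, and queries switch to it atomically.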


  14. Data Ingestion
    (Diagram) Real-time data is ingested via streaming and handed off to historical storage; historical data is batch indexed; the Broker queries both paths


  15. Real-time Ingestion
    ● Via Real-Time Node and Firehose
    ○ No redundancy or HA, thus not recommended
    ● Via Indexing Service and Tranquility API
    ○ Core API
    ○ Integrations with Streaming Frameworks
    ○ HTTP Server (example below)
    ○ Kafka Consumer
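    A minimal sketch of the HTTP Server option, assuming Tranquility Server on its default port 8200 and the inappevents data source used later in this deck (host name hypothetical):
    ~# curl -X POST -H "Content-Type: application/json" \
         -d '{"timestamp": "2016-03-16T12:00:00Z", "app_id": "com.comuto", "country": "RU", "monetary": 0.99}' \
         http://tranquility-host:8200/v1/post/inappevents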


  16. Batch Ingestion
    ● File based (HDFS, S3, …)
    ● Indexers
    ○ Internal Indexer
    ■ For datasets < 1G
    ○ External Hadoop Cluster
    ○ Spark Indexer
    ■ Work in progress
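    For example, a batch task spec is submitted to the Overlord over HTTP (host and file names hypothetical):
    ~# curl -X POST -H "Content-Type: application/json" -d @hadoop_index_task.json http://overlord:8090/druid/indexer/v1/task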


  17. Ingestion Spec
    ● Parsing configuration (Flat JSON, *SV)
    ● Dimensions
    ● Metrics
    ● Granularity
    ○ Segment granularity
    ○ Query granularity
    ● I/O configuration
    ○ Where to read data from
    ● Tuning configuration
    ○ Indexer tuning
    ● Partitioning and replication
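    A skeletal ingestion spec tying these pieces together (a sketch in the 0.8.x/0.9.x Hadoop-indexing layout; all values hypothetical):
    {
      "type": "index_hadoop",
      "spec": {
        "dataSchema": {
          "dataSource": "inappevents",
          "parser": {
            "type": "string",
            "parseSpec": {
              "format": "json",
              "timestampSpec": { "column": "timestamp", "format": "iso" },
              "dimensionsSpec": { "dimensions": ["app_id", "media_source", "campaign", "country"] }
            }
          },
          "metricsSpec": [
            { "type": "count", "name": "events_count" },
            { "type": "doubleSum", "name": "revenue", "fieldName": "monetary" }
          ],
          "granularitySpec": {
            "segmentGranularity": "DAY",
            "queryGranularity": "HOUR",
            "intervals": ["2016-03-01/2016-03-02"]
          }
        },
        "ioConfig": {
          "type": "hadoop",
          "inputSpec": { "type": "static", "paths": "s3n://bucket/events/2016-03-01/" }
        },
        "tuningConfig": {
          "type": "hadoop",
          "partitionsSpec": { "type": "hashed", "targetPartitionSize": 5000000 }
        }
      }
    }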


  18. Real-time ingestion
    (Diagram) Task 1 and Task 2 overlap along the time axis within the interval window
    Minimum indexing slots = Data sources x Partitions x Replicas x 2
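    For example, 2 data sources × 2 partitions × 2 replicas × 2 = 16 indexing slots; the final factor of 2 covers the window in which the previous interval's tasks are still handing segments off while the next interval's tasks are already indexing.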


  19. Query Types
    ● Group by
    ○ grouping by multiple dimensions
    ● Top N
    ○ like grouping by a single dimension
    ● Timeseries
    ○ without grouping over dimensions
    ● Search
    ○ Dimensions lookup
    ● Time boundary
    ○ Find available data timeframe
    ● Metadata queries
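    For example, a topN query for the 10 media sources with the highest revenue (a sketch; values hypothetical):
    {
      "queryType": "topN",
      "dataSource": "inappevents",
      "dimension": "media_source",
      "metric": "revenue",
      "threshold": 10,
      "granularity": "all",
      "aggregations": [ { "type": "doubleSum", "name": "revenue", "fieldName": "monetary" } ],
      "intervals": [ "2016-03-01T00:00:00.000/2016-03-16T00:00:00.000" ]
    }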


  20. Tips for Querying
    ● Prefer topN over groupBy
    ● Prefer timeseries over topN and groupBy
    ● Use limits (and priorities) - see the sketch below
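    A sketch of what that looks like inside a groupBy body (exact defaults vary by version):
    "limitSpec": { "type": "default", "limit": 1000, "columns": ["revenue"] },
    "context": { "priority": 100, "timeout": 60000 }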


  21. Query Spec
    ● Data source
    ● Dimensions
    ● Interval
    ● Filters
    ● Aggregations
    ● Post aggregations
    ● Granularity
    ● Context (query configuration)
    ● Limit


  22. Sample Query
    ~# curl -X POST -d @query.json -H "Content-Type: application/json" http://druidbroker:8082/druid/v2?pretty
    {
      "queryType": "groupBy",
      "dataSource": "inappevents",
      "granularity": "hour",
      "dimensions": ["media_source", "campaign"],
      "filter": {
        "type": "and",
        "fields": [
          { "type": "selector", "dimension": "app_id", "value": "com.comuto" },
          { "type": "selector", "dimension": "country", "value": "RU" }
        ]
      },
      "aggregations": [
        { "type": "count", "name": "events_count" },
        { "type": "doubleSum", "name": "revenue", "fieldName": "monetary" }
      ],
      "intervals": [ "2015-12-01T00:00:00.000/2016-01-01T00:00:00.000" ]
    }


  23. Caching
    ● Historical node level
    ○ By segment
    ● Broker level
    ○ By segment and query
    ○ Caching of “groupBy” results is disabled on purpose!
    ● By default - local caching
    ● In production - use memcached (config sketch below)
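    A minimal memcached setup in the broker's runtime.properties (a sketch; property names per Druid 0.8.x/0.9.x, hosts hypothetical):
    druid.broker.cache.useCache=true
    druid.broker.cache.populateCache=true
    druid.cache.type=memcached
    druid.cache.hosts=memcached1:11211,memcached2:11211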


  24. Load Rules
    ● Can be defined
    ○ On data source
    ○ On “tier”
    ● What can be set
    ○ Replication factor
    ○ Load period
    ○ Drop period
    ● Can be used to separate “hot” data from “cold” data (example below)
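    For example, a hot/cold split as an ordered rule chain (rules are evaluated top-down; tier names hypothetical):
    [
      { "type": "loadByPeriod", "period": "P1M", "tieredReplicants": { "hot": 2 } },
      { "type": "loadForever", "tieredReplicants": { "cold": 1 } }
    ]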


  25. Druid Components
    (Diagram) Historical Nodes, Real-time Nodes, Coordinator, Indexing Service (Overlord + Middle Manager), Broker Nodes, Deep Storage, Metadata Storage

  26. Druid Components
    (Diagram) The same components, plus a Cache and a Load Balancer in front of the Broker Nodes


  27. Druid Components (Explained)
    ● Coordinator
    ○ Manages segments
    ● Real-time Nodes
    ○ Pull data in real time and index it
    ● Historical Nodes
    ○ Keep historical segments
    ● Overlord
    ○ Accepts tasks and distributes them to Middle Managers
    ● Middle Manager
    ○ Executes submitted tasks via Peons
    ● Broker Nodes
    ○ Route queries to Real-time and Historical nodes and merge the results
    ● Deep Storage
    ○ Segment backup (HDFS, S3, …)


  28. Failover
    ● Coordinator and Overlord
    ○ HA
    ● Real-time nodes
    ○ Tasks are replicated
    ○ Pool of nodes
    ● Historical nodes
    ○ Data is replicated
    ○ Pool of nodes
    ○ All segments are backed up in the deep storage
    ● Brokers
    ○ Pool of nodes
    ○ Load balancer at the front


  29. Druid at Appsflyer
    (Diagram) The Druid Sink service, with S3 storage


  30. Druid Sink
    (Diagram) Druid Sink wraps the Tranquility API
    Probably not needed anymore due to native support in the Tranquility package


  31. Druid in Production
    ● Provisioning using Chef
    ● r3.8xlarge (sample configuration is OK)
    ● Redundancy for coordinator and overlord (node per AZ)
    ● Historical and real-time nodes are spread across AZs
    ● Load balancing - Consul from HashiCorp
    ● Service discovery - Consul again
    ● Memcached
    ● Monitoring via Graphite Emitter extension
    ○ https://github.com/druid-io/druid/pull/1978
    ● Alerting via Sensu


  32. IAP (Imply Analytics Platform) Distribution
    ● 3 different node types (instead of 6)
    ● Unpack and run
    ● Some useful wrappers
    ● Built-in examples for quick start
    ● Commercial support
    ● PlyQL, Pivot inside
    http://imply.io


  33. Tips
    ● ZooKeeper is heavily used
    ○ Choose appropriate hardware/network for ZK machines
    ● Use the latest version (0.8.3)
    ○ Restartable tasks
    ○ Indexing time improvement! (https://github.com/druid-io/druid/pull/1960)
    ○ Data sketches library
    ● All exceptions are useful - read them carefully


  34. When Not to Choose Druid?
    ● When data is not time-series
    ● When data cardinality is high
    ● When number of output rows is high
    ● When setup costs must be avoided


  35. Non-time Series Workarounds
    ● The data must still carry some timestamp
    ● Rebuild everything to order by your timestamp
    ● Or, use single-dimension partitioning (sketch after this list)
    ○ Segments are partitioned by timestamp first, then by dimension range
    ○ Find the optimal target segment size
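    A sketch of that partitioning in a Hadoop indexing tuningConfig (field names per Druid 0.8.x/0.9.x; values hypothetical):
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "dimension",
        "targetPartitionSize": 5000000,
        "partitionDimension": "app_id"
      }
    }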
    Still, please don’t use Druid for non-time series!


  36. Tools: Pivot


  37. Tools: Panoramix


  38. Thank you!
