AppsFlyer Technology
● ~8B events / day
● Hundreds of machines in Amazon
● Tens of micro-services
[Architecture diagram: micro-services communicating over Apache Kafka, with data landing in a DB, Amazon S3, MongoDB, Redshift, and Druid]
We tried...
● MongoDB
  ○ Operational issues
  ○ Performance is not great
● Redshift
  ○ Concurrency limits
● Aurora (MySQL)
  ○ Aggregations are not optimized
● MemSQL
  ○ Insufficient performance
  ○ Too pricey
● Cassandra
  ○ Not flexible enough
Druid
● Storage optimized for analytics
● Lambda architecture inside
● JSON-based query language
● Developed by an analytics SaaS company (Metamarkets)
● Free and open source
● Scalable to petabytes...
Data Segments
● Per time interval
  ○ Skip segments when querying (see the sketch below)
● Immutable
  ○ Cache friendly
  ○ No locking
● Versioned (MVCC)
  ○ No locking
  ○ Read-write concurrency
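Because every segment covers a known time interval, a broker only has to consult segments whose interval overlaps the query. A minimal sketch of that pruning idea in Python; the dataSource name, intervals, and tuple layout are illustrative, not Druid's actual internal classes:

```python
from datetime import datetime

# Illustrative only: Druid segment identifiers encode a dataSource,
# an interval start/end, and a version, e.g.
# "events_2016-03-01T00:00:00Z_2016-03-02T00:00:00Z_v1".
segments = [
    ("events", datetime(2016, 3, 1), datetime(2016, 3, 2), "v1"),
    ("events", datetime(2016, 3, 2), datetime(2016, 3, 3), "v1"),
    ("events", datetime(2016, 3, 3), datetime(2016, 3, 4), "v2"),
]

def overlapping(segments, query_start, query_end):
    """Keep only segments whose interval overlaps the query interval."""
    return [s for s in segments
            if s[1] < query_end and s[2] > query_start]

# A query for March 3rd only touches the last segment;
# the other two are skipped without being read.
print(overlapping(segments, datetime(2016, 3, 3), datetime(2016, 3, 4)))
```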
Real-time Ingestion
● Via Real-Time Node and Firehose
  ○ No redundancy or HA, thus not recommended
● Via Indexing Service and Tranquility API
  ○ Core API
  ○ Integrations with streaming frameworks
  ○ HTTP server (example below)
  ○ Kafka consumer
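With the HTTP server option, events are simply POSTed to Tranquility Server as JSON. A minimal sketch using Python's requests library; the host, port, dataSource name, and field names are assumptions for illustration:

```python
import json
import time

import requests

# Assumed Tranquility Server endpoint (default port 8200);
# "events" is a hypothetical dataSource name.
TRANQUILITY_URL = "http://localhost:8200/v1/post/events"

event = {
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "app_id": "com.example.app",   # hypothetical dimension
    "country": "US",               # hypothetical dimension
    "installs": 1,                 # hypothetical metric
}

resp = requests.post(
    TRANQUILITY_URL,
    data=json.dumps(event),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
print(resp.json())  # Tranquility reports received/sent counts
```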
Query Types
● Group by
  ○ Grouping by multiple dimensions
● Top N
  ○ Like grouping by a single dimension (example below)
● Timeseries
  ○ Without grouping over dimensions
● Search
  ○ Dimension-value lookup
● Time boundary
  ○ Find the available data timeframe
● Metadata queries
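A query is just a JSON document POSTed to the broker. Here is a hedged Top N sketch; the broker address, dataSource, dimension, and metric names are assumptions:

```python
import requests

# Assumed broker address; /druid/v2 is the standard query endpoint.
BROKER_URL = "http://localhost:8082/druid/v2"

# Top 10 countries by summed installs over one week.
query = {
    "queryType": "topN",
    "dataSource": "events",            # hypothetical dataSource
    "intervals": ["2016-03-01/2016-03-08"],
    "granularity": "all",
    "dimension": "country",            # hypothetical dimension
    "metric": "installs",              # hypothetical metric
    "threshold": 10,
    "aggregations": [
        {"type": "longSum", "name": "installs", "fieldName": "installs"}
    ],
}

resp = requests.post(BROKER_URL, json=query)
resp.raise_for_status()
for row in resp.json():
    print(row["timestamp"], row["result"])
```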
Caching
● Historical node level
  ○ By segment
● Broker level
  ○ By segment and query
  ○ Caching of “groupBy” results is disabled on purpose!
● By default - local caching
● In production - use memcached
● Can also be controlled per query (see below)
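Cache behavior can also be steered per query through the query context, using the useCache and populateCache flags. A short sketch (the dataSource is hypothetical; the body would be POSTed to the broker exactly like the Top N example above):

```python
# Per-query cache control via the "context" field; here the result
# cache is read but not written, e.g. for one-off exploratory queries.
query = {
    "queryType": "timeseries",
    "dataSource": "events",            # hypothetical dataSource
    "intervals": ["2016-03-01/2016-03-08"],
    "granularity": "day",
    "aggregations": [{"type": "count", "name": "rows"}],
    "context": {
        "useCache": True,       # read from cache if present
        "populateCache": False  # but do not write new entries
    },
}
```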
Load Rules
● Can be defined
  ○ Per data source
  ○ Per “tier”
● What can be set (example below)
  ○ Replication factor
  ○ Load period
  ○ Drop period
● Can be used to separate “hot” data from “cold” data
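Rules are posted to the coordinator per dataSource and evaluated top to bottom, first match wins. A hedged sketch; the coordinator address, tier names, periods, and replica counts are assumptions for illustration:

```python
import requests

# Assumed coordinator address; rules are set per dataSource ("events").
COORDINATOR_URL = "http://localhost:8081/druid/coordinator/v1/rules/events"

rules = [
    # Keep the last 30 days on the "hot" tier with 2 replicas...
    {"type": "loadByPeriod", "period": "P30D",
     "tieredReplicants": {"hot": 2}},
    # ...keep the last year on the "cold" tier with 1 replica...
    {"type": "loadByPeriod", "period": "P1Y",
     "tieredReplicants": {"cold": 1}},
    # ...and drop everything older than that from the cluster
    # (segments remain in deep storage).
    {"type": "dropForever"},
]

resp = requests.post(COORDINATOR_URL, json=rules)
resp.raise_for_status()
```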
Failover
● Coordinator and Overlord
  ○ HA
● Real-time nodes
  ○ Tasks are replicated
  ○ Pool of nodes
● Historical nodes
  ○ Data is replicated
  ○ Pool of nodes
  ○ All segments are backed up in the deep storage
● Brokers
  ○ Pool of nodes
  ○ Load balancer at the front
Druid in Production
● Provisioning using Chef
● r3.8xlarge (sample configuration is OK)
● Redundancy for coordinator and overlord (node per AZ)
● Historical and real-time nodes are spread between AZs
● LB - Consul from HashiCorp
● Service discovery - Consul again
● Memcached
● Monitoring via the Graphite Emitter extension
  ○ https://github.com/druid-io/druid/pull/1978
● Alerting via Sensu
IAP (Imply Analytics Platform) Distribution
● 3 different node types (instead of 6)
● Unpack and run
● Some useful wrappers
● Built-in examples for a quick start
● Commercial support
● PlyQL and Pivot inside
http://imply.io
Tips
● ZooKeeper is heavily used
  ○ Choose appropriate hardware/network for ZK machines
● Use the latest version (0.8.3)
  ○ Restartable tasks
  ○ Indexing-time improvement! (https://github.com/druid-io/druid/pull/1960)
  ○ Data sketches library
● All exceptions are useful
When Not to Choose Druid?
● When data is not time-series
● When data cardinality is high
● When the number of output rows is high
● When setup costs must be avoided
Non-time-series Workarounds
● You must still have some timestamp
● Rebuild everything to order by your timestamp
● Or, use single-dimension partitioning (sketch below)
  ○ Segments are partitioned by timestamp first, then by dimension range
  ○ Find the optimal target segment size

Still, please don’t use Druid for non-time-series data!
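For reference, single-dimension partitioning is configured in the partitionsSpec of a batch (Hadoop) ingestion spec. A hedged sketch of that fragment; the dimension name and target size are assumptions, and exact field names should be checked against the docs for your Druid version:

```python
# Hedged sketch of the "partitionsSpec" fragment of a Hadoop
# batch-ingestion spec for single-dimension partitioning.
partitions_spec = {
    "type": "dimension",             # partition by one dimension's range
    "targetPartitionSize": 5000000,  # aim for ~5M rows per segment
    "partitionDimension": "app_id",  # hypothetical secondary dimension
}
```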