How many unique male visitors visited my website last month • How much revenue was generated last quarter broken down by a demographic • Not dumping an entire data set • Not querying for an individual event • Cost effective (we are a startup after all)
~ 40 PB of raw data • Over 200 TB of compressed query-able data • Ingesting over 300,000 events/sec on average • Average query time 500ms • 90% queries under 1 second • 99% queries under 10 seconds
/ second / core 1 day of summarized aggregates 60M+ rows 1 query over 1 week, 16 cores ~ 5 seconds Page load with 20 queries over a week of data ……. long time
on primary key • Inflexible • not aggregated, not available • Not continuously updated • Dimensional combinations & Processing => scales exponentially • Example: ~ 500k records • 11 dimensions : 4.5 hours on a 15-node Hadoop Cluster • 14 dimensions : 9 hours on a 25-node hadoop cluster
events in a write-optimized data structure • Periodically persist collected events to disk (converting to a read- optimized format) • Query data as soon as it is ingested • Merges all intermediate segments and hands them over to historical nodes
query performance • Data without tools to analyze it is useless. • WAIT… We have Druid. Use Druid to monitor Druid!! • > 10TB of metrics data in Druid • Interactive exploration of performance metrics allows to pinpoint problems quickly • Narrow problems down to individual query and server • Provides both big picture and the detailed breakdown
More Data -> More Nodes -> More Failures & More Cost!! • Throwing money at the problem only a short term solution • Some piece always fails to scale • Startup means daily operations handled by dev team
from metrics cluster • Percentage of data queried at any given time is small • Query load across the nodes is NOT uniform • Users really look only for recent data interactively using dashboard (3 Months) • Users run quarterly reports (Non-Interactive scripts) • Large queries create bottlenecks and resource contention • 20% of users take 80% of resources
simple • Keeping all data in memory is expensive • Percentage of data queries at any given time is small • Memory management is hard, let OS handle paging • Flexible configuration - control how much to page • Cost vs Performance becomes a simple dial • Use SSDs to mitigate performance impact (still cheaper than RAM)
that isn’t required for queries saves cost • Memory is still critical for performance • Cost of decompressing data present in RAM << cost of paging data from Disk • On-the-fly decompression is fast with recent algorithms (LZF, Snappy, LZ4)
uniform • Greedily rebalance based on cost heuristics • Avoid co-locating recent or overlapping data • Favor co-locating data for different customer • Distribute data likely to be queries together
sometimes (during failures) • Replication can become expensive • Availability is IMPORTANT • Trade-off performance for cost during failures • Move replica to COLD tier • Keep a single replica in HOT tier
tool optimized for the type of queries you will make • If none of the existing tools solve your problem, build it. • Understand your USERS • Analyze query patterns • Use cases should define the product • Tradeoffs are everywhere • Performance vs Cost (in-memory, tiering, compression) • Latency vs throughput (streaming vs batch ingestion) • Monitor everything