
Elastic{ON} 2018 - Scaling Log Aggregation At Fitbit

Elastic Co
March 01, 2018

Transcript

  1. Scaling Log Aggregation At Fitbit: 1) Introduction 2) A Series of Scaling Challenges 3) This Is Not A Spike 4) Onwards To 300,000 logs/s 5) Where We Go Next
  2. What Do We Log? • User profile activity • Device sync • 3rd party integration sync • Internal application and development events
  3. Three Kinds Of Search • Release Management and Incident Response • Historical Surveys • Historical Investigation / Incident Research
  4. Where We Started ~3 Years Ago • 35,000 logs/s, peaking at 45,000 logs/s • 6 days of retention • Between 15 and 18 billion logs on disk
  5. The Traditional Log Aggregation Pipeline: (1) Shipper → (2) Message Queue → (3) Processor → (4) Elasticsearch → (5) Kibana
  6. Where We Started: Host System (Logstash shipper) → Redis messaging queue (4) → Processor Nodes (4, Logstash) → Production Cluster (elk): Master Nodes (3), Data Nodes (30); Kibana on Search Nodes (4)
  7. Where We Are Now • 300,000 logs/s, peaking at 320,000 logs/s • 25-30 billion log messages a day • Between 7 and 90 days of retention, based on log type • Between 170 and 180 billion logs on disk • Daily Kafka throughput of 44-48 TB
  8. Where We Started: Host System (Logstash shipper) → Redis messaging queue (4) → Processor Nodes (4, Logstash) → Production Cluster (elk): Master Nodes (3), Data Nodes (30); Kibana on Search Nodes (4)
  9. Where We Are Now: Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (21, Logstash) → Kafka messaging queue → Logstash → Production Clusters elk05a, elk05b, elk05c (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)) and Disaster Recovery Cluster elk06 (Master Nodes (1), Indexing Nodes (5), Storage Nodes (6)); Kibana on Search Nodes (5)
  10. Scaling And Performance Issues (Elasticsearch 1.3.1 / Logstash 1.5.x / Kibana 3.x) • Issues with the accuracy of indices.breaker.fielddata.limit • Indexing fighting with search for disk and memory resources • Facet searches are expensive in both heap and CPU
  11. Scaling And Performance Issues (Elasticsearch 1.3.1 / Logstash 1.5.x / Kibana 3.x) • Individual haproxy hosts exceeding Logstash capacity • MySQL logs sometimes contain binary data • Cluster crash recovery took between 5 and 6 hours
  12. The New Cluster • Increase retention from 5 days to 30 days • Double host count for log sources • Ingest double the number of log types from existing hosts • Stop dropping haproxy logs • Handle 2+ years of growth, including spikes of up to 2x traffic
  13. The New Cluster • Replace Redis with Apache Kafka • Move to a tiered hot/warm architecture (allocation sketch below) • Archive logs offsite
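     A minimal sketch of hot/warm tiering via a custom node attribute; the attribute name box_type, template name, and index names here are assumptions, not the exact configuration used at Fitbit:

        # elasticsearch.yml on an indexing-tier ("hot") node -- attribute name is an assumption
        node.box_type: hot

        # elasticsearch.yml on a storage-tier ("warm") node
        node.box_type: warm

        # index template: new daily indices land on hot nodes first
        PUT _template/logs
        {
          "template": "logstash-*",
          "settings": { "index.routing.allocation.require.box_type": "hot" }
        }

        # later (e.g. nightly), push an aging index over to the warm tier
        PUT logstash-2018.02.28/_settings
        { "index.routing.allocation.require.box_type": "warm" }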
  14. Host System (Logstash shipper) → Redis messaging queue (4) → Processor Nodes (4, Logstash) → Production Cluster (elk): Master Nodes (3), Data Nodes (30); Kibana on Search Nodes (4)
  15. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (6, Logstash) → Production Cluster (elk02): Master Nodes (3), Indexing Nodes (25), Storage Nodes (35); Kibana on Search Nodes (4)
  16. Log Archiving (Elasticsearch 1.5.4 / Logstash 1.5.x / Kibana 3.x) • Kafka allows for multiple consumers of the same data • Logstash has a file output plugin (sketch below) • Dump the logs to disk by tier and type, pigz -9 to compress • AES encrypt and upload to S3
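     A minimal sketch of an archiving consumer along those lines; the ZooKeeper address, topic, consumer group, and paths are assumptions (Logstash 1.5-era kafka input shown):

        input {
          kafka {
            zk_connect => "zookeeper01:2181"   # assumption: 1.5-era kafka input, ZooKeeper-based
            topic_id   => "logs-application"   # one archiver pipeline per tier/type
            group_id   => "log-archiver"       # separate consumer group, so indexing keeps its own offsets
          }
        }
        output {
          file {
            path  => "/archive/application/%{+YYYY-MM-dd}/application.json"
            codec => json_lines
          }
        }
        # a cron job then compresses finished files with pigz -9, AES-encrypts them, and uploads to S3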
  17. The New Cluster (elk02), Elasticsearch 1.5.4 / Logstash 1.5.x / Kibana 3.x • Between 50,000 logs/s and 85,000 logs/s • Crash recovery was ~3 hours from start to finish
  18. The New, New Cluster: Design Goals • Support arbitrary, sustained ingestion doublings on short notice • Limit cluster node counts for performance and stability • Move to Elasticsearch 2.x and Kibana 4.x • Handle 2+ years of growth
  19. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (8, Logstash) → Production Cluster (elk02): Master Nodes (3), Indexing Nodes (25), Storage Nodes (35); Kibana on Search Nodes (4)
  20. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (8 + 8, Logstash) → Production Clusters elk03a and elk03b (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (80)); Kibana on Search Nodes (4)
  21. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (8 + 8, Logstash) → Production Clusters elk03a and elk03b (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (80)); Kibana on Search Nodes (4)
  22. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (8 + 8, Logstash) → Kafka messaging queue → Logstash → Production Clusters elk03a and elk03b (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)); Kibana on Search Nodes (4)
  23. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (16, Logstash) → Kafka messaging queue → Logstash → Production Clusters elk03a and elk03b (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)); Kibana on Search Nodes (4)
  24. A Word About Cluster Sizes (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x) • Each node maintains 13 TCP connections to every other node (worked out below) • Every node must acknowledge every change in cluster state • Zen discovery timeouts are set for a reason
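     Rough arithmetic on why node count matters: at 13 connections per node pair, a cluster of n nodes holds on the order of 13 × n × (n − 1) TCP connections. For roughly 73 nodes (3 masters + 30 indexing + 40 storage, ignoring client and search nodes) that is about 13 × 73 × 72 ≈ 68,000 connections, and every cluster-state change still has to be acknowledged by every node.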
  25. A Word About Tribe Nodes (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x) • Tribe nodes are full members of all clusters they connect to • Queries issued to tribe nodes go to all clusters • Search results from each cluster are merged at the tribe node • For large clusters, upgrades are no longer possible • Stability is questionable for the tribe node
  26. A Word About Link Latency (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x) • Take the bandwidth-delay product into account (example below) • Older Logstash can't adjust TCP window sizes • Kafka MirrorMaker has settings to adjust its socket buffers
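     As an illustration of sizing those buffers from the bandwidth-delay product: a 1 Gbit/s link with an 80 ms round trip keeps about 10 MB in flight, so the socket buffers need to be at least that large. The link numbers are illustrative, and the property names are standard 0.10-era Kafka client settings rather than anything Fitbit-specific:

        # BDP = bandwidth x RTT = (1 Gbit/s / 8) x 0.080 s = 10 MB in flight
        # MirrorMaker consumer.properties
        receive.buffer.bytes=10485760
        # MirrorMaker producer.properties
        send.buffer.bytes=10485760
        # the OS must also allow buffers this large (net.core.rmem_max / net.core.wmem_max)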
  27. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (16, Logstash) → Kafka messaging queue → Logstash → Production Clusters elk03a and elk03b (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)); Kibana on Search Nodes (4)
  28. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (16, Logstash) → Kafka messaging queue → Logstash → Production Clusters elk03a, elk03b, and elk03c (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)); Kibana on Search Nodes (4)
  29. Logstash Indexing Nodes • Each host runs Logstash indexers that consume from Kafka • Each host runs 3 Elasticsearch client nodes • Logstash writes round-robin to the client nodes via a local haproxy (sketch below)
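     A minimal sketch of that local load-balancing layer on an indexing host; the ports and names are assumptions:

        # haproxy.cfg -- Logstash writes to 127.0.0.1:9200, haproxy fans out round-robin
        listen elasticsearch_clients
            bind 127.0.0.1:9200
            mode tcp
            balance roundrobin
            server es-client-1 127.0.0.1:9201 check
            server es-client-2 127.0.0.1:9202 check
            server es-client-3 127.0.0.1:9203 check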
  30. E_TOO_MUCH_OPEX (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x) • At 30 days of retention, 170,000 logs/s is ~450 billion logs on disk (arithmetic below) • Logs are between 673 and 2,600 bytes each • Not all logs are created equal • Some can be dropped early
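     Checking the arithmetic: 170,000 logs/s × 86,400 s/day × 30 days ≈ 441 billion logs, and at 673 to 2,600 bytes per log that is very roughly 0.3 to 1.1 PB per copy of the data, before replication.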
  31. Consistent Hashing (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x) • Lots of ingested logs are never inspected closely • We still need them for trending and health checking • We need a consistent sample of the logs
  32. Consistent Hashing (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x) • Add a UUID to incoming requests at the load balancer • Pass the UUID along with the request through the stack • Hash the UUID with murmur3 in a Logstash ruby filter (full filter below): code => "require 'murmurhash3'; event['request_id_sampling'] = (MurmurHash3::V32.str_hash(event['[request_id]']).to_f/4294967294)"
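     The filter above in context; a minimal sketch assuming the request UUID has already been parsed into a request_id field (Logstash 2.x event API, murmurhash3 gem installed on the processor nodes):

        filter {
          ruby {
            # murmur3-hash the request UUID; dividing by ~2^32 normalizes the unsigned
            # 32-bit hash into roughly [0, 1], and every pipeline that sees the same
            # UUID computes the same value
            code => "require 'murmurhash3'; event['request_id_sampling'] = (MurmurHash3::V32.str_hash(event['[request_id]']).to_f / 4294967294)"
          }
        }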
  33. Example request IDs and their hash values, identical in every log type (haproxy, application, database, reporting): xnutbbq -> .990101, prcbrpa -> .712654, azqlmik -> .921102, lzcdgxh -> .229192, hzjdvjf -> .321409, ezzqrte -> .187037, fhubsxu -> .590189, abkzrme -> .481930, xdheuem -> .003192, jjndtgu -> .113800
  34. Example request IDs and their hash values, identical in every log type (haproxy, application, database, reporting): xnutbbq -> .990101, prcbrpa -> .712654, azqlmik -> .921102, lzcdgxh -> .229192, hzjdvjf -> .321409, ezzqrte -> .187037, fhubsxu -> .590189, abkzrme -> .481930, xdheuem -> .003192, jjndtgu -> .113800
  35. After applying the sampling cut, the same requests survive in every log type (haproxy, application, database, reporting): xnutbbq -> .990101, azqlmik -> .921102
  36. After applying the sampling cut, the same requests survive in every log type (haproxy, application, database, reporting): xnutbbq -> .990101, azqlmik -> .921102
  37. Simple and Cheap Disaster Recovery (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x) • Set up a small cluster that drops 99% of all log data (filter sketch below) • Replicate Kibana dashboards between clusters
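     A minimal sketch of that drop, reusing the request_id_sampling value from the consistent-hashing filter; the 1% threshold matches the slide, the rest is an assumption:

        filter {
          # keep only ~1% of requests, and the SAME 1% across every log type,
          # because request_id_sampling is identical wherever a request appears
          if [request_id_sampling] > 0.01 {
            drop { }
          }
        }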
  38. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (16, Logstash) → Kafka messaging queue → Logstash → Production Clusters elk03a, elk03b, and elk03c (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)); Kibana on Search Nodes (4)
  39. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (16, Logstash) → Kafka messaging queue → Logstash → Production Clusters elk03a, elk03b, and elk03c (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)) plus Disaster Recovery Cluster elk04 (Master Nodes (1), Indexing Nodes (5), Storage Nodes (6)); Kibana on Search Nodes (4)
  40. Lag Du Jour • Most afternoons, ingestion performance dropped • No apparent reason, despite instrumentation and investigation • CPU time, disk I/O, network interfaces all normal
  41. Lag Du Jour • refresh_interval controls how often new Lucene segments are written and made searchable (and, indirectly, how often merges happen) • The rebalance threshold controls how "balanced" the cluster keeps its shards • The bulk queue depth applies back pressure to Logstash when Elasticsearch is busy
  42. Lag Du Jour • A classic hotspot problem where one node underperforms • Fixed by increasing refresh_interval • Fixed by enforcing total_shards_per_node (settings example below)
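     The two fixes expressed as dynamic index settings; the concrete values and index name are illustrative, not the production values:

        PUT logstash-2018.02.28/_settings
        {
          "index.refresh_interval": "30s",
          "index.routing.allocation.total_shards_per_node": 2
        }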
  43. Leading Wildcard Searches • Leading wildcard searches trigger a full table scan • This means reading all fields of all documents • With hundreds of billions of logs in indexes, this can take… time • Fix: indices.query.query_string.allowLeadingWildcard: false (see below)
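     The setting is a node-level option; with it in place, a query_string query with a leading wildcard (e.g. message:*timeout) is rejected at parse time instead of scanning every term:

        # elasticsearch.yml
        indices.query.query_string.allowLeadingWildcard: false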
  44. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (21, Logstash) → Kafka messaging queue → Logstash → Production Clusters elk05a, elk05b, elk05c (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)) and Disaster Recovery Cluster elk06 (Master Nodes (1), Indexing Nodes (5), Storage Nodes (6)); Kibana on Search Nodes (5)
  45. (Elasticsearch 5.5.0) • Running in a new datacenter, on slightly newer hardware • Apache Kafka 0.10.1 • Elasticsearch, Logstash, and Kibana up to date*
  46. Where We Are Now • When we started, a full cluster restart took ~6 hours • Now, a full restart of three clusters takes less than 45 minutes • More reliable, faster, and more resilient to failure • > 380,000 logs/s during stall recovery
  47. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (21, Logstash) → Kafka messaging queue → Logstash → Production Clusters elk05a, elk05b, elk05c (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)) and Disaster Recovery Cluster elk06 (Master Nodes (1), Indexing Nodes (5), Storage Nodes (6)); Kibana on Search Nodes (5)
  48. Testing Is Important: Cardinality Aggregations and You • A cardinality aggregation counts the unique items in a set (example below) • Elasticsearch uses HyperLogLog for this • 5.3.x had a (since fixed) bug where search result buckets were allocated before circuit breakers were evaluated
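     For reference, the kind of query involved; a minimal sketch with the index and field names assumed, where precision_threshold trades memory for HyperLogLog accuracy:

        GET logstash-2018.02.28/_search
        {
          "size": 0,
          "aggs": {
            "unique_users": {
              "cardinality": {
                "field": "user_id",
                "precision_threshold": 3000
              }
            }
          }
        }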
  49. Where We Go Next • Elasticsearch 6.x is going to require a lot of testing • Support > 400,000 logs/s in 2018 • Explore containerized deployment • Reduce the use of custom tooling; move to _rollover and Curator (example below)
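     A minimal sketch of the _rollover call (available from Elasticsearch 5.x); the write alias name and the conditions are assumptions:

        POST logs-write/_rollover
        {
          "conditions": {
            "max_age": "1d",
            "max_docs": 2000000000
          }
        }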
  50. Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/. Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders. Please attribute Elastic with a link to elastic.co