
Elastic{ON} 2018 - Scaling Log Aggregation At Fitbit

Elastic Co
March 01, 2018

Transcript

  1. Scaling Log Aggregation At Fitbit: 1) Introduction 2) A Series of Scaling Challenges 3) This Is Not A Spike 4) Onwards To 300,000 logs/s 5) Where We Go Next
  2. What Do We Log? • User profile activity • Device sync • 3rd party integration sync • Internal application and development events
  3. Three Kinds Of Search • Release Management and Incident Response • Historical Surveys • Historical Investigation / Incident Research
  4. Where We Started ~3 Years Ago • 35,000 logs/s, peaking at 45,000 logs/s • 6 days of retention • Between 15 and 18 billion logs on disk
  5. The Traditional Log Aggregation Pipeline: (1) Shipper → (2) Message Queue → (3) Processor → (4) Elasticsearch → (5) Kibana
  6. Where We Started: Host System (Logstash shipper) → Redis messaging queue (4) → Processor Nodes (4, Logstash) → Production Cluster (elk): Master Nodes (3), Data Nodes (30); Kibana on Search Nodes (4)
  7. Where We Are Now • 300,000 logs/s, peaking at 320,000 logs/s • 25-30 billion log messages a day • Between 7 and 90 days of retention, based on log type • Between 170 and 180 billion logs on disk • Daily Kafka throughput of 44-48 TB
  8. Where We Started: Host System (Logstash shipper) → Redis messaging queue (4) → Processor Nodes (4, Logstash) → Production Cluster (elk): Master Nodes (3), Data Nodes (30); Kibana on Search Nodes (4)
  9. Where We Are Now: Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (21, Logstash) → Kafka messaging queue → Logstash → Production Clusters elk05a, elk05b, elk05c (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)) and Disaster Recovery Cluster elk06 (Master Nodes (1), Indexing Nodes (5), Storage Nodes (6)); Kibana on Search Nodes (5)
  10. Scaling And Performance Issues (Elasticsearch 1.3.1 / Logstash 1.5.x / Kibana 3.x) • Issues with the accuracy of indices.breaker.fielddata.limit • Indexing fighting with search for disk and memory resources • Facet searches are expensive in both heap and CPU
  11. Scaling And Performance Issues (Elasticsearch 1.3.1 / Logstash 1.5.x / Kibana 3.x) • Individual haproxy hosts exceeding Logstash capacity • MySQL logs sometimes contain binary data • Cluster crash recovery took between 5 and 6 hours
  12. The New Cluster • Increase retention from 5 days to 30 days • Double host count for log sources • Ingest double the number of log types from existing hosts • Stop dropping haproxy logs • Handle 2+ years of growth, including spikes of up to 2x traffic
  13. The New Cluster • Replace Redis with Apache Kafka • Move to a tiered hot/warm architecture (allocation sketch below) • Archive logs offsite
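     A minimal sketch of hot/warm tiering via a custom node attribute; the attribute name box_type, template name, and index names here are assumptions, not the exact configuration used at Fitbit:

        # elasticsearch.yml on an indexing-tier ("hot") node -- attribute name is an assumption
        node.box_type: hot

        # elasticsearch.yml on a storage-tier ("warm") node
        node.box_type: warm

        # index template: new daily indices land on hot nodes first
        PUT _template/logs
        {
          "template": "logstash-*",
          "settings": { "index.routing.allocation.require.box_type": "hot" }
        }

        # later (e.g. nightly), push an aging index over to the warm tier
        PUT logstash-2018.02.28/_settings
        { "index.routing.allocation.require.box_type": "warm" }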
  14. Host System (Logstash shipper) → Redis messaging queue (4) → Processor Nodes (4, Logstash) → Production Cluster (elk): Master Nodes (3), Data Nodes (30); Kibana on Search Nodes (4)
  15. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (6, Logstash) → Production Cluster (elk02): Master Nodes (3), Indexing Nodes (25), Storage Nodes (35); Kibana on Search Nodes (4)
  16. Log Archiving (Elasticsearch 1.5.4 / Logstash 1.5.x / Kibana 3.x) • Kafka allows for multiple consumers of the same data • Logstash has a file output plugin (sketch below) • Dump the logs to disk by tier and type, pigz -9 to compress • AES encrypt and upload to S3
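     A minimal sketch of an archiving consumer along those lines; the ZooKeeper address, topic, consumer group, and paths are assumptions (Logstash 1.5-era kafka input shown):

        input {
          kafka {
            zk_connect => "zookeeper01:2181"   # assumption: 1.5-era kafka input, ZooKeeper-based
            topic_id   => "logs-application"   # one archiver pipeline per tier/type
            group_id   => "log-archiver"       # separate consumer group, so indexing keeps its own offsets
          }
        }
        output {
          file {
            path  => "/archive/application/%{+YYYY-MM-dd}/application.json"
            codec => json_lines
          }
        }
        # a cron job then compresses finished files with pigz -9, AES-encrypts them, and uploads to S3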
  17. The New Cluster (elk02), Elasticsearch 1.5.4 / Logstash 1.5.x / Kibana 3.x • Between 50,000 logs/s and 85,000 logs/s • Crash recovery was ~3 hours from start to finish
  18. The New, New Cluster: Design Goals • Support arbitrary, sustained ingestion doublings on short notice • Limit cluster node counts for performance and stability • Move to Elasticsearch 2.x and Kibana 4.x • Handle 2+ years of growth
  19. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (8, Logstash) → Production Cluster (elk02): Master Nodes (3), Indexing Nodes (25), Storage Nodes (35); Kibana on Search Nodes (4)
  20. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (8 + 8, Logstash) → Production Clusters elk03a and elk03b (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (80)); Kibana on Search Nodes (4)
  21. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (8 + 8, Logstash) → Production Clusters elk03a and elk03b (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (80)); Kibana on Search Nodes (4)
  22. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (8 + 8, Logstash) → Kafka messaging queue → Logstash → Production Clusters elk03a and elk03b (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)); Kibana on Search Nodes (4)
  23. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (16, Logstash) → Kafka messaging queue → Logstash → Production Clusters elk03a and elk03b (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)); Kibana on Search Nodes (4)
  24. A Word About Cluster Sizes (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x) • Each node maintains 13 TCP connections to every other node (worked out below) • Every node must acknowledge every change in cluster state • Zen discovery timeouts are set for a reason
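     Rough arithmetic on why node count matters: at 13 connections per node pair, a cluster of n nodes holds on the order of 13 × n × (n − 1) TCP connections. For roughly 73 nodes (3 masters + 30 indexing + 40 storage, ignoring client and search nodes) that is about 13 × 73 × 72 ≈ 68,000 connections, and every cluster-state change still has to be acknowledged by every node.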
  25. A Word About Tribe Nodes (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x) • Tribe nodes are full members of all clusters they connect to • Queries issued to tribe nodes go to all clusters • Search results from each cluster are merged at the tribe node • For large clusters, upgrades are no longer possible • Stability is questionable for the tribe node
  26. A Word About Link Latency (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x) • Take the bandwidth-delay product into account (example below) • Older Logstash can't adjust TCP window sizes • Kafka MirrorMaker has settings to adjust its socket buffers
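     As an illustration of sizing those buffers from the bandwidth-delay product: a 1 Gbit/s link with an 80 ms round trip keeps about 10 MB in flight, so the socket buffers need to be at least that large. The link numbers are illustrative, and the property names are standard 0.10-era Kafka client settings rather than anything Fitbit-specific:

        # BDP = bandwidth x RTT = (1 Gbit/s / 8) x 0.080 s = 10 MB in flight
        # MirrorMaker consumer.properties
        receive.buffer.bytes=10485760
        # MirrorMaker producer.properties
        send.buffer.bytes=10485760
        # the OS must also allow buffers this large (net.core.rmem_max / net.core.wmem_max)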
  27. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (16, Logstash) → Kafka messaging queue → Logstash → Production Clusters elk03a and elk03b (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)); Kibana on Search Nodes (4)
  28. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (16, Logstash) → Kafka messaging queue → Logstash → Production Clusters elk03a, elk03b, and elk03c (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)); Kibana on Search Nodes (4)
  29. Logstash Indexing Nodes • Each host runs Logstash indexers that consume from Kafka • Each host runs 3 Elasticsearch client nodes • Logstash writes round-robin to the client nodes via a local haproxy (sketch below)
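     A minimal sketch of that local load-balancing layer on an indexing host; the ports and names are assumptions:

        # haproxy.cfg -- Logstash writes to 127.0.0.1:9200, haproxy fans out round-robin
        listen elasticsearch_clients
            bind 127.0.0.1:9200
            mode tcp
            balance roundrobin
            server es-client-1 127.0.0.1:9201 check
            server es-client-2 127.0.0.1:9202 check
            server es-client-3 127.0.0.1:9203 check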
  30. E_TOO_MUCH_OPEX (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x) • At 30 days of retention, 170,000 logs/s is ~450 billion logs on disk (arithmetic below) • Logs are between 673 and 2,600 bytes each • Not all logs are created equal • Some can be dropped early
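     Checking the arithmetic: 170,000 logs/s × 86,400 s/day × 30 days ≈ 441 billion logs, and at 673 to 2,600 bytes per log that is very roughly 0.3 to 1.1 PB per copy of the data, before replication.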
  31. Consistent Hashing (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x) • Lots of ingested logs are never inspected closely • We still need them for trending and health checking • We need a consistent sample of the logs
  32. Consistent Hashing (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x) • Add a UUID to incoming requests at the load balancer • Pass the UUID along with the request through the stack • Hash the UUID with murmur3 in a Logstash ruby filter (full filter below): code => "require 'murmurhash3'; event['request_id_sampling'] = (MurmurHash3::V32.str_hash(event['[request_id]']).to_f/4294967294)"
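     The filter above in context; a minimal sketch assuming the request UUID has already been parsed into a request_id field (Logstash 2.x event API, murmurhash3 gem installed on the processor nodes):

        filter {
          ruby {
            # murmur3-hash the request UUID; dividing by ~2^32 normalizes the unsigned
            # 32-bit hash into roughly [0, 1], and every pipeline that sees the same
            # UUID computes the same value
            code => "require 'murmurhash3'; event['request_id_sampling'] = (MurmurHash3::V32.str_hash(event['[request_id]']).to_f / 4294967294)"
          }
        }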
  33. Example request IDs and their hash values, identical in every log type (haproxy, application, database, reporting): xnutbbq -> .990101, prcbrpa -> .712654, azqlmik -> .921102, lzcdgxh -> .229192, hzjdvjf -> .321409, ezzqrte -> .187037, fhubsxu -> .590189, abkzrme -> .481930, xdheuem -> .003192, jjndtgu -> .113800
  34. Example request IDs and their hash values, identical in every log type (haproxy, application, database, reporting): xnutbbq -> .990101, prcbrpa -> .712654, azqlmik -> .921102, lzcdgxh -> .229192, hzjdvjf -> .321409, ezzqrte -> .187037, fhubsxu -> .590189, abkzrme -> .481930, xdheuem -> .003192, jjndtgu -> .113800
  35. After applying the sampling cut, the same requests survive in every log type (haproxy, application, database, reporting): xnutbbq -> .990101, azqlmik -> .921102
  36. After applying the sampling cut, the same requests survive in every log type (haproxy, application, database, reporting): xnutbbq -> .990101, azqlmik -> .921102
  37. Simple and Cheap Disaster Recovery (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x) • Set up a small cluster that drops 99% of all log data (filter sketch below) • Replicate Kibana dashboards between clusters
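     A minimal sketch of that drop, reusing the request_id_sampling value from the consistent-hashing filter; the 1% threshold matches the slide, the rest is an assumption:

        filter {
          # keep only ~1% of requests, and the SAME 1% across every log type,
          # because request_id_sampling is identical wherever a request appears
          if [request_id_sampling] > 0.01 {
            drop { }
          }
        }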
  38. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (16, Logstash) → Kafka messaging queue → Logstash → Production Clusters elk03a, elk03b, and elk03c (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)); Kibana on Search Nodes (4)
  39. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (16, Logstash) → Kafka messaging queue → Logstash → Production Clusters elk03a, elk03b, and elk03c (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)) plus Disaster Recovery Cluster elk04 (Master Nodes (1), Indexing Nodes (5), Storage Nodes (6)); Kibana on Search Nodes (4)
  40. Lag Du Jour • Most afternoons, ingestion performance dropped • No apparent reason, despite instrumentation and investigation • CPU time, disk I/O, network interfaces all normal
  41. Lag Du Jour • refresh_interval controls how often new Lucene segments are written and made searchable (and, indirectly, how often merges happen) • The rebalance threshold controls how "balanced" the cluster keeps its shards • The bulk queue depth applies back pressure to Logstash when Elasticsearch is busy
  42. Lag Du Jour • A classic hotspot problem where one node underperforms • Fixed by increasing refresh_interval • Fixed by enforcing total_shards_per_node (settings example below)
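     The two fixes expressed as dynamic index settings; the concrete values and index name are illustrative, not the production values:

        PUT logstash-2018.02.28/_settings
        {
          "index.refresh_interval": "30s",
          "index.routing.allocation.total_shards_per_node": 2
        }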
  43. Leading Wildcard Searches • Leading wildcard searches trigger a full table scan • This means reading all fields of all documents • With hundreds of billions of logs in indexes, this can take… time • Fix: indices.query.query_string.allowLeadingWildcard: false (see below)
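     The setting is a node-level option; with it in place, a query_string query with a leading wildcard (e.g. message:*timeout) is rejected at parse time instead of scanning every term:

        # elasticsearch.yml
        indices.query.query_string.allowLeadingWildcard: false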
  44. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (21, Logstash) → Kafka messaging queue → Logstash → Production Clusters elk05a, elk05b, elk05c (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)) and Disaster Recovery Cluster elk06 (Master Nodes (1), Indexing Nodes (5), Storage Nodes (6)); Kibana on Search Nodes (5)
  45. (Elasticsearch 5.5.0) • Running in a new datacenter, on slightly newer hardware • Apache Kafka 0.10.1 • Elasticsearch, Logstash, and Kibana up to date*
  46. Where We Are Now • When we started, a full cluster restart took ~6 hours • Now, a full restart of three clusters takes less than 45 minutes • More reliable, faster, and more resilient to failure • > 380,000 logs/s during stall recovery
  47. Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (21, Logstash) → Kafka messaging queue → Logstash → Production Clusters elk05a, elk05b, elk05c (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)) and Disaster Recovery Cluster elk06 (Master Nodes (1), Indexing Nodes (5), Storage Nodes (6)); Kibana on Search Nodes (5)
  48. Testing Is Important: Cardinality Aggregations and You • A cardinality aggregation counts the unique items in a set (example below) • Elasticsearch uses HyperLogLog for this • 5.3.x had a (since fixed) bug where search result buckets were allocated before circuit breakers were evaluated
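     For reference, the kind of query involved; a minimal sketch with the index and field names assumed, where precision_threshold trades memory for HyperLogLog accuracy:

        GET logstash-2018.02.28/_search
        {
          "size": 0,
          "aggs": {
            "unique_users": {
              "cardinality": {
                "field": "user_id",
                "precision_threshold": 3000
              }
            }
          }
        }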
  49. Where We Go Next • Elasticsearch 6.x is going to require a lot of testing • Support > 400,000 logs/s in 2018 • Explore containerized deployment • Reduce the use of custom tooling; move to _rollover and Curator (example below)
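     A minimal sketch of the _rollover call (available from Elasticsearch 5.x); the write alias name and the conditions are assumptions:

        POST logs-write/_rollover
        {
          "conditions": {
            "max_age": "1d",
            "max_docs": 2000000000
          }
        }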
  50. Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/. Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders. Please attribute Elastic with a link to elastic.co