Slide 1

Scaling Log Aggregation At Fitbit
Breandan Dezendorf, Visibility Engineering (@bwdezend)
42Lines, February 28th, 2018

Slide 2

Scaling Log Aggregation At Fitbit
1. Introduction
2. A Series of Scaling Challenges
3. This Is Not A Spike
4. Onwards To 300,000 logs/s
5. Where We Go Next

Slide 3

What Do We Log?
• User profile activity
• Device sync
• 3rd party integration sync
• Internal application and development events

Slide 4

FITBIT NEVER PUTS HIPAA OR PCI DATA IN ELK

Slide 5

Three Kinds Of Search
• Release Management and Incident Response
• Historical Surveys
• Historical Investigation / Incident Research

Slide 6

Where We Started ~3 Years Ago
• 35,000 logs/s, peaking at 45,000 logs/s
• 6 days of retention
• Between 15 and 18 billion logs on disk

Slide 7

The Traditional Log Aggregation Pipeline: (1) Shipper → (2) Message Queue → (3) Processor → (4) Elasticsearch → (5) Kibana

Slide 8

Where We Started (diagram): Host System (Logstash shipper) → Redis messaging queue (4 nodes) → Processor Nodes (4, Logstash) → Production Cluster (elk): Master Nodes (3), Data Nodes (30) → Search Nodes (4) → Kibana

Slide 9

Where We Are Now
• 300,000 logs/s, peaking at 320,000 logs/s
• 25 - 30 billion log messages a day
• Between 7 and 90 days of retention, based on log type
• Between 170 and 180 billion logs on disk
• Kafka throughput of 44 - 48 TB/day

Slide 10

Where We Started (diagram): Host System (Logstash shipper) → Redis messaging queue (4 nodes) → Processor Nodes (4, Logstash) → Production Cluster (elk): Master Nodes (3), Data Nodes (30) → Search Nodes (4) → Kibana

Slide 11

Where We Are Now (diagram): Host System (Logstash shipper) → Kafka messaging queues (2) → Processor Nodes (21, Logstash) → Production Clusters elk05a, elk05b, elk05c (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)) → Search Nodes (5) → Kibana; a separate Logstash consumer feeds the Disaster Recovery Cluster (elk06): Master Nodes (1), Indexing Nodes (5), Storage Nodes (6)

Slide 12

Scaling And Performance Issues (Elasticsearch 1.3.1 / Logstash 1.5.x / Kibana 3.x)
• Issues with the accuracy of indices.breaker.fielddata.limit
• Indexing fighting with search for disk and memory resources
• Facet searches are expensive in both heap and CPU

Slide 13

Scaling And Performance Issues (Elasticsearch 1.3.1 / Logstash 1.5.x / Kibana 3.x)
• Individual haproxy hosts exceeding Logstash capacity
• MySQL logs sometimes contain binary data
• Cluster crash recovery took between 5 and 6 hours

Slide 14

This Is Not Acceptable

Slide 15

The New Cluster
• Increase retention from 5 days to 30 days
• Double host count for log sources
• Ingest double the number of log types from existing hosts
• Stop dropping haproxy logs
• Handle 2+ years of growth, including spikes of up to 2x traffic

Slide 16

The New Cluster
• Replace Redis with Apache Kafka
• Move to a tiered hot/warm architecture (see the sketch below)
• Archive logs offsite
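
How a hot/warm tier is typically wired up, as a minimal sketch against the Elasticsearch REST API using Python's requests library. The talk does not show its exact configuration, so the node attribute name box_type, the template name, the index pattern, and the host below are illustrative assumptions, not Fitbit's settings.

```python
import requests

ES = "http://localhost:9200"  # assumed address of a client/coordinating node

# New daily indices land on the "hot" (indexing) tier. Nodes are tagged in
# elasticsearch.yml, e.g. node.box_type: hot (2.x) or node.attr.box_type: hot (5.x).
requests.put(f"{ES}/_template/logs_hot", json={
    "template": "logstash-*",
    "settings": {"index.routing.allocation.require.box_type": "hot"},
})

# Once an index is a few days old, a scheduled job moves it to the larger
# "warm" storage tier simply by changing the allocation requirement.
requests.put(f"{ES}/logstash-2018.02.21/_settings", json={
    "index.routing.allocation.require.box_type": "warm",
})
```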

Slide 17

(Diagram) Host System (Logstash shipper) → Redis messaging queue (4 nodes) → Processor Nodes (4, Logstash) → Production Cluster (elk): Master Nodes (3), Data Nodes (30) → Search Nodes (4) → Kibana

Slide 18

(Diagram) Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (6, Logstash) → Production Cluster (elk02): Master Nodes (3), Indexing Nodes (25), Storage Nodes (35) → Search Nodes (4) → Kibana

Slide 19

Log Archiving (Elasticsearch 1.5.4 / Logstash 1.5.x / Kibana 3.x)
• Kafka allows for multiple consumers of the same data
• Logstash has a file output plugin
• Dump the logs to disk by tier and type, compress with pigz -9
• AES encrypt and upload to S3 (see the sketch below)
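
The archive path in the talk is Logstash's file output plus pigz -9, AES encryption, and an S3 upload. As a rough sketch of the same idea (not the configuration Fitbit ran), a second Kafka consumer group can drain the same topic into compressed files and push them offsite; the topic name, bucket, batch size, and the use of S3 server-side AES256 encryption here are assumptions.

```python
import gzip

import boto3
from kafka import KafkaConsumer  # kafka-python

# A separate consumer group reads the same topic the indexers consume,
# so archiving never interferes with the indexing pipeline's offsets.
consumer = KafkaConsumer(
    "logs-application",                 # assumed topic name
    bootstrap_servers=["kafka01:9092"],
    group_id="log-archiver",
    auto_offset_reset="earliest",
)

archive_path = "/var/tmp/logs-application.json.gz"

# Write a compressed batch to disk (the talk uses Logstash's file output
# plus `pigz -9`; gzip stands in for that step here).
with gzip.open(archive_path, "wb") as out:
    for _, msg in zip(range(500_000), consumer):
        out.write(msg.value + b"\n")

# Ship the batch offsite. The talk AES-encrypts before uploading; S3
# server-side AES256 encryption is used here as a stand-in.
boto3.client("s3").upload_file(
    archive_path,
    "example-log-archive",
    "2018/02/28/logs-application.json.gz",
    ExtraArgs={"ServerSideEncryption": "AES256"},
)
```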

Slide 20

The New Cluster (elk02): Elasticsearch 1.5.4 / Logstash 1.5.x / Kibana 3.x
• Between 50,000 logs/s and 85,000 logs/s
• Crash recovery was ~3 hours from start to finish

Slide 21

"Change is the only constant." (Heraclitus of Ephesus)

Slide 22

Typical 7 Day Indexing

Slide 23

Not Typical Anymore

Slide 24

This Isn’t A Spike

Slide 25

The New, New Cluster: Design Goals
• Support arbitrary, sustained ingestion doublings on short notice
• Limit cluster node counts for performance and stability
• Move to Elasticsearch 2.x and Kibana 4.x
• Handle 2+ years of growth

Slide 26

(Diagram) Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (8, Logstash) → Production Cluster (elk02): Master Nodes (3), Indexing Nodes (25), Storage Nodes (35) → Search Nodes (4) → Kibana

Slide 27

(Diagram) Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (8 + 8, Logstash) → Production Clusters elk03a and elk03b (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (80)) → Search Nodes (4) → Kibana

Slide 28

"By the way, could you also switch datacenters?"

Slide 29

(Diagram) Host System (Logstash shipper) → Kafka messaging queue → Processor Nodes (8 + 8, Logstash) → Production Clusters elk03a and elk03b (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (80)) → Search Nodes (4) → Kibana

Slide 30

(Diagram) Host System (Logstash shipper) → Kafka messaging queues (2) → Processor Nodes (8 + 8, Logstash) → Production Clusters elk03a and elk03b (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)) → Search Nodes (4) → Kibana

Slide 31

(Diagram) Host System (Logstash shipper) → Kafka messaging queues (2) → Processor Nodes (16, Logstash) → Production Clusters elk03a and elk03b (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)) → Search Nodes (4) → Kibana

Slide 32

A Word About Cluster Sizes (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x)
• Each node maintains 13 TCP connections to every other node (see the arithmetic below)
• Every node must acknowledge every change in cluster state
• Zen Discovery timeouts are set for a reason
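
Back-of-envelope arithmetic for why node count matters; the cluster sizes below are illustrative (roughly the shape of one elk03 cluster with 40 vs. 80 storage nodes), not exact figures from the talk.

```python
# Each node keeps ~13 TCP transport connections to every other node,
# so the connection mesh grows quadratically with cluster size.
for nodes in (73, 113):  # e.g. 3 masters + 30 indexing + 40 or 80 storage nodes
    per_node = 13 * (nodes - 1)
    total = 13 * nodes * (nodes - 1)
    print(f"{nodes:>3} nodes: {per_node} connections per node, ~{total:,} in the cluster")
```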

Slide 33

A Word About Tribe Nodes (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x)
• Tribe nodes are full members of every cluster they connect to
• Queries issued to tribe nodes go to all clusters
• Search results are merged at the tribe node before being returned
• For large clusters, upgrades are no longer possible
• Stability is questionable for the tribe node

Slide 34

A Word About Link Latency (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x)
• Take the Bandwidth-Delay Product into account (worked example below)
• Older Logstash can't adjust TCP window sizes
• Kafka MirrorMaker has settings to adjust its socket buffers
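
A worked Bandwidth-Delay Product example; the link speed and RTT are illustrative, not the actual inter-datacenter numbers from the talk. Socket buffers smaller than the BDP cap per-connection throughput, which is why MirrorMaker's consumer and producer buffer settings (receive.buffer.bytes / send.buffer.bytes) matter on long links.

```python
# Bandwidth-Delay Product: the number of bytes "in flight" on the link.
bandwidth_bits_per_s = 10 * 10**9   # assumed 10 Gb/s inter-datacenter link
rtt_s = 0.060                       # assumed 60 ms round-trip time

bdp_bytes = bandwidth_bits_per_s * rtt_s / 8
print(f"BDP = {bdp_bytes / 2**20:.0f} MiB in flight")   # ~72 MiB

# A socket buffer smaller than the BDP caps a single connection at
# roughly buffer_size / RTT, no matter how fast the link is.
buffer_bytes = 128 * 1024           # a typical default socket buffer
print(f"128 KiB buffers cap one connection at ~{buffer_bytes / rtt_s / 2**20:.1f} MiB/s")
```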

Slide 35

(Diagram) Host System (Logstash shipper) → Kafka messaging queues (2) → Processor Nodes (16, Logstash) → Production Clusters elk03a and elk03b (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)) → Search Nodes (4) → Kibana

Slide 36

(Diagram) Host System (Logstash shipper) → Kafka messaging queues (2) → Processor Nodes (16, Logstash) → Production Clusters elk03a, elk03b, elk03c (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)) → Search Nodes (4) → Kibana

Slide 37

Logstash Indexing Nodes
• Each host runs Logstash indexers that consume from Kafka
• Each host runs 3 Elasticsearch client nodes
• Logstash writes round-robin to the client nodes via a local haproxy

Slide 38

(Animated gif: throwing away money)

Slide 39

E_TOO_MUCH_OPEX (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x)
• At 30 days of retention, 170,000 logs/s is roughly 450 billion logs on disk (back-of-envelope below)
• Logs are between 673 and 2,600 bytes each
• Not all logs are created equal
• Some can be dropped early
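
A back-of-envelope check on those figures; the average log size and replica count are assumptions, not numbers from the talk.

```python
logs_per_s = 170_000
seconds_per_day = 86_400
retention_days = 30

logs_on_disk = logs_per_s * seconds_per_day * retention_days
print(f"{logs_on_disk / 1e9:.0f} billion logs retained")    # ~441 billion

# Assuming an average of ~1.5 KB per log and one replica in Elasticsearch:
avg_bytes_per_log = 1_500
replicas = 1
total_tb = logs_on_disk * avg_bytes_per_log * (1 + replicas) / 1e12
print(f"~{total_tb:.0f} TB on disk before index overhead")  # roughly 1.3 PB
```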

Slide 40

Consistent Hashing (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x)
• Lots of ingested logs are never inspected closely
• We need them for trending and health checking
• We need a consistent portion of the logs

Slide 41

Consistent Hashing (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x)
• Add a UUID to incoming requests at the load balancer
• Pass the UUID along with the request through the stack
• Hash the UUID with murmur3 in a Logstash ruby filter (Python sketch below)
• code => "require 'murmurhash3'; event['request_id_sampling'] = (MurmurHash3::V32.str_hash(event['[request_id]']).to_f/4294967294)"
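
The same calculation as the Logstash ruby filter above, sketched in Python with the mmh3 bindings; the 1% keep-fraction is illustrative, not the production threshold. Because every tier hashes the same request ID, either all of a request's logs survive sampling or none do.

```python
import mmh3  # MurmurHash3 bindings, standing in for the murmurhash3 Ruby gem

def sampling_value(request_id: str) -> float:
    """Map a request ID to a stable value in [0, 1], identical on every host."""
    unsigned = mmh3.hash(request_id) & 0xFFFFFFFF   # 32-bit murmur3, made unsigned
    return unsigned / 4294967294.0                  # same divisor as the logstash filter

def keep_for_sampling(request_id: str, keep_fraction: float = 0.01) -> bool:
    """Keep ~1% of requests end-to-end (illustrative threshold)."""
    return sampling_value(request_id) >= 1.0 - keep_fraction

print(sampling_value("xnutbbq"), keep_for_sampling("xnutbbq"))
```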

Slide 42

(Diagram) Ten request IDs and their sampling values, as seen by the haproxy, application, database, and reporting tiers (identical on every tier): xnutbbq -> .990101, prcbrpa -> .712654, azqlmik -> .921102, lzcdgxh -> .229192, hzjdvjf -> .321409, ezzqrte -> .187037, fhubsxu -> .590189, abkzrme -> .481930, xdheuem -> .003192, jjndtgu -> .113800

Slide 43

(Diagram, repeated from the previous slide) Every tier computes the same sampling value for the same request ID: xnutbbq -> .990101, prcbrpa -> .712654, azqlmik -> .921102, lzcdgxh -> .229192, hzjdvjf -> .321409, ezzqrte -> .187037, fhubsxu -> .590189, abkzrme -> .481930, xdheuem -> .003192, jjndtgu -> .113800

Slide 44

(Diagram) After applying the sampling threshold, the same two request IDs survive on every tier (haproxy, application, database, reporting): xnutbbq -> .990101, azqlmik -> .921102

Slide 45

(Diagram, repeated from the previous slide) Only xnutbbq -> .990101 and azqlmik -> .921102 are kept, consistently across haproxy, application, database, and reporting

Slide 46

Simple and Cheap Disaster Recovery (Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x)
• Set up a small cluster that drops 99% of all log data
• Replicate Kibana dashboards between clusters

Slide 47

(Diagram) Host System (Logstash shipper) → Kafka messaging queues (2) → Processor Nodes (16, Logstash) → Production Clusters elk03a, elk03b, elk03c (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)) → Search Nodes (4) → Kibana

Slide 48

(Diagram) Host System (Logstash shipper) → Kafka messaging queues (2) → Processor Nodes (16, Logstash) → Production Clusters elk03a, elk03b, elk03c (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)) → Search Nodes (4) → Kibana; a separate Logstash consumer feeds the Disaster Recovery Cluster (elk04): Master Nodes (1), Indexing Nodes (5), Storage Nodes (6)

Slide 49

Other Problems

Slide 50

Lag Du Jour
• Most afternoons, ingestion performance dropped
• No apparent reason despite instrumentation and investigation
• CPU time, disk I/O, and network interfaces all normal

Slide 51

Lag Du Jour
• refresh_interval controls how often new Lucene segments are written and made searchable
• The cluster rebalance threshold (cluster.routing.allocation.balance.threshold) controls how "balanced" the cluster is
• The bulk queue depth applies back pressure to Logstash when Elasticsearch is busy

Slide 52

(Diagram: shard layout across node1 through node5)

Slide 53

(Diagram: shard layout across node1 through node5)

Slide 54

(Diagram: shard layout across node1 through node5)

Slide 55

Lag Du Jour
• Classic hotspot problem where one node underperforms
• Fix by increasing refresh_interval
• Fix by enforcing total_shards_per_node (settings sketch below)
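
Both fixes are plain index settings; a minimal sketch against the REST API (the index name, values, and host are illustrative, not the production values from the talk).

```python
import requests

ES = "http://localhost:9200"   # assumed client node address

requests.put(f"{ES}/logstash-2018.02.28/_settings", json={
    "index": {
        # Refresh (make new segments searchable) less often, so indexing
        # spends less time on segment churn during peak hours.
        "refresh_interval": "30s",
        # Spread each index's shards evenly so a single node can't become
        # the hotspot that stalls the whole bulk pipeline.
        "routing.allocation.total_shards_per_node": 2,
    }
})
```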

Slide 56

(Image)

Slide 57

(Image)

Slide 58

Leading Wildcard Searches
• Leading wildcard searches trigger the equivalent of a full table scan
• Elasticsearch must examine every term in the field's index instead of seeking to a prefix
• With hundreds of billions of logs in the indices, this can take… time
• Fix: indices.query.query_string.allowLeadingWildcard: false (example below)
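
For illustration, this is the kind of query the setting rejects (index, field, pattern, and host are hypothetical): a pattern that starts with a wildcard cannot seek to a prefix in the term dictionary, so every term has to be examined.

```python
import requests

ES = "http://localhost:9200"   # assumed client node address

# With indices.query.query_string.allowLeadingWildcard: false in
# elasticsearch.yml, this query is rejected instead of crawling every term.
resp = requests.post(f"{ES}/logstash-*/_search", json={
    "size": 0,
    "query": {"query_string": {"query": "message:*timeout"}},
})
print(resp.json())
```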

Slide 59

(Image)

Slide 60

(Image)

Slide 61

"Could you change datacenters again?"

Slide 62

(Diagram) Host System (Logstash shipper) → Kafka messaging queues (2) → Processor Nodes (21, Logstash) → Production Clusters elk05a, elk05b, elk05c (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)) → Search Nodes (5) → Kibana; a separate Logstash consumer feeds the Disaster Recovery Cluster (elk06): Master Nodes (1), Indexing Nodes (5), Storage Nodes (6)

Slide 63

Elasticsearch 5.5.0
• Running in a new datacenter, on slightly newer hardware
• Apache Kafka 0.10.1
• Elasticsearch, Logstash, and Kibana up to date*

Slide 64

Where We Are Now
• When we started, a full cluster restart took ~6 hours
• Now, a full restart of three clusters takes less than 45 minutes
• More reliable, faster, and more resilient to failure
• > 380,000 logs/s during stall recovery

Slide 65

(Diagram) Host System (Logstash shipper) → Kafka messaging queues (2) → Processor Nodes (21, Logstash) → Production Clusters elk05a, elk05b, elk05c (each: Master Nodes (3), Indexing Nodes (30), Storage Nodes (40)) → Search Nodes (5) → Kibana; a separate Logstash consumer feeds the Disaster Recovery Cluster (elk06): Master Nodes (1), Indexing Nodes (5), Storage Nodes (6)

Slide 66

Testing Is Important: Cardinality Aggregations and You
• A unique count of items in a set
• Elasticsearch uses HyperLogLog (example query below)
• 5.3.x has a (since fixed) bug where search result buckets are allocated before circuit breakers are evaluated
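
A cardinality aggregation for reference (index, field, and host are hypothetical); it returns an approximate distinct count backed by HyperLogLog++, with precision_threshold trading memory for accuracy.

```python
import requests

ES = "http://localhost:9200"   # assumed client node address

resp = requests.post(f"{ES}/logstash-*/_search", json={
    "size": 0,
    "aggs": {
        "unique_devices": {
            "cardinality": {
                "field": "device_id",          # hypothetical field
                "precision_threshold": 3000,   # more memory, better accuracy
            }
        }
    },
})
print(resp.json()["aggregations"]["unique_devices"]["value"])
```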

Slide 67

Where We Go Next
• Elasticsearch 6.x is going to require a lot of testing
• Support > 400,000 logs/s in 2018
• Explore containerized deployment
• Reduce use of custom tooling; move to _rollover and Curator (see the sketch below)
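
The _rollover API cuts over to a fresh index behind a write alias once the current one is too old or too big, which is what replaces much of the custom rotation tooling; a minimal sketch (alias name, conditions, and host are illustrative).

```python
import requests

ES = "http://localhost:9200"   # assumed client node address

# "logs-write" is an alias pointing at the current write index, e.g. logs-000042.
resp = requests.post(f"{ES}/logs-write/_rollover", json={
    "conditions": {
        "max_age": "1d",               # roll daily...
        "max_docs": 2_000_000_000,     # ...or when the index gets too large
    }
})
print(resp.json())   # reports whether a new index was created, and why
```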

Slide 68

More Questions? Visit us at the AMA

Slide 69

www.elastic.co

Slide 70

Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/. Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third-party marks and brands are the property of their respective holders. Please attribute Elastic with a link to elastic.co.