
Scaling Log Aggregation At Fitbit with Elasticsearch

Elastic Co
July 14, 2017

Scaling Log Aggregation At Fitbit with Elasticsearch by Breandan Dezendorf

A walkthrough of scaling an Elasticsearch-based log aggregation pipeline from 30,000 logs per second to over 225,000 logs per second in a demanding multi-user environment. The process involved upgrading every part of the pipeline and swapping out major architectural components along the way. Also discussed are design considerations and challenges for disaster recovery, long-term archiving, and the practical limits of running very large, cost-effective Elasticsearch clusters.

Breandan has been working in UNIX and Linux operations for over 15 years. His specialties include monitoring, alerting, trending, and log aggregation at scale. Recently he has focused on scaling log aggregation for Fitbit, Inc. to over 225,000 logs per second.

https://www.meetup.com/Elastic-Triangle-User-Group/events/240095445/


Transcript

  1. SCALABLE LOG AGGREGATION AT FITBIT BREANDAN DEZENDORF BREANDAN@42LINES.NET @BWDEZEND

  2. CLASSIC ELK STACK OVERVIEW

  3. THE ROAD SO FAR OVERVIEW • Started at 35,000 logs/s,

    6 days of retention (15-18 billion logs before replication) • Currently averaging 202,000 logs/s, between 7 and 90 days of retention (160 billion logs before replication), with a sustained peak of 240,000 logs/s • Forecasting suggests we need to sustain > 300,000 logs/s by Q2 of 2018, with burst capacity of 400,000 logs/s.
  4. PRIVACY AND DATA SOURCES OVERVIEW • Fitbit never puts tracker

    data in ELK • No step counts, no heartbeats, no elevation changes • No payments information • Only application/firewall/database logs
  5. AT LEAST, FOR US THREE KINDS OF SEARCH • Release

    Monitoring / Incident Management • Historical Surveys • Historical Incidents
  6. (ELASTICSEARCH 1.5) IN THE BEGINNING • 35,000 logs/s average, with

    peaks of 45,000 during the day's high-water mark • 55,000 logs/s during indexing stall recovery
  7. (ELASTICSEARCH 1.5) IN THE BEGINNING • Elasticsearch 1.5.x, Kibana 3.x

    • 3 masters (16GB RAM, spinning disks) • 30 data nodes (64GB RAM, SSDs) • 4 api nodes (64GB RAM, SSDs) • 4 Redis hosts (16 GB RAM, spinning disks)
  8. (ELASTICSEARCH 1.5) SCALING AND PERFORMANCE • Issues with accuracy of

    indices.breaker.fielddata.limit • ZenDisco has a fixed 30 second timeout (with two retries) • Indexing fighting with search for resources • facet searches are expensive and consume lots of heap
  9. (ELASTICSEARCH 1.5) SCALING AND PERFORMANCE • haproxy logs were high

    enough volume that logstash/redis couldn’t keep up. UDP to the rescue. • Performance was still bad, so all haproxy logs with 200-series http status codes were dropped • Grok parsing is expensive on logs • mysql slow logs have binary data
  10. –Heraclitus of Ephesus “CHANGE IS THE ONLY CONSTANT.”

  11. NEW DESIGN GOALS • Increase retention from 5 to 30

    days • Double the number of hosts sending logs • Index a larger set of log types • Stop dropping haproxy 200s • Handle two years of growth, including spikes to 2x traffic
  12. PLANNING THE NEW CLUSTER: ELK02 • Replace Redis with Apache

    Kafka • Move to hot/warm architecture • Archive logs to S3 • Analyze kibana query logs
  13. DESIGN THE NEW CLUSTER: ELK02 • Increase retention from 5

    to 30 days • Double the number of hosts sending logs • Index a larger set of log types • Stop dropping haproxy 200s • Handle two years of growth, including spikes to 2x traffic
  14. REDIS VS KAFKA THE NEW CLUSTER: ELK02 • Redis has

    no cluster needs, and is simple to operate/understand • With Logstash + Redis, there is limited backlog queuing (memory based redis queue) • Data can only be consumed once • Standing up more single points of failure (queues) wasn’t going to scale
  15. LET'S NOT DO THIS THIS WOULD BE BAD

  16. HOT/WARM TIERS THE NEW CLUSTER: ELK02 • Separate search from

    indexing • Live data is indexed into hosts with only 48 hours of indexes • More memory, CPU and disk I/O available for indexing incoming data • Older data has different access patterns, and spinning disks are good enough*
  17. DATA MIGRATION THE NEW CLUSTER: ELK02 • Data migration between

    tiers is simple:

    node.tag_tier: elk02-data-rw
    node.tag_tier: elk02-data-ro

    curl -XPUT localhost:9200/logstash-2017-07-12-main/_settings -d '{
      "index.routing.allocation.exclude.tag": "elk02-data-rw",
      "index.routing.allocation.include.tag": "elk02-data-ro"
    }'
  18. DONEC QUIS NUNC

  19. INGEST ELASTICSEARCH LOGS FOR ANALYSIS OTHER GOOD IDEAS • Ingest

    elasticsearch index and slow query logs into elasticsearch • Be careful of logging loops! • You can now see the queries, how efficient they are, and which indexes are being accessed • Very useful for understanding what users want to do, and helpful when there are problems
  20. OTHER GOOD IDEAS if [type] == "elasticsearch" { grok {

    match => {
      "message" => "%{DATESTAMP:timestamp}\]\[%{WORD:level}\s*\]\[%{NOTSPACE:module}\s*\]( \[%{NOTSPACE:node}\s*\])? %{GREEDYDATA:es_message}"
    }
  }
}
  21. if [type] == "es_slow_query" { grok { match => {

    "message" => "%{DATESTAMP:timestamp}\]\[%{WORD:level}\s*\]\[%{NOTSPACE:module}\s*\]( \[%{NOTSPACE:node}\s*\])? ( \[%{NOTSPACE:index}\]\[%{NUMBER:index_shard}\])? took\[%{NOTSPACE:took}\], took_millis\[%{NUMBER:took_millis}\], types\[(%{NOTSPACE:types})?\], stats\[(%{NOTSPACE:stats})?\], search_type\[(%{NOTSPACE:search_type})?\], total_shards\[%{NUMBER:total_shards}\], source\[%{GREEDYDATA:source_query}\], extra_source\[(%{GREEDYDATA:extra_source})?\],"
      }
    }
    grok {
      match => { "source_query" => "{\"range\":{\"@timestamp\":{\"gte\":%{NUMBER:query_time_gte},\"lte\":%{NUMBER:query_time_lte},\"format\":\"epoch_millis\"}}}" }
      match => { "source_query" => "{\"range\":{\"@timestamp\":{\"to\":\"now-%{DATA:query_time_to}\",\"from\":\"now-%{DATA:query_time_from}\"}}}" }
      match => { "source_query" => "{\"range\":{\"@timestamp\":{\"to\":\"%{DATA:query_time_date_to}\",\"from\":\"%{DATA:query_time_date_from}\"}}}" }
    }
    if [query_time_date_to] {
      ruby { code => "event['query_time_range_minutes'] = (( DateTime.parse(event['query_time_date_to']) - DateTime.parse(event['query_time_date_from']) ) * 24 * 60 ).to_i" }
    }
    if [query_time_gte] {
      ruby { code => "event['query_time_range_minutes'] = (event['query_time_lte'].to_i - event['query_time_gte'].to_i)/1000/60" }
      ruby { code => "event['query_time_date_to'] = DateTime.strptime(event['query_time_lte'],'%Q').to_s" }
      ruby { code => "event['query_time_date_from'] = DateTime.strptime(event['query_time_gte'],'%Q').to_s" }
    }
    mutate {
      add_field => { "statsd_timer_name" => "es.slow_query_duration" }
      add_field => { "statsd_timer_value" => "%{took_millis}" }
      add_tag => [ "metric_timer" ]
    }
    }
  22. LONG TERM ARCHIVING OTHER GOOD IDEAS • Logstash reads from

    kafka and uses the file output plugin to store logs on disk • Use pigz -9 to compress the files • AES encrypt and sync to S3 • Set a lifecycle policy to roll logs into STANDARD_IA after 90 days
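As a rough illustration of the compression step above, here is a stdlib-only Python sketch (mine, not from the talk); the gzip module stands in for pigz -9, and the encryption and S3 steps are left as comments:

```python
import gzip
from pathlib import Path

def archive_log_file(path: str) -> str:
    """Compress a finished logstash output file (stdlib gzip standing
    in for the parallel pigz -9 step described in the talk)."""
    src = Path(path)
    dst = Path(str(src) + ".gz")
    with src.open("rb") as fin, gzip.open(dst, "wb", compresslevel=9) as fout:
        fout.write(fin.read())
    # Remaining steps from the talk (not implemented in this sketch):
    #   AES-encrypt the archive, sync it to S3, and let an S3 lifecycle
    #   policy roll it into STANDARD_IA after 90 days.
    return str(dst)
```

In production you would stream in chunks rather than read the whole file into memory; the point here is only the shape of the pipeline.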
  23. LONG TERM ARCHIVING OTHER GOOD IDEAS • Be careful to

    watch TCP window sizes • Bandwidth delay products can be nasty here • Logstash's default is a fixed window of 64KB, which on a high latency link will painfully limit replication speeds
  24. –Heraclitus of Ephesus “CHANGE IS THE ONLY CONSTANT.”

  25. TYPICAL 5 DAY INDEXING

  26. WHAT'S THIS?

  27. THIS ISN'T A SPIKE

  28. THIS IS THE NEW NORMAL

  29. DESIGN GOALS THE NEW NEW CLUSTER: ELK03 • Support arbitrary

    doublings of log traffic with little notice (3-6 weeks) • Limit cluster node counts • Limit the impact of failure • Elasticsearch 2.x and Kibana 4.x • Kafka lets us run multiple clusters in parallel
  30. None
  31. HEY, CAN YOU SWITCH DATACENTERS PLEASE?

  32. TRIBE NODES THE NEW NEW CLUSTER: ELK03 • Tribe nodes

    act as client nodes that talk to more than one cluster at the same time • They are full members of each cluster, for all the good and bad that entails • Merged search results are good • Tribe nodes have issues that prevent upgrades, etc.
  33. LIMIT CLUSTER NODE COUNT THE NEW NEW CLUSTER: ELK03 •

    Every node in an elasticsearch cluster makes (and holds open) 13 TCP connections to every other node • Every node must ack changes to the cluster • ZenDisco has timeouts that are set for a reason • Using tribe nodes, we now scale ELK tested units rather than growing tiers
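The quadratic cost behind this slide is easy to make concrete. A small Python sketch (mine, not from the talk) computes the open-connection count for a full mesh at the subcluster size described later in the deck:

```python
def transport_connections(nodes: int, per_pair: int = 13) -> int:
    """Total open TCP connections when every node makes (and holds
    open) `per_pair` connections to every other node."""
    return nodes * (nodes - 1) * per_pair

# One elk03 subcluster: 3 masters + 30 data-rw + 40 data-ro + 25 indexers
subcluster = 3 + 30 + 40 + 25   # 98 nodes
print(transport_connections(subcluster))
```

At 98 nodes that is already well over 100,000 sockets, which is why the design scales by adding whole tested subclusters behind tribe nodes instead of growing a single mesh.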
  34. EACH SUBCLUSTER HAS: THE NEW NEW CLUSTER: ELK03 • 3

    master hosts • 30 data-rw hosts • 40 data-ro hosts • a pool of 25 shared indexer hosts
  35. API NODE CONFIGURATION THE NEW NEW CLUSTER: ELK03 • Runs

    logstash processes feeding from kafka • Outputs to local haproxy instance • haproxy load balances across three elasticsearch client nodes • Allows us to trivially remove a cluster from ingestion for maintenance
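The api-node wiring described above (logstash feeding a local haproxy, which load balances across three client nodes) might look roughly like this; the hostnames, ports, and section names are assumptions for illustration, not Fitbit's actual config:

```
frontend es_ingest
    mode http
    bind 127.0.0.1:9201
    default_backend es_clients

backend es_clients
    mode http
    balance roundrobin
    # Three elasticsearch client nodes; disabling a server here (or via
    # the haproxy admin socket) removes that cluster from ingestion
    # without touching the logstash processes.
    server es-client-1 es-client-1:9200 check
    server es-client-2 es-client-2:9200 check
    server es-client-3 es-client-3:9200 check
```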
  36. ES 2.X BENEFITS THE NEW NEW CLUSTER: ELK03 • Performance

    enhancements and bug fixes • Kibana 4 releases are tied to Elasticsearch 2.x releases • Improvements for doc_values and other field related settings
  37. ES 2.X... OPPORTUNITIES THE NEW NEW CLUSTER: ELK03 • Field names

    may no longer contain dots • No Kibana 3 support. Facets were deprecated in Elasticsearch 1.x, and in 2.x they were removed • Lots of testing needed to validate the impact of other changes
  38. A TRILLION LOGS COSTS MONEY CAN WE IMPROVE?

  39. CONSISTENT HASHING THE NEW NEW CLUSTER: ELK03 • 30 days

    of all data is 855 billion documents in the cluster before replication • This costs disk and heap, which costs money • How do you make sure to keep all logs from a single request, and consistently drop the rest?
  40. CONSISTENT HASHING THE NEW NEW CLUSTER: ELK03 • Tag logs

    at the load balancer with a request_id • Add to subsequent logs throughout the stack

    if [request_id] {
      ruby {
        code => "require 'murmurhash3'; event['request_id_sampling'] = (MurmurHash3::V32.str_hash(event['[request_id]']).to_f/4294967294)"
      }
    }
  41. CONSISTENT HASHING THE NEW NEW CLUSTER: ELK03 • Also set

    a flag for "never drop my logs"

    if [request_id_sampling] {
      # Drop a percentage of request_id_sampling (Min: 0.0, Max: 0.999999999)
      if (( [request_id_sampling] < <%= @sampling_rate %> ) and ( [sample_ok] != "no" )) {
        drop {}
      }
    }
  42. CONSISTENT HASHING THE NEW NEW CLUSTER: ELK03 • Apply sparingly!

    • Be clear about what services and logging paths will be sampled • Ideally, store a field in the record with the sampling rate so a consumer of the logs knows what % was being dropped • We drop approximately 1/3 of all log volume • Saves considerably on resource needs and therefore cost
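The two logstash snippets on the preceding slides can be condensed into a minimal stdlib-only Python sketch. MD5 stands in for the MurmurHash3 gem the talk uses (any stable hash works for consistent sampling), and the function names are illustrative:

```python
import hashlib

def sampling_value(request_id: str) -> float:
    """Map a request_id to a stable value in [0, 1). MD5 stands in for
    MurmurHash3 here so the sketch needs only the standard library."""
    digest = hashlib.md5(request_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") / 2**32

def keep_log(request_id: str, sampling_rate: float, sample_ok: str = "yes") -> bool:
    """Drop the event when its hash falls below the sampling rate,
    unless the 'never drop my logs' flag is set."""
    if sample_ok == "no":
        return True
    return sampling_value(request_id) >= sampling_rate
```

Because every log line sharing a request_id hashes to the same value, a sampled request either survives intact across the whole stack or is dropped everywhere, which is the point of hashing rather than sampling randomly per line.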
  43. DISASTER RECOVERY THE NEW NEW CLUSTER: ELK03 • The disaster

    recovery logging cluster drops 99% of log data • Many fewer hosts (1 master, 6 data, 7 api) • Otherwise identical to production • If there's a disaster event, scale up and change the sampling rate
  44. NO DOTS FOR YOU THE NEW NEW CLUSTER: ELK03 "foo":

    "val1"
    "foo": { "bar": "val2" }

    • foo is both a field and a top-level object • The logstash de_dot plugin is not efficient (minimum of 2x CPU use to have it enabled) • Elasticsearch > 2.4.0 has a setting to enable dots again
  45. INTERMISSION

  46. MAPPING RACE OPPORTUNITY #1 • Master nodes have a single

    threaded task queue • All mapping update events go into this queue • So do cluster state changes
  47. BROKEN SHARDS AND TRANSLOGS OPPORTUNITY #1 • Two shards on

    different nodes can try to set the field mapping differently • There is a race condition that allows both to locally set the state, and then try to replay the changes to other shards • Once this happens, the shard that loses the race is considered corrupted, and keeps trying to replicate itself out to other nodes
  48. CLUSTER STATE OPPORTUNITY #1 • Eventually the cluster state mapping

    gets corrupted • Different nodes see different things • The only way out is restarting all the nodes • Simply growing the static mapping means more cluster state to carry (as each index gets its own copy of the static mapping, used fields or not)
  49. SHARD BALANCE OPPORTUNITY #2 • Elasticsearch tries by default to

    balance the shard count per node • Not by size of shard or document count • concurrent_rebalance controls how many shards can move at once • rebalance_threshold controls how "balanced" elasticsearch tries to get the nodes
  50. REFRESH INTERVAL AND YOU OPPORTUNITY #3 • The index refresh_interval

    controls how often new Lucene segments are written and opened for search • Data isn't visible in search until a refresh happens • Frequent refreshes create many small segments, and the resulting merge work takes priority over indexing
  51. PROBLEMS AND SOLUTIONS • Software releases change the ratio of

    tomcat:app:haproxy logs • Not having perfect balance means some nodes get more data of a given type • refresh_interval of 5s means that we are merging segments often, causing memory pressure and stopping indexing
  52. PROBLEMS AND SOLUTIONS • mdc.kafka, mdc.db and mdc.slc get bad

    mappings at 00:00 UTC • QA indexes start to fail shards • Cluster states get corrupted • Partial fixes are done to stabilize things (turn off QA indexing, delete failed indexes) • Restart elk03c cluster (full restart) • Set refresh_interval on all indexes
  53. MINOR THINGS OTHER CHANGES • Shard relocation in Elasticsearch 2.x

    is an emergency task • Moving 1,000 shards from data-rw to data-ro can block other cluster tasks, including mapping updates for new fields and index creation • Only move one index at a time (per cluster)
  54. OPPORTUNITY #4 • Elasticsearch 2.x assumes SSDs • Queries with

    leading wildcards cause Elasticsearch to do a full table scan on the index. • Yes. Read every field. Of every possible match • With SSDs and a small dataset, this causes a >10x performance penalty for a search
  55. *STORMY • With spinning disks and hundreds of billions of

    documents in the cluster it takes hours to read every document • indices.query.query_string.allowLeadingWildcard: false • And then restart every node in the cluster
  56. IMPORTANT KNOBS IN 2.X • indices.query.query_string.allowLeadingWildcard: false • cluster.routing.allocation.balance.threshold: 1.1f

    • indices.memory.index_buffer_size: 50% • bootstrap.mlockall: true • cluster.routing.allocation.cluster_concurrent_rebalance • index.refresh_interval: 15s
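Collected into an elasticsearch.yml fragment as a sketch, with the values shown above; the cluster_concurrent_rebalance value of 2 is the Elasticsearch default, assumed here because the slide names the setting without a value, and several of these can also be applied dynamically via the cluster settings API:

```
indices.query.query_string.allowLeadingWildcard: false
cluster.routing.allocation.balance.threshold: 1.1f
indices.memory.index_buffer_size: 50%
bootstrap.mlockall: true
cluster.routing.allocation.cluster_concurrent_rebalance: 2
index.refresh_interval: 15s
```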
  57. HEY, CAN YOU SWITCH DATACENTERS PLEASE?

  58. THE NEW NEW NEW CLUSTER • Identical architecture to elk03{a..c}

    with fewer, larger servers • Elasticsearch 5.3.3 and Kibana 5.3.1 • Kafka lets us do this seamlessly
  59. ELK05 DASHBOARDS • Dashboards from Elasticsearch 2.x import directly into

    5.x • Using logstash magic, export .kibana-prod and .kibana-qa from elk03a • Publish to Kafka • Subscribe into elk05
  60. ELK05 CARDINALITY ISSUES • Cardinality aggregation ("Unique Count") can be

    expensive in 5.2.x -> 5.4.2 • Buckets pre-allocated before circuit breaker • OOME when user selects Unique Count on userID. (Fitbit has a few unique userIDs) • https://github.com/elastic/elasticsearch/issues/24359
  61. ELK05 CARDINALITY ISSUES • Testing new versions of Elasticsearch takes

    time to validate in a cluster this size • 5.3.3 is considered unsupported at this point, so there is no planned fix • Edit src/ui/public/agg_types/index.js and remove AggTypesMetricsCardinalityProvider for now • Begin evaluation of 5.4.2 and/or 5.5.0
  62. QUESTIONS? • @bwdezend • breandan@42lines.net • http://operations.fm