Monitoring Elasticsearch for Fun, Profit and Not Getting Woken Up at 3am

Slide 1

Slide 1 text

Beyond Traffic Lights: lessons learned monitoring Elasticsearch for fun, profit and not getting woken up at 03:00

Slide 2

Slide 2 text

n commandments for cluster monitoring
● Monitor everything.
● Only alert on metrics the customer (dev team/users/et cetera) cares about.
● Metrics should be defined for the cluster as a whole, not just individual nodes.
● Automate recovery wherever possible.

Slide 3

Slide 3 text

Monitor ALL the things
● You never have enough data.
● The stats API is your friend.
● Store everything.
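One way to "store everything" is to flatten whatever JSON the stats API returns into dotted metric names that any time-series store can ingest. The Go sketch below is only an illustration: the `flatten` helper and the trimmed sample document are assumptions made for the example, not part of the talk, and a real collector would fetch the document from the cluster rather than hard-code it.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// flatten walks a decoded JSON value and records every numeric leaf
// under a dotted path such as "nodes.n1.jvm.mem.heap_used_percent".
// Strings, booleans, arrays and nulls are skipped in this sketch.
func flatten(prefix string, v interface{}, out map[string]float64) {
	switch t := v.(type) {
	case map[string]interface{}:
		for k, child := range t {
			key := k
			if prefix != "" {
				key = prefix + "." + k
			}
			flatten(key, child, out)
		}
	case float64:
		out[prefix] = t
	}
}

func main() {
	// Hypothetical, heavily trimmed stand-in for a stats API response.
	sample := []byte(`{"nodes":{"n1":{"name":"es1","jvm":{"mem":{"heap_used_percent":42}}}}}`)

	var doc map[string]interface{}
	if err := json.Unmarshal(sample, &doc); err != nil {
		panic(err)
	}

	metrics := make(map[string]float64)
	flatten("", doc, metrics)

	// Sort the names so the output order is stable.
	names := make([]string, 0, len(metrics))
	for name := range metrics {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s %g\n", name, metrics[name])
	}
}
```

Keeping every numeric leaf, rather than a hand-picked subset, means a metric nobody thought to collect is still there when it is needed later.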

Slide 4

Slide 4 text

Cluster-wide metrics
● Necessary for, e.g., splitbrain checks
● Not well-suited for host-based monitoring systems
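As a rough illustration of a metric defined for the cluster as a whole, the Go sketch below folds per-node readings into a single cluster-level summary. The `ClusterMetric` type, the `aggregate` function and the sample query rates are all hypothetical names and numbers chosen for the example.

```go
package main

import "fmt"

// ClusterMetric summarises one metric across every node in the cluster.
type ClusterMetric struct {
	Sum, Min, Max float64
	Nodes         int
}

// aggregate folds per-node readings (keyed by node name) into one
// cluster-wide summary. An empty input yields the zero value.
func aggregate(perNode map[string]float64) ClusterMetric {
	var m ClusterMetric
	first := true
	for _, v := range perNode {
		if first || v < m.Min {
			m.Min = v
		}
		if first || v > m.Max {
			m.Max = v
		}
		m.Sum += v
		m.Nodes++
		first = false
	}
	return m
}

func main() {
	// Hypothetical queries-per-second readings for three nodes.
	qps := map[string]float64{"es1": 120, "es2": 80, "es3": 100}
	m := aggregate(qps)
	fmt.Printf("nodes=%d sum=%g min=%g max=%g\n", m.Nodes, m.Sum, m.Min, m.Max)
}
```

Alerting on the summed or worst-case value, instead of on each host separately, keeps the alert tied to what the cluster's users actually experience.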

Slide 5

Slide 5 text

Detecting splitbrains
[Diagram: nagios, check_splitbrain, and per-node check_topology checks reached via NRPE on es1, es2, [ … ]]

Slide 6

Slide 6 text

Shockingly, this type of check doesn't work too well without concurrency

    topologies := make([]string, nNodes)
    masters := make(map[string]bool)
    c := make(chan string, nNodes)
    // Ask every node for its view of the cluster concurrently.
    for _, node := range nodes {
        go getTopology(node, c)
    }
    // Collect one topology per node as the replies arrive.
    for i := range nodes {
        topologies[i] = <-c
    }
    // Record each distinct master; more than one means a splitbrain.
    masterList := make([]string, 0)
    for _, topology := range topologies {
        topologyMaster := getMaster(topology, nodes)
        if _, ok := masters[topologyMaster]; !ok {
            masterList = append(masterList, topologyMaster)
        }
        masters[topologyMaster] = true
    }
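The snippet above leaves getTopology and getMaster undefined. The following self-contained Go sketch shows the same fan-out/fan-in idea under simplifying assumptions: the hypothetical `fetchMaster` callback stands in for whatever asks a node which master it currently sees, and `distinctMasters` is an illustrative name, not the author's code.

```go
package main

import (
	"fmt"
	"sort"
)

// fetchMaster is a hypothetical stand-in for querying one node and
// returning the name of the master that node currently sees.
type fetchMaster func(node string) string

// distinctMasters queries every node concurrently and returns the
// sorted set of masters reported across the cluster.
func distinctMasters(nodes []string, fetch fetchMaster) []string {
	c := make(chan string, len(nodes))
	for _, node := range nodes {
		go func(n string) { c <- fetch(n) }(node)
	}

	seen := make(map[string]bool)
	for range nodes {
		seen[<-c] = true
	}

	masters := make([]string, 0, len(seen))
	for m := range seen {
		masters = append(masters, m)
	}
	sort.Strings(masters)
	return masters
}

func main() {
	nodes := []string{"es1", "es2", "es3"}
	// Hypothetical partition: es3 has elected itself master.
	view := map[string]string{"es1": "es1", "es2": "es1", "es3": "es3"}

	masters := distinctMasters(nodes, func(n string) string { return view[n] })
	if len(masters) > 1 {
		fmt.Println("CRITICAL: splitbrain, masters seen:", masters)
	} else {
		fmt.Println("OK: masters seen:", masters)
	}
}
```

Querying the nodes in parallel matters because the check compares views that should be taken at roughly the same moment; a slow sequential sweep could report a normal master re-election as a partition.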

Slide 7

Slide 7 text

On not getting woken up at 3am
● Loss of redundancy is not (always) a failure condition.
  – So you don't need to alert whenever you lose a node.
  – Automating alert response can reduce operational workload and human error.

Splitbrain alert?
- log into every node in the cluster
- set cluster.blocks.read_only = true
- send out a notification
- sysadmin rectifies the partition
- minimal data loss
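The "set cluster.blocks.read_only = true" step could be automated along these lines. This Go sketch only builds one cluster-settings update request per node and does not send anything; the node names, port 9200, plain HTTP, and the choice of a transient rather than persistent setting are assumptions made for the example, and a real responder would also need authentication, timeouts and error handling.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// readOnlyBody returns the JSON body that switches the cluster to
// read-only. Using a transient setting is an assumption of this sketch.
func readOnlyBody() ([]byte, error) {
	settings := map[string]map[string]bool{
		"transient": {"cluster.blocks.read_only": true},
	}
	return json.Marshal(settings)
}

// readOnlyRequests builds one settings update request per node, since
// during a partition each side keeps its own cluster state.
// The requests are returned unsent so the caller decides how to run them.
func readOnlyRequests(nodes []string) ([]*http.Request, error) {
	body, err := readOnlyBody()
	if err != nil {
		return nil, err
	}
	reqs := make([]*http.Request, 0, len(nodes))
	for _, node := range nodes {
		// Port 9200 and plain HTTP are assumptions for illustration.
		url := fmt.Sprintf("http://%s:9200/_cluster/settings", node)
		req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
		if err != nil {
			return nil, err
		}
		req.Header.Set("Content-Type", "application/json")
		reqs = append(reqs, req)
	}
	return reqs, nil
}

func main() {
	reqs, err := readOnlyRequests([]string{"es1", "es2"})
	if err != nil {
		panic(err)
	}
	for _, r := range reqs {
		fmt.Println(r.Method, r.URL.String())
	}
}
```

Blocking writes first and paging a human second is what keeps the data loss small: the partition stops diverging while the sysadmin works out which side to keep.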

Slide 8

Slide 8 text

Benchmarks and performance tuning
● Uncontrolled tests are worse than useless.
● So act like data scientists.
  – actual production queries
  – one change at a time
  – every benchmark should be reproducible

Sometimes graphs aren't enough.

Slide 9

Slide 9 text

Benchmarks and performance tuning

    └─> cat *_facet_results_times | numbers
    1q: 239.000000
    3q: 537.500000
    99p: 962.500000
    mean: 405.655410
    median: 335.000000
    std: 208.026379
    var: 43274.974297

Sometimes you need the actual numbers.
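A summary like the one above can be computed in a few lines of Go. This sketch is not the `numbers` tool from the slide: the linear-interpolation percentile and the population (divide by n) variance are assumptions, so its figures may differ slightly from that tool's on the same input, and the sample timings are made up.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile (0 to 100) of already-sorted
// data, interpolating linearly between the two closest ranks.
func percentile(sorted []float64, p float64) float64 {
	if len(sorted) == 0 {
		return math.NaN()
	}
	rank := p / 100 * float64(len(sorted)-1)
	lo := int(math.Floor(rank))
	hi := int(math.Ceil(rank))
	frac := rank - float64(lo)
	return sorted[lo] + frac*(sorted[hi]-sorted[lo])
}

// meanAndVariance returns the mean and the population variance.
func meanAndVariance(xs []float64) (mean, variance float64) {
	if len(xs) == 0 {
		return math.NaN(), math.NaN()
	}
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	for _, x := range xs {
		d := x - mean
		variance += d * d
	}
	variance /= float64(len(xs))
	return mean, variance
}

func main() {
	// Hypothetical query times in milliseconds.
	times := []float64{500, 200, 1000, 300, 400}
	sort.Float64s(times)

	mean, variance := meanAndVariance(times)
	fmt.Printf("1q: %f\n", percentile(times, 25))
	fmt.Printf("3q: %f\n", percentile(times, 75))
	fmt.Printf("99p: %f\n", percentile(times, 99))
	fmt.Printf("mean: %f\n", mean)
	fmt.Printf("median: %f\n", percentile(times, 50))
	fmt.Printf("std: %f\n", math.Sqrt(variance))
	fmt.Printf("var: %f\n", variance)
}
```

Reporting the quartiles and the 99th percentile alongside the mean shows the long tail that an average or a graph can hide.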

Slide 10

Slide 10 text

Fin.
Queries?
Sharif Olorin [email protected]