Monitoring Elasticsearch for Fun, Profit and Not Getting Woken Up at 3am

Beyond Traffic Lights
Lessons learned monitoring Elasticsearch for fun, profit and not getting woken up at 03:00

n commandments for cluster monitoring
• Monitor everything.
• Only alert on metrics the customer (dev team, users, et cetera) cares about.
• Metrics should be defined for the cluster as a whole, not just individual nodes.
• Automate recovery wherever possible.

Monitor ALL the things
• You never have enough data.
• The stats API is your friend.
• Store everything.
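One way to "store everything" is to flatten a stats response into dotted metric names and ship every numeric leaf to whatever time-series store is in use. The following is a minimal sketch, not the speaker's tooling: the function names are hypothetical, and it assumes the response body has already been fetched (for example from the nodes stats endpoint).

```go
package main

import "encoding/json"

// flatten walks a decoded JSON document and records every numeric leaf
// under a dotted key, e.g. "nodes.es1.jvm.mem.heap_used_in_bytes".
// Strings, booleans, arrays and nulls are skipped in this sketch.
func flatten(prefix string, v interface{}, out map[string]float64) {
	switch t := v.(type) {
	case map[string]interface{}:
		for k, child := range t {
			key := k
			if prefix != "" {
				key = prefix + "." + k
			}
			flatten(key, child, out)
		}
	case float64:
		out[prefix] = t
	}
}

// statsToMetrics (hypothetical name) parses a stats response body into a
// flat map of metric name to value.
func statsToMetrics(body []byte) (map[string]float64, error) {
	var doc map[string]interface{}
	if err := json.Unmarshal(body, &doc); err != nil {
		return nil, err
	}
	out := make(map[string]float64)
	flatten("", doc, out)
	return out, nil
}
```

Keeping every numeric leaf rather than a hand-picked subset means a metric nobody thought to graph is still there when it is needed after an incident.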

Cluster-wide metrics
• Necessary for e.g. splitbrain checks
• Not well-suited for host-based monitoring systems
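As a rough illustration of a metric defined for the cluster as a whole, per-node values can be rolled up into one summary before any alerting decision is made. This is only a sketch; the type and function names are hypothetical, and which roll-up matters (sum, min, max) depends on the metric.

```go
package main

// ClusterRollup is a hypothetical cluster-level summary of one per-node metric.
type ClusterRollup struct {
	Nodes         int
	Sum, Min, Max float64
}

// rollUp collapses a map of node name to metric value into a single
// cluster-wide summary. An empty map yields the zero value.
func rollUp(perNode map[string]float64) ClusterRollup {
	var r ClusterRollup
	for _, v := range perNode {
		if r.Nodes == 0 || v < r.Min {
			r.Min = v
		}
		if r.Nodes == 0 || v > r.Max {
			r.Max = v
		}
		r.Nodes++
		r.Sum += v
	}
	return r
}
```

An alert would then be written against, say, `Sum` or `Nodes` for the cluster rather than against any single host.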

Detecting splitbrains
(Diagram: nagios → check_splitbrain → NRPE → check_topology on each of es1, es2, [ … ])

Shockingly, this type of check doesn't work too well without concurrency

    topologies := make([]string, nNodes)
    masters := make(map[string]bool)
    c := make(chan string, nNodes)

    // Query every node at once; each goroutine sends its view on c.
    for _, node := range nodes {
        go getTopology(node, c)
    }
    // Collect one topology per node, in whatever order they arrive.
    for i := range nodes {
        topologies[i] = <-c
    }

    // Build the list of distinct masters seen across all topologies.
    masterList := make([]string, 0)
    for _, topology := range topologies {
        topologyMaster := getMaster(topology, nodes)
        if _, ok := masters[topologyMaster]; !ok {
            masterList = append(masterList, topologyMaster)
        }
        masters[topologyMaster] = true
    }
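The fragment above leaves getTopology and getMaster undefined. A self-contained sketch of the same idea follows, under some assumptions: the per-node query is passed in as a function (`fetch`, a hypothetical stand-in for whatever check_topology does over NRPE), nodes that fail to answer are reported separately, and "more than one distinct master" is treated as the splitbrain condition.

```go
package main

import "sort"

// fetchMaster is a hypothetical stand-in for asking one node which
// master it currently believes in.
type fetchMaster func(node string) (string, error)

type result struct {
	node, master string
	err          error
}

// distinctMasters queries all nodes concurrently and returns the sorted
// set of masters seen, plus the sorted list of nodes that failed to answer.
func distinctMasters(nodes []string, fetch fetchMaster) (masters, failed []string) {
	c := make(chan result, len(nodes)) // buffered so no goroutine blocks
	for _, node := range nodes {
		go func(n string) {
			m, err := fetch(n)
			c <- result{n, m, err}
		}(node)
	}
	seen := make(map[string]bool)
	for range nodes {
		r := <-c
		if r.err != nil {
			failed = append(failed, r.node)
			continue
		}
		if !seen[r.master] {
			seen[r.master] = true
			masters = append(masters, r.master)
		}
	}
	sort.Strings(masters)
	sort.Strings(failed)
	return masters, failed
}

// isSplitbrain reports whether the nodes disagree about the master.
func isSplitbrain(masters []string) bool {
	return len(masters) > 1
}
```

Querying concurrently keeps the snapshot of each node's view as close to simultaneous as possible, and a slow or dead node delays the check by at most one timeout instead of one timeout per node (the timeout itself would live inside `fetch`).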

On not getting woken up at 3am
• Loss of redundancy is not (always) a failure condition.
  – So you don't need to alert whenever you lose a node.
  – Automating alert response can reduce operational workload and human error.

Splitbrain alert?
  - log into every node in the cluster
  - set cluster.blocks.read_only = true
  - send out a notification
  - sysadmin rectifies the partition
  - minimal data loss
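The "set cluster.blocks.read_only = true" step could be automated along the following lines. This is a sketch under assumptions, not the speaker's actual handler: it uses the cluster settings endpoint with a transient setting, the function names and the base URL are hypothetical, and it only prepares the request. Sending it to each node in turn, and the notification step, are left out.

```go
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// readOnlyBody builds the JSON payload that sets (or clears) the
// cluster-wide read-only block as a transient setting.
func readOnlyBody(readOnly bool) ([]byte, error) {
	payload := map[string]map[string]bool{
		"transient": {"cluster.blocks.read_only": readOnly},
	}
	return json.Marshal(payload)
}

// readOnlyRequest prepares, but does not send, a PUT to the cluster
// settings endpoint of one node. baseURL is an assumption, for example
// "http://es1:9200".
func readOnlyRequest(baseURL string, readOnly bool) (*http.Request, error) {
	body, err := readOnlyBody(readOnly)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPut, baseURL+"/_cluster/settings", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}
```

Because each side of a partition has its own idea of the cluster state, the request would need to reach every node, which matches the "log into every node" step above. Once the partition is fixed, the same request with `false` lifts the block.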

Benchmarks and performance tuning
• Uncontrolled tests are worse than useless.
• So act like data scientists.
  – actual production queries
  – one change at a time
  – every benchmark should be reproducible

Sometimes graphs aren't enough.

Benchmarks and performance tuning

    └─> cat *_facet_results_times | numbers
    1q:     239.000000
    3q:     537.500000
    99p:    962.500000
    mean:   405.655410
    median: 335.000000
    std:    208.026379
    var:    43274.974297

Sometimes you need the actual numbers.
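The same kind of summary can be computed from a list of query times in a few lines. The sketch below is not the `numbers` tool itself. It assumes percentiles are linearly interpolated between closest ranks and that the variance is the population variance; the tool shown above may define both differently.

```go
package main

import (
	"math"
	"sort"
)

// Summary holds the statistics printed on the slide above.
type Summary struct {
	Q1, Median, Q3, P99, Mean, Var, Std float64
}

// percentile interpolates linearly between the two closest ranks of an
// already sorted slice. This definition is an assumption.
func percentile(sorted []float64, p float64) float64 {
	if len(sorted) == 0 {
		return math.NaN()
	}
	rank := p / 100 * float64(len(sorted)-1)
	lo := int(math.Floor(rank))
	hi := int(math.Ceil(rank))
	return sorted[lo] + (rank-float64(lo))*(sorted[hi]-sorted[lo])
}

// summariseTimes (hypothetical name) computes quartiles, the 99th
// percentile, mean, population variance and standard deviation.
func summariseTimes(xs []float64) Summary {
	sorted := append([]float64(nil), xs...) // copy so the input is not reordered
	sort.Float64s(sorted)
	n := float64(len(sorted))
	var sum float64
	for _, x := range sorted {
		sum += x
	}
	mean := sum / n
	var ss float64
	for _, x := range sorted {
		ss += (x - mean) * (x - mean)
	}
	variance := ss / n // population variance: an assumption
	return Summary{
		Q1:     percentile(sorted, 25),
		Median: percentile(sorted, 50),
		Q3:     percentile(sorted, 75),
		P99:    percentile(sorted, 99),
		Mean:   mean,
		Var:    variance,
		Std:    math.Sqrt(variance),
	}
}
```

Comparing these figures before and after a single change says more than eyeballing two graphs, particularly for the tail (99p), which a mean hides.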

Fin. Queries? Sharif Olorin [email protected]
