
Monitoring Elasticsearch for Fun, Profit and Not Getting Woken Up at 3am


Given at the first Elasticsearch Sydney meetup in November 2013.

Sharif Olorin

November 21, 2013


Transcript

  1. n commandments for cluster monitoring
     • Monitor everything.
     • Only alert on metrics the customer (dev team/users/et cetera) cares about.
     • Metrics should be defined for the cluster as a whole, not just individual nodes.
     • Automate recovery wherever possible.
  2. Monitor ALL the things
     • You never have enough data.
     • The stats API is your friend.
     • Store everything.
  3. Cluster-wide metrics
     • Necessary for checks such as split-brain detection.
     • Not well-suited to host-based monitoring systems.
  4. Shockingly, this type of check doesn't work too well without concurrency

     topologies := make([]string, nNodes)
     masters := make(map[string]bool)
     c := make(chan string, nNodes)
     for _, node := range nodes {
         go getTopology(node, c)
     }
     for i := range nodes {
         topologies[i] = <-c
     }
     masterList := make([]string, 0)
     for _, topology := range topologies {
         topologyMaster := getMaster(topology, nodes)
         if _, ok := masters[topologyMaster]; !ok {
             masterList = append(masterList, topologyMaster)
         }
         masters[topologyMaster] = true
     }
  5. On not getting woken up at 3am
     • Loss of redundancy is not (always) a failure condition.
       – So you don't need to alert whenever you lose a node.
       – Automating alert response can reduce operational workload and human error.

     Split-brain alert?
     - log into every node in the cluster
     - set cluster.blocks.read_only = true
     - send out a notification
     - sysadmin rectifies the partition
     - minimal data loss
  6. Benchmarks and performance tuning
     • Uncontrolled tests are worse than useless.
     • So act like data scientists:
       – actual production queries
       – one change at a time
       – every benchmark should be reproducible

     Sometimes graphs aren't enough.
  7. Benchmarks and performance tuning

     └─> cat *_facet_results_times | numbers
     1q:     239.000000
     3q:     537.500000
     99p:    962.500000
     mean:   405.655410
     median: 335.000000
     std:    208.026379
     var:    43274.974297

     Sometimes you need the actual numbers.