alert on metrics the customer (dev team/users/et cetera) cares about. • Metrics should be defined for the cluster as a whole, not just individual nodes. • Automate recovery wherever possible.
redundancy is not (always) a failure condition. – So you don't need to alert whenever you lose a node. – Automating alert response can reduce operational workload and human error. Splitbrain alert? - log into every node in the cluster - set cluster.blocks.read_only = true - send out a notification - sysadmin rectifies the partition - minimal data loss
useless. • So act like data scientists. – actual production queries – one change at a time – every benchmark should be reproducible Sometimes graphs aren't enough.