reflects the overall health of the application
Set up SLOs that reflect the amount of downtime you aim to support
Respond to alerts when they are triggered
After recovery, assess how effective the instrumentation was and plan refinements if necessary
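To make the link between an SLO and downtime concrete, here is a minimal sketch that converts an availability target into a downtime budget. The function name `downtime_budget` and the 30-day window are assumptions for illustration, not something these notes prescribe.

```python
from datetime import timedelta


def downtime_budget(slo_percent, window_days):
    """Return the downtime an availability SLO allows over a window (hypothetical helper)."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window = timedelta(days=window_days)
    # The budget is the fraction of the window that is NOT covered by the SLO.
    return window * (1 - slo_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% over 30 days:", downtime_budget(target, 30))
```

For example, a 99.9% target over 30 days leaves 0.1% of the window, which is roughly 43 minutes of downtime.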
steps:
Get the 99th and 95th percentile response times, broken down by service, HTTP method, and status_code
Check the latency and resource utilization of upstream service dependencies
Check the request size and rate of outbound TCP connections
Check resource usage on the node the service runs on
If the node is saturated, DNS queries will either be slow or fail
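The first step above can be sketched offline as follows, assuming request samples are available as `(service, method, status_code, duration_ms)` tuples. That record shape, the function names, and the nearest-rank percentile definition are all assumptions made for this example; a monitoring backend would normally compute these percentiles from histograms instead.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_breakdown(samples, percentiles=(95, 99)):
    """Group samples by (service, method, status_code) and summarize each group."""
    groups = defaultdict(list)
    for service, method, status_code, duration_ms in samples:
        groups[(service, method, status_code)].append(duration_ms)
    return {
        key: {f"p{p}": percentile(durations, p) for p in percentiles}
        for key, durations in groups.items()
    }
```

Grouping before computing percentiles matters here: a slow `POST` path returning 500s can be hidden entirely when all requests are pooled into one distribution.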
two packets are sent via the same socket at the same time
Packets get dropped
The DNS lookup remains in a waiting state until it times out
Only happens for UDP
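The client-side symptom described above (a UDP lookup that waits until its timeout because no reply ever arrives) can be reproduced with a small sketch. It does not reproduce the race itself, only what the caller observes when a packet is lost. The helper names, the fixed query ID, and the timeout value are assumptions for illustration; the query layout follows the standard DNS wire format.

```python
import socket
import struct
import time


def build_query(name, query_id):
    """Build a minimal DNS query for an A record."""
    # Header: id, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is length-prefixed; a zero byte terminates the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.strip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def udp_lookup(server, port, name, timeout_s):
    """Send one query over UDP; return (reply_or_None, elapsed_seconds)."""
    query = build_query(name, query_id=0x1234)
    start = time.monotonic()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        sock.sendto(query, (server, port))
        try:
            reply, _ = sock.recvfrom(512)
        except socket.timeout:
            reply = None  # No answer arrived: the lookup waited out the full timeout.
    return reply, time.monotonic() - start
```

Because UDP has no retransmission of its own, a dropped packet is invisible to the client until the timeout expires, which is why the added latency tends to match the resolver's timeout setting.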