incident-debugging.pdf

Incident Debugging

Workflow
- Instrument the service and infrastructure layers
- Set up SLIs that reflect the overall health of the application
- Set up SLOs that reflect the amount of downtime you are willing to tolerate
- Respond to alerts when they are triggered
- After recovery, assess how effective the instrumentation was and plan refinements if necessary
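
To make the link between an SLO and "the amount of downtime" concrete, here is a minimal sketch that converts an availability target into a downtime budget. The function name `allowed_downtime`, the 99.9% target, and the 30-day window are illustrative assumptions, not values from the deck.

```python
from datetime import timedelta


def allowed_downtime(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime budget implied by an availability SLO over a window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    # The budget is the share of the window that is allowed to be "bad".
    return window * (1.0 - slo_target)


if __name__ == "__main__":
    # Hypothetical target and window, chosen only for illustration.
    print(allowed_downtime(0.999, timedelta(days=30)))
```

A tighter target shrinks the budget linearly, which is why the SLO should be chosen from the downtime you can actually accept rather than the other way around.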

Debugging latency & 5xx errors
When the SloLatencyTooHigh alert goes off, first steps:
- Get the 99th and 95th percentile response times, broken down by service, HTTP method, and status_code
- Check the latency and resource utilization of upstream service dependencies
- Check outbound TCP connections: request size and rate
- Check resource usage on the node the service runs on
- If the node is saturated, DNS queries will either be slow or fail
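
The first step above, percentiles broken down by service, HTTP method, and status code, would normally be a query against the metrics backend. As a backend-independent sketch, the same breakdown can be computed from raw samples. The sample tuple layout and the function names are assumptions for illustration, and the nearest-rank method is one of several percentile definitions.

```python
import math
from collections import defaultdict


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list (p in 1..100)."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def latency_breakdown(samples, percentiles=(95, 99)):
    """Group (service, method, status_code, latency) samples and compute percentiles.

    Returns {(service, method, status_code): {percentile: latency}}.
    """
    groups = defaultdict(list)
    for service, method, status_code, latency in samples:
        groups[(service, method, status_code)].append(latency)
    return {
        key: {p: percentile(sorted(values), p) for p in percentiles}
        for key, values in groups.items()
    }


if __name__ == "__main__":
    # Hypothetical samples: latencies in milliseconds for a made-up service.
    demo = [("checkout", "GET", 200, ms) for ms in range(1, 101)]
    demo.append(("checkout", "POST", 500, 900))
    for key, stats in latency_breakdown(demo).items():
        print(key, stats)
```

Keeping status_code in the grouping key separates slow successful requests from slow 5xx responses, which usually point at different causes.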

DNS issues in k8s

    lookup("server", "A") // all lookups below are triggered by a single lookup
    lookup("server", "AAAA")
    lookup("server.platform.svc.cluster.local", "A")
    lookup("server.platform.svc.cluster.local", "AAAA")
    lookup("server.svc.cluster.local", "A")
    lookup("server.svc.cluster.local", "AAAA")
    lookup("server.cluster.local", "A")
    lookup("server.cluster.local", "AAAA")

    # cat /etc/resolv.conf
    nameserver 169.254.20.10
    search platform.svc.cluster.local svc.cluster.local cluster.local
    options ndots:5 timeout:1 attempts:5
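
The fan-out above follows from the search list and ndots settings in resolv.conf. The sketch below reproduces the count. `expand_lookups` is a hypothetical helper, and the candidate ordering follows the usual stub-resolver rule (search list first when the name has fewer dots than ndots), which is an assumption here and may differ from the order shown on the slide. It models the worst case, where no earlier candidate resolves.

```python
def expand_lookups(name, search, ndots, record_types=("A", "AAAA")):
    """List the (qname, record type) pairs a stub resolver may send for one name."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        candidates = [name.rstrip(".")]
    else:
        searched = [f"{name}.{domain}" for domain in search]
        if name.count(".") >= ndots:
            candidates = [name] + searched  # enough dots: try the name as-is first
        else:
            candidates = searched + [name]  # too few dots: search list first
    return [(qname, rtype) for qname in candidates for rtype in record_types]


if __name__ == "__main__":
    search = ["platform.svc.cluster.local", "svc.cluster.local", "cluster.local"]
    lookups = expand_lookups("server", search, ndots=5)
    for qname, rtype in lookups:
        print(qname, rtype)
    print(len(lookups), "lookups")
```

Under this model, a name with a trailing dot is looked up as-is, which gives 2 lookups instead of 8.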

Conntrack
- Kernel module that tracks inbound and outbound network connections, including UDP flows
- Table format: src + dst IP, src + dst port, and connection state

    conntrack -L
    tcp ESTABLISHED src=172.24.110.67 dst=172.24.124.188 sport=29290 dport=31520 src=172.24.44.248 dst=172.24.124.188 sport=80 dport=29290
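
To read the table format above, it helps to split an entry into its two tuples: the original direction and the expected reply. The parser below is a rough sketch for the simplified line shown here. `parse_conntrack_entry` is a hypothetical helper, and it assumes the second occurrence of a key such as `src` starts the reply tuple. Real `conntrack -L` output carries additional fields, which this sketch ignores.

```python
TUPLE_KEYS = ("src", "dst", "sport", "dport")


def parse_conntrack_entry(line):
    """Split a simplified conntrack line into protocol, state, and two tuples."""
    tokens = line.split()
    entry = {"protocol": tokens[0], "state": None, "original": {}, "reply": {}}
    direction = "original"
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if sep and key in TUPLE_KEYS:
            if key in entry[direction]:
                direction = "reply"  # a repeated key starts the reply tuple
            entry[direction][key] = value
        elif token.isalpha() and token.isupper():
            entry["state"] = token  # e.g. ESTABLISHED
    return entry


if __name__ == "__main__":
    line = (
        "tcp ESTABLISHED src=172.24.110.67 dst=172.24.124.188 sport=29290 "
        "dport=31520 src=172.24.44.248 dst=172.24.124.188 sport=80 dport=29290"
    )
    print(parse_conntrack_entry(line))
```

In this entry the reply tuple has a different source address and port than the original destination, which is what an address-translated connection looks like in the table.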

The Bug
- There is a race condition in conntrack
- It is triggered when two packets are sent via the same socket at the same time
- Packets get dropped
- The DNS lookup remains in a waiting state until it times out
- Only happens for UDP

Watch out: AWS VPC Limits
- Maximum of 1024 packets per second per network interface to the AWS DNS server
- Remember that a single DNS resolution produced 8 lookups
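
Combining the two numbers above gives a rough ceiling on name resolutions per interface. This is back-of-the-envelope arithmetic under stated assumptions: every lookup costs exactly one packet to the VPC resolver, nothing is cached, and there are no retries. The function name is hypothetical.

```python
def max_resolutions_per_second(packet_limit=1024, lookups_per_resolution=8):
    """Upper bound on resolutions per second per interface, one packet per lookup."""
    if lookups_per_resolution <= 0:
        raise ValueError("lookups_per_resolution must be positive")
    return packet_limit // lookups_per_resolution


if __name__ == "__main__":
    print(max_resolutions_per_second())       # 8 lookups per resolution -> 128
    print(max_resolutions_per_second(1024, 2))  # 2 lookups per resolution -> 512
```

Retries triggered by the timeout and attempts options would lower the real ceiling further, since each retry also counts against the packet limit.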

Solution

CDN Cache Miss
- Latency can increase significantly when a cache rule is missing in the CDN

Analyzing Prometheus Cardinality
https://www.robustperception.io/using-tsdb-analyze-to-investigate-churn-and-cardinality
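
As an illustration of what a cardinality analysis looks for, the sketch below counts series per metric name and distinct values per label from a list of label sets. It is not the tool from the linked article. `cardinality_report` and the sample series are hypothetical, and each series is assumed to be a dict of labels that includes `__name__`.

```python
from collections import Counter, defaultdict


def cardinality_report(series):
    """Return (series count per metric name, distinct value count per label)."""
    series_per_metric = Counter(labels["__name__"] for labels in series)
    values_per_label = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            if name != "__name__":
                values_per_label[name].add(value)
    distinct_values = {name: len(values) for name, values in values_per_label.items()}
    return series_per_metric, distinct_values


if __name__ == "__main__":
    # Hypothetical series; a per-request label like "path" inflates cardinality.
    demo = [
        {"__name__": "http_requests_total", "method": "GET", "path": "/users/1"},
        {"__name__": "http_requests_total", "method": "GET", "path": "/users/2"},
        {"__name__": "http_requests_total", "method": "POST", "path": "/users"},
        {"__name__": "up", "job": "api"},
    ]
    per_metric, per_label = cardinality_report(demo)
    print(per_metric.most_common())
    print(sorted(per_label.items(), key=lambda item: item[1], reverse=True))
```

Labels whose distinct-value count grows with traffic, such as raw paths or user IDs, are the usual candidates to drop or aggregate.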

Thank you


Rafael Jesus
