Slide 1

Slide 1 text

Incident Debugging

Slide 2

Slide 2 text

Workflow
- Instrument the service and infrastructure layers
- Set up SLIs that reflect the overall health of the application
- Set up SLOs that reflect the amount of downtime you are willing to tolerate
- Respond to alerts when they are triggered
- After recovery, assess how effective the instrumentation was and plan refinements if necessary
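To make the link between an SLO and tolerated downtime concrete, here is a minimal sketch of the usual error-budget arithmetic. The function name, the 30-day window and the 99.9% target are illustrative assumptions, not values from the talk.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime budget, in minutes, implied by an availability SLO.

    slo_percent and window_days are hypothetical inputs chosen for illustration.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    # The error budget is the fraction of the window that may be unavailable.
    return window_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43 minutes of downtime.
    print(round(allowed_downtime_minutes(99.9), 1))
```

Tightening the target by one "nine" shrinks the budget tenfold, which is why the SLO should match the downtime the team can realistically support.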

Slide 3

Slide 3 text

Debugging latency & 5xx errors
When the SloLatencyTooHigh alert goes off, first steps:
- Get the 99th and 95th percentile response times, broken down by service, HTTP method and status_code
- Check the latency and resource utilization of upstream service dependencies
- Check the request size and rate of outbound TCP connections
- Check resource usage on the node the service runs on
- If the node is saturated, DNS queries will either be slow or fail
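The first step above can be sketched offline as follows, assuming request samples are available as dictionaries. The field names (`service`, `method`, `status_code`, `duration_s`) and the nearest-rank percentile definition are assumptions for illustration; a monitoring system would normally compute this from histograms instead.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_breakdown(requests, percentiles=(95, 99)):
    """Group request durations by (service, method, status_code) and compute percentiles per group."""
    groups = defaultdict(list)
    for req in requests:
        key = (req["service"], req["method"], req["status_code"])
        groups[key].append(req["duration_s"])
    return {
        key: {p: percentile(durations, p) for p in percentiles}
        for key, durations in groups.items()
    }


if __name__ == "__main__":
    # Hypothetical samples: 100 successful GETs and one slow 5xx response.
    samples = [
        {"service": "api", "method": "GET", "status_code": 200, "duration_s": i / 1000}
        for i in range(1, 101)
    ]
    samples.append({"service": "api", "method": "GET", "status_code": 503, "duration_s": 2.5})
    for key, stats in latency_breakdown(samples).items():
        print(key, stats)
```

Splitting by status code keeps a handful of slow 5xx responses from hiding inside an otherwise healthy aggregate.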

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

DNS issues in k8s

lookup("server", "A") // all lookups below are triggered by a single lookup
lookup("server", "AAAA")
lookup("server.platform.svc.cluster.local", "A")
lookup("server.platform.svc.cluster.local", "AAAA")
lookup("server.svc.cluster.local", "A")
lookup("server.svc.cluster.local", "AAAA")
lookup("server.cluster.local", "A")
lookup("server.cluster.local", "AAAA")

# cat /etc/resolv.conf
nameserver 169.254.20.10
search platform.svc.cluster.local svc.cluster.local cluster.local
options ndots:5 timeout:1 attempts:5
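The fan-out from one name to eight lookups follows from the `search` list and the `ndots` option. The sketch below reproduces that expansion; the function names are hypothetical, and the ordering is an assumption modeled on glibc-style resolvers, which try the search domains first when the name has fewer dots than `ndots` (the slide lists the bare name first, and other resolvers may order differently).

```python
def parse_resolv_conf(text):
    """Extract the search list and options from resolv.conf-style text."""
    search, options = [], {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "search":
            search = fields[1:]
        elif fields[0] == "options":
            for opt in fields[1:]:
                name, _, value = opt.partition(":")
                options[name] = int(value) if value.isdigit() else True
    return search, options


def candidate_queries(name, search, ndots, record_types=("A", "AAAA")):
    """List the (name, record type) queries one resolution may send."""
    if name.endswith("."):
        names = [name]  # fully qualified: the search list is not applied
    else:
        expanded = [f"{name}.{domain}" for domain in search]
        if name.count(".") >= ndots:
            names = [name] + expanded
        else:
            names = expanded + [name]  # assumed glibc-style ordering
    return [(n, t) for n in names for t in record_types]


if __name__ == "__main__":
    conf = """nameserver 169.254.20.10
search platform.svc.cluster.local svc.cluster.local cluster.local
options ndots:5 timeout:1 attempts:5
"""
    search, options = parse_resolv_conf(conf)
    queries = candidate_queries("server", search, options["ndots"])
    for query in queries:
        print(query)
    print(len(queries), "lookups for one resolution")
```

Passing a fully qualified name with a trailing dot (for example `"server.platform.svc.cluster.local."`) reduces the list to two queries in this sketch, since the search list is skipped.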

Slide 6

Slide 6 text

Conntrack
- Kernel module that tracks inbound and outbound connections, UDP included
- Table format: source and destination IP, source and destination port, and connection state

conntrack -L
tcp ESTABLISHED src=172.24.110.67 dst=172.24.124.188 sport=29290 dport=31520 src=172.24.44.248 dst=172.24.124.188 sport=80 dport=29290
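To show how the table format maps onto an entry, here is a minimal sketch that splits one line of `conntrack -L` output into its original-direction and reply-direction tuples. The function name is hypothetical, and the parser assumes each entry carries exactly two `src/dst/sport/dport` groups, as in the sample entry on the slide.

```python
import re

# Matches the address and port fields of a conntrack entry.
FIELD = re.compile(r"\b(src|dst|sport|dport)=(\S+)")
# Connection states are upper-case words such as ESTABLISHED or TIME_WAIT.
STATE = re.compile(r"^[A-Z_]+$")


def parse_conntrack_line(line):
    """Split a conntrack entry into protocol, state, original tuple and reply tuple."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    state = next((t for t in tokens[1:] if STATE.match(t)), None)
    fields = FIELD.findall(line)
    if len(fields) < 8:
        raise ValueError("expected two src/dst/sport/dport groups")

    def to_tuple(pairs):
        entry = dict(pairs)
        entry["sport"] = int(entry["sport"])
        entry["dport"] = int(entry["dport"])
        return entry

    return {
        "proto": tokens[0],
        "state": state,  # None for entries without a state, e.g. UDP
        "original": to_tuple(fields[:4]),
        "reply": to_tuple(fields[4:8]),
    }


if __name__ == "__main__":
    sample = (
        "tcp ESTABLISHED src=172.24.110.67 dst=172.24.124.188 sport=29290 dport=31520 "
        "src=172.24.44.248 dst=172.24.124.188 sport=80 dport=29290"
    )
    print(parse_conntrack_line(sample))
```

The second tuple describes the packets expected in the reply direction, which is how the kernel matches return traffic to an existing entry.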

Slide 7

Slide 7 text

The Bug
- There is a race condition in conntrack
- It is triggered when two packets are sent via the same socket at the same time
- Packets get dropped
- The DNS lookup remains in a waiting state until it times out
- Only happens for UDP

Slide 8

Slide 8 text

Watch out: AWS VPC limits
- Maximum of 1024 packets per second per network interface to the AWS DNS server
- Remember that a single DNS resolution resulted in 8 lookups
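Combining the two numbers on this slide gives a rough ceiling on resolutions per interface. The sketch below is back-of-the-envelope arithmetic only; it assumes each lookup costs exactly one packet and ignores retries and caching, and the function name is hypothetical.

```python
def max_resolutions_per_second(packet_limit_pps: int = 1024, lookups_per_resolution: int = 8) -> int:
    """Upper bound on full DNS resolutions per second before the packet limit is hit.

    Assumes one packet per lookup and no retries (a simplification).
    """
    if lookups_per_resolution <= 0:
        raise ValueError("lookups_per_resolution must be positive")
    return packet_limit_pps // lookups_per_resolution


if __name__ == "__main__":
    print(max_resolutions_per_second())   # 8 lookups per resolution
    print(max_resolutions_per_second(lookups_per_resolution=2))  # e.g. a fully qualified name
```

With eight lookups per resolution the budget is 128 resolutions per second per interface, so a busy node can reach the limit much sooner than the raw packet figure suggests.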

Slide 9

Slide 9 text

Solution

Slide 10

Slide 10 text

CDN Cache Miss
- Latency can increase significantly when a cache rule is missing in the CDN

Slide 11

Slide 11 text

Analyzing Prometheus Cardinality
https://www.robustperception.io/using-tsdb-analyze-to-investigate-churn-and-cardinality

Slide 12

Slide 12 text

Thank you