Tales from the Ops Side - My Favourite Errors

Errors can be a source of stress in our professional lives. More often than not, discovering one causes us to drop what we're doing to investigate. As an Ops person, I've come to accept errors as a fact of life, and have even come to appreciate them. Knowing your errors can help you pinpoint the source of a problem quickly. Over the years I've compiled a large list of helpful (and not so helpful) error messages. This session goes through my favourite ones, along with interesting tales of troubleshooting and resolution.

Hany Fahim

May 29, 2019
Transcript

  1. Errors are a part of the job
     • Must embrace the fact that things will break.
     • Come to appreciate various errors and what they mean.
     • When the pager goes off, first question is “what is the error?”
     • Knowing what error messages mean can greatly help speed up an investigation.
     • It hints at where to look first to save valuable time.
  2. First action: Does it ping?
     $ ping mysite.org
     • I use the fully qualified domain name (FQDN).
     • Results may reveal a problem high up in the stack.
  3. Why I like this error
     • Reveals the problem right away.
     • May need renewal, or DNS may be down.
     • Can trick monitoring systems.
     • Follow-up: Add domain expiry monitoring.
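The slide text does not reproduce the error message itself. Assuming it is a name-resolution failure, a monitoring check for it could look roughly like the Python sketch below. `check_resolution` and its injectable `resolver` parameter are hypothetical names chosen for illustration; only `socket.getaddrinfo` and `socket.gaierror` come from the standard library.

```python
import socket
from typing import Callable


def check_resolution(fqdn: str, resolver: Callable = socket.getaddrinfo) -> str:
    """Report whether `fqdn` resolves, without hiding the resolver's own error."""
    try:
        results = resolver(fqdn, None)
    except socket.gaierror as exc:
        # Expired domains and DNS outages both surface here.
        return f"{fqdn}: does not resolve ({exc})"
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    addresses = sorted({info[4][0] for info in results})
    return f"{fqdn}: resolves to {', '.join(addresses)}"
```

Passing the resolver in as an argument keeps the check testable without touching real DNS. A check like this still says nothing about when the domain expires, so it complements expiry monitoring rather than replacing it.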
  4. Why I like this error
     • Tells you there’s a routing problem somewhere between you and the site.
     • Use traceroute to pinpoint the problem.
     • Might be my local ISP, my cloud provider, or somewhere in between.
     • Time to call upstream.
  5. Why I like this error
     • Points me to the network provider as the likely culprit.
     • Can be caused by misconfiguration, failed hardware, or overloads.
     • Problem may lie with your local ISP, your cloud provider, or somewhere in between.
     • traceroute should reveal the problem routes.
  6. $ traceroute -n mysite.org
      1  10.0.1.1  1.682 ms  1.665 ms  1.124 ms
      2  * * *
      3  * * *
      4  104.195.128.21  16.295 ms  16.854 ms  12.788 ms
      5  206.248.155.94  15.829 ms  206.248.155.17  14.371 ms  22.065 ms
      6  198.32.181.56  18.305 ms  13.123 ms  19.671 ms
      7  198.32.181.51  19.665 ms  12.031 ms  *
      8  198.32.181.56  18.631 ms  19.121 ms  *
      9  198.32.181.51  12.920 ms  14.301 ms  *
      …
     64  198.32.181.51  16.259 ms  16.343 ms  *
     • Notice these hops are bouncing back and forth to each other.
     • Most systems stop at 64 hops:
     $ sysctl net.inet.ip.ttl
     net.inet.ip.ttl: 64
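The back-and-forth pattern in the traceroute output above can also be spotted mechanically. The following is a minimal Python sketch, assuming the hops have already been parsed into one responding address per hop (with `None` for a silent `* * *` hop). `find_routing_loop` and `min_repeats` are hypothetical names for illustration, not part of traceroute.

```python
from typing import Optional, Sequence, Tuple


def find_routing_loop(
    hops: Sequence[Optional[str]], min_repeats: int = 2
) -> Optional[Tuple[str, str]]:
    """Return the two addresses that alternate at the end of the path, or None."""
    answered = [h for h in hops if h is not None]
    if len(answered) < 2 * min_repeats:
        return None
    a, b = answered[-2], answered[-1]
    if a == b:
        return None
    # Walk backwards while the path keeps alternating b, a, b, a, ...
    count = 0
    expected = b
    for hop in reversed(answered):
        if hop != expected:
            break
        count += 1
        expected = a if expected == b else b
    return (a, b) if count >= 2 * min_repeats else None
```

Requiring at least `min_repeats` full alternations is a design choice to avoid flagging a path that merely ends on two distinct routers.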
  7. Why I like this error
     • Points me to the proxy as the likely culprit.
     • Issue may be transient (like a process restarting).
     • Persistent errors may indicate an overload.
     • Check the kernel logs on the proxy (if I can).
  8. Why I like this error
     • Points to the proxy as the culprit.
     • Might be due to a downed or crashed process (out-of-memory).
     • Check the proxy!
  9. Why I hate this error
     • Problem can be anywhere.
     • Can indicate an overload, misconfiguration (firewall, proxy, app, DB), rate limiting, network problem, local system problem, etc.
     • Hope that a better error message is sitting somewhere.
  10. However, most errors I’ve seen are HTTP errors. HTTP errors are higher up the stack and are more complex to diagnose and resolve.
  11. 4xx

  12. Why I like this error
      • Indicates the client issued an invalid request (most likely no request at all).
      • Could indicate an interrupted connection (connection reset by peer).
      • Could also be caused by a request that is too large (8 KB limit on HAProxy).
      • Check to see if there are client-side errors to accompany the HTTP 400.
  13. Why I like this error
      • Pay your bills.
      • Some providers (like Google Developers API) use 402s to notify you of request limits.
      • Shopify uses this to notify of accounts suspended due to non-payment.
  14. Why I like this error
      • If unexpected, may be due to file permissions.
      • If requests are all from bots (and are persistent), could be due to broken links.
      • If logs are flooded, could be a brute force attack.
      • Sites have gone down due to 404s.
  15. Why I like this error
      • Tells you the client is trying to reach the wrong endpoint.
      • Seen often on strict API endpoints.
      • If this is unexpected, it could indicate a misconfiguration between the application and web server.
  16. Why I like this error
      • Unlike other types of timeouts, this almost always indicates a client-side issue.
      • In recent years, this has cropped up more and more due to “TCP pre-connect” optimizations in browsers.
      • To avoid clients seeing this error, tune the proxy to send empty responses.
  17. Why I like this error
      • I don’t really.
      • It’s an nginx-specific error, but is likely coupled with a different error elsewhere.
      • Could indicate the end-client giving up on the request (user clicked “Stop” in their browser).
      • Could indicate a timeout at the proxy.
      • Keep looking…
  18. 5xx

  19. HTTP 502
      LOG:
      upstream prematurely closed connection while reading response header from upstream
      kevent() reported that connect() failed (61: Connection refused) while connecting to upstream
  20. Why I like this error
      • Could indicate an issue with nginx (unlikely) or the application (likely).
      • Often seen when applications hard-crash (out-of-memory, unexpected error).
      • Also seen when application servers time out (e.g., gunicorn timeouts).
      • Check application and kernel logs.
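The two nginx log messages quoted on the HTTP 502 slide point in slightly different directions, so they are worth telling apart during triage. The Python sketch below is a rough illustration. `triage_502` is a hypothetical helper, and the hint strings are my reading of the messages rather than anything nginx emits.

```python
def triage_502(log_line: str) -> str:
    """Map an nginx error-log line for a 502 to a first place to look."""
    if "upstream prematurely closed connection" in log_line:
        # The upstream accepted the connection, then dropped it mid-request.
        return "upstream dropped the request: check for crashes or worker timeouts"
    if "Connection refused" in log_line:
        # Nothing accepted the connection on the upstream address and port.
        return "nothing is listening upstream: check that the process is running"
    return "unrecognised message: keep looking"
```

Substring matching is deliberately loose here, since the surrounding log fields vary by platform (for example `kevent()` on BSD-derived systems).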
  21. Why I like this error
      • Usually manually triggered.
      • Could also be due to overload (no more slots to serve traffic).
      • Most likely transient.
  22. Why I hate this error
      • Can be very confusing, especially when app logs are clean.
      • Can be due to mismatched timeouts along the stack.
      • Like most other “timeouts”, true issue may be hidden.
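One way to catch mismatched timeouts is to list each layer's timeout from the client-facing edge inwards and flag any layer that gives up before the layer behind it. The Python sketch below assumes the common convention that outer timeouts should be at least as long as inner ones. That rule, the helper name `find_timeout_mismatches`, and the layer names and values in the usage example are illustrative assumptions, not taken from the talk.

```python
from typing import List, Tuple


def find_timeout_mismatches(layers: List[Tuple[str, float]]) -> List[str]:
    """`layers` is ordered from the edge inwards: (name, timeout in seconds)."""
    problems = []
    # Compare each layer with the one directly behind it.
    for (outer_name, outer_t), (inner_name, inner_t) in zip(layers, layers[1:]):
        if outer_t < inner_t:
            problems.append(
                f"{outer_name} ({outer_t}s) gives up before {inner_name} ({inner_t}s)"
            )
    return problems
```

For example, `find_timeout_mismatches([("load balancer", 30), ("nginx", 60), ("gunicorn", 30)])` reports that the load balancer gives up before nginx. In that situation the client sees a timeout while the inner layers may still log a clean request.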
  23. Why I LOVE this error
      • My all-time favourite error.
      • Always comes from the application.
      • Because it’s generated from the application, errors are usually more helpful.
      • Even generic errors are helpful since you know it’s coming from elsewhere in the stack.
      • May occur higher up in the stack with custom modules (e.g., Lua processing).
  24. Honourable Mentions
      • couldn't obtain random bytes
      • operation not permitted
      • socket timeout after 10 seconds
      • segmentation fault
  25. Questions…
      • Why was this working for some clients and not others?
      • Called other team members, it was working for them too.
      • Taking a look at the graphs…
  26. More questions…
      • Works for some, but not others?
      • Team member noted site wasn’t rendering on their phone. Confirmed!
  27. More questions…
      • Same network, different clients, different results?
      • What was the difference?
      • Remembered laptop auto-connects to our VPN.
  28. The fact that it was working rules out a routing problem. Where is the error? Why was it hanging?
  29. Try telnet
      $ telnet brokensite.net 80
      Trying x.x.x.x...
      Connected to brokensite.net.
      Escape character is '^]'.
      GET / HTTP/1.0
      • The IP is correct.
      • Something is responding!
      • No response. Not even a timeout.
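The manual telnet test above can be approximated in a script that enforces its own timeout, so a silent hang becomes a reportable result. This is a minimal Python sketch using only the standard `socket` module. `probe_http` and its result strings are hypothetical, and the 5-second default is an arbitrary assumption.

```python
import socket


def probe_http(host: str, port: int = 80, timeout: float = 5.0) -> str:
    """Connect, send a minimal HTTP/1.0 request, and report how far we got."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return "connect timed out"
    except OSError as exc:
        return f"connect failed: {exc}"
    with sock:
        try:
            sock.sendall(f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii"))
            data = sock.recv(4096)
        except socket.timeout:
            # The case from the slide: something accepted the connection, then silence.
            return "connected, but no response before timeout"
        except OSError as exc:
            return f"connection error: {exc}"
    if not data:
        return "connected, but peer closed without responding"
    # Return only the status line, e.g. the first line of the HTTP response.
    return data.split(b"\r\n", 1)[0].decode("latin-1")
```

Separating "connect timed out" from "connected, but no response" mirrors the distinction the telnet session revealed: the TCP handshake succeeded, yet the request went unanswered.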
  30. On the proxy…
      $ tcpdump -A -s 0 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)' -i eth0
      tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
      listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
      14:02:02.779102 IP y.y.y.y.47553 > x.x.x.x.80: Flags [P.], seq 3426796935:3426797367, ack 834604261, win 4117, options [nop,nop,TS val 230425600 ecr 4106181590], length 432
      .............. .....[.GET / HTTP/1.1
      Host: brokensite.net
      Connection: keep-alive
      Upgrade-Insecure-Requests: 1
      User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
      Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
      Accept-Encoding: gzip, deflate
      Accept-Language: en-US,en;q=0.9,gl;q=0.8
      ^C
      1 packets captured
      3 packets received by filter
      0 packets dropped by kernel
      • Fancy way to sniff traffic and get full request/response cycle.
      • Normal request coming through.
      • Absolutely nothing came through.
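The tcpdump filter above keeps only packets that carry data: it takes the IP total length, subtracts the IP header length and the TCP header length, and matches when the remainder is non-zero. The Python sketch below performs the same arithmetic on a raw IPv4 packet. `tcp_payload_length` is a hypothetical helper, and it assumes the buffer starts at the IP header.

```python
import struct


def tcp_payload_length(packet: bytes) -> int:
    """Payload bytes in a raw IPv4/TCP packet, mirroring the filter's arithmetic."""
    total_length = struct.unpack("!H", packet[2:4])[0]  # ip[2:2]
    ip_header_len = (packet[0] & 0x0F) << 2             # (ip[0] & 0xf) << 2
    tcp_offset_byte = packet[ip_header_len + 12]        # tcp[12]
    tcp_header_len = (tcp_offset_byte & 0xF0) >> 2      # (tcp[12] & 0xf0) >> 2
    return total_length - ip_header_len - tcp_header_len
```

Both header lengths are stored as counts of 32-bit words, which is why the filter shifts them: the IP value sits in the low nibble and is multiplied by 4, while the TCP value sits in the high nibble, so shifting right by 2 divides by 16 and multiplies by 4 in one step. Bare SYN and ACK packets come out as 0 and are filtered away.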
  31. Who am I talking to?
      • Where is my manual request?
      • Not even getting standard TCP packets (SYN, SYN-ACK, ACK) for established connections.
      • Who am I connecting to?
  32. Phoned upstream
      • Any reported issues?
      • Nothing unusual.
      • Except for the non-service-impacting maintenance.
  33. DDoS Protection
      • Upstream’s upstream was installing new DDoS mitigation hardware.
      • Should not have any impact: configured in passive mode.
      • Can we confirm?
  34. Oops…
      • Turns out “passive” was pretty active.
      • Instead of configuring it to true passive, it was set to active with very high thresholds.
      • Our customer was crossing these thresholds.
  35. More questions?
      • Why did it just hang?
        Device had no “rules” configured to handle threshold breaches.
      • Why was “passive” mode actually active?
        A last-minute decision by the network engineering team. Passive mode would completely disable the device, and they wanted to observe the activity of the new device in order to tune it.
  36. Lessons and follow-up
      • Turn on monitoring-side timeouts, with high thresholds.
      • Get notified for all maintenance events.
      • No errors are no fun. Keep looking.
      • Timeouts are necessary, but evil. Getting a late page is better than no page.
      No Error < Timeout Errors < Other Errors