
Tales from the Ops Side - My Favourite Errors

Errors can be a source of stress in our professional lives. More often than not, discovering one causes us to drop what we're doing to investigate. As an Ops person, I've come to accept errors as a fact of life, and have even come to appreciate them. Knowing your errors can help you pinpoint the source of a problem quickly. Over the years I've compiled a large list of helpful (and not so helpful) error messages. This session goes through my favourite ones, along with interesting tales of troubleshooting and resolution.

VM Farms Inc.

May 29, 2019

Transcript

  1. Hany Fahim, Founder & CEO, @iHandroid

    My Favourite Errors: Tales From The Ops Side
  2. I love errors

  3. Errors are a part of the job

    • Must embrace the fact that things will break.
    • Come to appreciate various errors and what they mean.
    • When the pager goes off, first question is “what is the error?”
    • Knowing what error messages mean can greatly help speed up an investigation.
    • It hints at where to look first to save valuable time.
  4. Standard Stack

  5. Pager goes off…

  6. First action: Does it ping?

    $ ping mysite.org

    I use the fully qualified domain name (FQDN). Results may reveal a problem high up in the stack.
  7. Non-existent/expired domain

    $ ping mysite.org

  8. Why I like this error

    • Reveals the problem right away.
    • May need renewal, or DNS may be down.
    • Can trick monitoring systems.
    • Follow-up: Add domain expiry monitoring.
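The follow-up above (domain expiry monitoring) can be sketched as a small check. This is a minimal sketch, not the speaker's tooling: the function name `expiry_status` and the 30-day warning window are assumptions, and fetching the real expiry date (for example from WHOIS or RDAP) is deliberately left out.

```python
from datetime import datetime, timedelta, timezone


def expiry_status(expires_at, now, warn_days=30):
    """Classify a domain by how close it is to expiring.

    expires_at and now are timezone-aware datetimes. warn_days is a
    hypothetical threshold, not a value taken from the talk.
    """
    remaining = expires_at - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= timedelta(days=warn_days):
        return "warning"
    return "ok"


# Illustrative dates only: 12 days left falls inside the 30-day window.
now = datetime(2019, 5, 29, tzinfo=timezone.utc)
print(expiry_status(datetime(2019, 6, 10, tzinfo=timezone.utc), now))  # warning
```

A monitoring job could run this daily and page on "warning" rather than waiting for "expired".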
  9. No route to host

  10. Why I like this error

    • Tells you there’s a routing problem somewhere between you and the site.
    • Use traceroute to pinpoint the problem.
    • Might be my local ISP, my cloud provider, or somewhere in between.
    • Time to call upstream.
  11. Time-to-live exceeded

  12. Why I like this error

    • Points me to the network provider as the likely culprit.
    • Can be caused by misconfiguration, failed hardware, or overloads.
    • Problem may lie with your local ISP, your cloud provider, or somewhere in between.
    • traceroute should reveal the problem routes.
  13. $ traceroute -n mysite.org

    1  10.0.1.1  1.682 ms  1.665 ms  1.124 ms
    2  * * *
    3  * * *
    4  104.195.128.21  16.295 ms  16.854 ms  12.788 ms
    5  206.248.155.94  15.829 ms  206.248.155.17  14.371 ms  22.065 ms
    6  198.32.181.56  18.305 ms  13.123 ms  19.671 ms
    7  198.32.181.51  19.665 ms  12.031 ms *
    8  198.32.181.56  18.631 ms  19.121 ms *
    9  198.32.181.51  12.920 ms  14.301 ms *
    …
    64  198.32.181.51  16.259 ms  16.343 ms *

    Notice these hops are bouncing back and forth to each other.

    Most systems stop at 64 hops:
    $ sysctl net.inet.ip.ttl
    net.inet.ip.ttl: 64
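Spotting hops that bounce back and forth can be automated. The sketch below is an illustration rather than anything shown in the talk: it assumes `traceroute -n` style output (numeric addresses only), and the helper name `find_looping_hops` is made up for this example.

```python
import re
from collections import defaultdict

HOP_RE = re.compile(r"^\s*(\d+)\s+(.*)$")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_looping_hops(traceroute_output):
    """Return {address: [hop numbers]} for addresses seen at more than one hop."""
    seen = defaultdict(list)
    for line in traceroute_output.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue
        hop = int(match.group(1))
        # A hop may list several addresses; count each one once per hop.
        for address in sorted(set(IPV4_RE.findall(match.group(2)))):
            seen[address].append(hop)
    return {addr: hops for addr, hops in seen.items() if len(hops) > 1}


sample = """\
 1  10.0.1.1  1.682 ms  1.665 ms  1.124 ms
 2  * * *
 6  198.32.181.56  18.305 ms  13.123 ms  19.671 ms
 7  198.32.181.51  19.665 ms  12.031 ms *
 8  198.32.181.56  18.631 ms  19.121 ms *
 9  198.32.181.51  12.920 ms  14.301 ms *
"""
print(find_looping_hops(sample))
```

An address that shows up at several hop numbers is a strong hint of a routing loop, which is what eventually exhausts the TTL.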
  14. Connection reset by peer

  15. Why I like this error

    • Points me to the proxy as the likely culprit.
    • Issue may be transient (like a process restarting).
    • Persistent errors may indicate an overload.
    • Check the kernel logs on the proxy (if I can).
  16. Connection refused

  17. Why I like this error

    • Points to the proxy as the culprit.
    • Might be due to a downed or crashed process (out-of-memory).
    • Check the proxy!
  18. Connection timed out (TCP)

  19. Why I like this error • I don’t.

  20. Why I hate this error

    • Problem can be anywhere.
    • Can indicate an overload, misconfiguration (firewall, proxy, app, DB), rate limiting, network problem, local system problem, etc.
    • Hope that a better error message is sitting somewhere.
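The TCP-level errors covered so far map onto standard errno values, so the "where to look first" advice can be written down as a lookup. This is a rough sketch: the table wording paraphrases the slides, and the function name `first_place_to_look` is hypothetical.

```python
import errno

# errno value -> first place to look, paraphrasing the slides above.
FIRST_LOOK = {
    errno.EHOSTUNREACH: "routing: run traceroute, then call upstream",
    errno.ECONNRESET: "proxy: transient restart or overload, check kernel logs",
    errno.ECONNREFUSED: "proxy: process is down or has crashed",
    errno.ETIMEDOUT: "anywhere: look for a better error message elsewhere",
}


def first_place_to_look(exc):
    """Map an OSError raised by a socket call to a starting point."""
    return FIRST_LOOK.get(exc.errno, "unknown: read the message itself")


print(first_place_to_look(ConnectionRefusedError(errno.ECONNREFUSED, "Connection refused")))
```

Exceptions without an errno (for example a timeout raised by a client library's own timer) fall through to the "unknown" branch.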
  21. TCP errors are easy to diagnose, and easy to fix*.

    * Usually.
  22. However, most errors I’ve seen are HTTP errors.

    HTTP errors are higher up the stack and are more complex to diagnose and resolve.
  23. None
  24. 4xx

  25. HTTP 400

  26. Why I like this error

    • Indicates the client issued an invalid request (most likely no request at all).
    • Could indicate an interrupted connection (connection reset by peer).
    • Could also be caused by too large a request (8 kB limit on HAProxy).
    • Check to see if there are client-side errors to accompany the HTTP 400.
  27. HTTP 401 HTTP 403

  28. HTTP 402

  29. Why I like this error

    • Pay your bills.
    • Some providers (like Google Developers API) use 402s to notify you of request limits.
    • Shopify uses this to notify of suspended accounts due to non-payment.
  30. HTTP 404

  31. Why I like this error

    • If unexpected, may be due to file permissions.
    • If requests are all from bots (and are persistent), could be due to broken links.
    • If logs are flooded, could be a brute force attack.
    • Sites have gone down due to 404s.
  32. HTTP 405

  33. Why I like this error

    • Tells you the client is trying to reach the wrong endpoint.
    • Seen often on strict API endpoints.
    • If this is unexpected, it could indicate a misconfiguration between the application and web server.
  34. HTTP 408

  35. Why I like this error

    • Unlike other types of timeouts, this almost always indicates a client-side issue.
    • In recent years, this has cropped up more and more due to “TCP pre-connect” optimizations in browsers.
    • To avoid clients seeing this error, tune the proxy to send empty responses.
  36. HAProxy “workaround”

    errorfile 408 /dev/null

    This will send a completely empty response to the client.
  37. HTTP 499

    nginx-specific error: nginx detects its client hung up.

  38. Why I like this error

    • I don’t really.
    • It’s an nginx-specific error, but is likely coupled with a different error elsewhere.
    • Could indicate the end-client giving up on the request (user clicked “Stop” in their browser).
    • Could indicate a timeout at the proxy.
    • Keep looking…
  39. 5xx

  40. HTTP 502

    LOG
    upstream prematurely closed connection while reading response header from upstream
    kevent() reported that connect() failed (61: Connection refused) while connecting to upstream
  41. Why I like this error

    • Could indicate an issue with nginx (unlikely) or the application (likely).
    • Often seen when applications hard-crash (out-of-memory, unexpected error).
    • Also seen when application servers time out (e.g. gunicorn timeouts).
    • Check application and kernel logs.
  42. HTTP 503

  43. Why I like this error

    • Usually manually triggered.
    • Could also be due to overload (no more slots to serve traffic).
    • Most likely transient.
  44. HTTP 504

  45. Why I like this error • I don’t.

  46. Why I hate this error

    • Can be very confusing, especially when app logs are clean.
    • Can be due to mismatched timeouts along the stack.
    • Like most other “timeouts”, true issue may be hidden.
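One common reading of "mismatched timeouts" is an outer layer that gives up before the layer behind it does, so the inner layer finishes normally and logs nothing. The sketch below checks for that pattern. It is an assumption-laden illustration: the layer names and timeout values are hypothetical, and the rule "each outer timeout should exceed the inner one" is a rule of thumb, not something stated in the talk.

```python
def mismatched_timeouts(layers):
    """Find adjacent layers where the outer timeout is not longer than the inner one.

    layers is ordered from the client-facing edge inward, as
    (name, timeout_in_seconds) pairs.
    """
    problems = []
    for (outer, outer_t), (inner, inner_t) in zip(layers, layers[1:]):
        if outer_t <= inner_t:
            problems.append((outer, inner))
    return problems


# Hypothetical stack and values, for illustration only.
stack = [("load balancer", 60), ("nginx", 30), ("gunicorn", 60), ("database", 20)]
print(mismatched_timeouts(stack))  # [('nginx', 'gunicorn')]
```

Here nginx would return a 504 after 30 seconds while gunicorn is still allowed to work for 60, which matches the "app logs are clean" symptom.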
  47. My all-time favourite error…

  48. HTTP 500

  49. Why I LOVE this error

    • My all-time favourite error.
    • Always comes from the application.
    • Because it’s generated from the application, errors are usually more helpful.
    • Even generic errors are helpful since you know it’s coming from elsewhere in the stack.

    May occur higher up in the stack with custom modules (e.g. Lua processing).
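The HTTP hints from the 4xx and 5xx slides can be condensed into a lookup that a runbook or chat bot might use. This is only a sketch: the hint text paraphrases the slides, and the names `HTTP_HINTS` and `triage` are invented for this example.

```python
# Status code -> first hint, paraphrased from the slides above.
HTTP_HINTS = {
    400: "invalid or oversized client request; look for client-side errors",
    402: "billing problem or provider request limit",
    404: "file permissions, broken links, or brute-force scanning",
    405: "client hitting the wrong endpoint, or app/web server misconfiguration",
    408: "client-side timeout, often browser TCP pre-connect",
    499: "nginx saw the client hang up; look for a paired error elsewhere",
    500: "application error; read the application logs",
    502: "application crashed or timed out; check application and kernel logs",
    503: "manually triggered or overloaded; likely transient",
    504: "mismatched timeouts along the stack; true issue may be hidden",
}


def triage(status):
    """Return the first hint for an HTTP status code, if the talk covered it."""
    return HTTP_HINTS.get(status, "not covered in this talk")


print(triage(502))
```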
  50. Honourable Mentions

    • couldn't obtain random bytes
    • operation not permitted
    • socket timeout after 10 seconds
    • segmentation fault
  51. My all-time least favourite error…

  52. No error…

  53. Client notifies us that their site was down.

  54. All green. No alerts or errors being reported.

  55. None
  56. Questions…

    • Why was this working for some clients and not others?
    • Called other team members, it was working for them too.
    • Taking a look at the graphs…
  57. [Graph: External Traffic over time]
  58. More questions…

    • Works for some, but not others?
    • Team member noted site wasn’t rendering on their phone. Confirmed!
  59. Could this indicate a routing problem? Phone was on wifi.

  60. More questions…

    • Same network, different clients, different results?
    • What was the difference?
    • Remembered laptop auto-connects to our VPN.
  61. The fact that it was working rules out a routing problem.

    Where is the error? Why was it hanging?
  62. Try telnet

    $ telnet brokensite.net 80
    Trying x.x.x.x...
    Connected to brokensite.net.
    Escape character is '^]'.
    GET / HTTP/1.0

    Something is responding! No response. Not even a timeout. The IP is correct.
  63. Is the request actually making it to the proxy?

  64. On the proxy…

    $ tcpdump -A -s 0 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)' -i eth0
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
    14:02:02.779102 IP y.y.y.y.47553 > x.x.x.x.80: Flags [P.], seq 3426796935:3426797367, ack 834604261, win 4117, options [nop,nop,TS val 230425600 ecr 4106181590], length 432
    E....H@.4.......k......Q.@..1.............. .....[.GET / HTTP/1.1
    Host: brokensite.net
    Connection: keep-alive
    Upgrade-Insecure-Requests: 1
    User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
    Accept-Encoding: gzip, deflate
    Accept-Language: en-US,en;q=0.9,gl;q=0.8
    ^C
    1 packets captured
    3 packets received by filter
    0 packets dropped by kernel

    • Fancy way to sniff traffic and get full request/response cycle.
    • Normal request coming through.
    • Absolutely nothing came through.
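The filter expression in the tcpdump command keeps only packets that carry TCP payload: IP total length, minus IP header length, minus TCP header length, must be non-zero. The same arithmetic is worked through below as a sketch. The helper names are hypothetical, the packet is synthetic, and the 32-byte TCP header is an assumption based on the 12 bytes of options (nop, nop, timestamp) shown in the capture.

```python
import struct


def tcp_payload_length(packet):
    """Compute TCP payload size for raw bytes that start at the IPv4 header."""
    total_length = struct.unpack("!H", packet[2:4])[0]  # ip[2:2]
    ip_header_len = (packet[0] & 0x0F) << 2             # (ip[0]&0xf)<<2
    data_offset_byte = packet[ip_header_len + 12]       # tcp[12]
    tcp_header_len = (data_offset_byte & 0xF0) >> 2     # (tcp[12]&0xf0)>>2
    return total_length - ip_header_len - tcp_header_len


def fake_packet(payload_len, tcp_header_len=32):
    """Build a synthetic packet with only the fields the filter reads filled in."""
    ip = bytearray(20)
    ip[0] = 0x45  # IPv4, header length of 5 words (20 bytes)
    ip[2:4] = struct.pack("!H", 20 + tcp_header_len + payload_len)
    tcp = bytearray(tcp_header_len)
    tcp[12] = (tcp_header_len // 4) << 4  # data offset in 32-bit words
    return bytes(ip + tcp + bytes(payload_len))


print(tcp_payload_length(fake_packet(432)))  # 432, matching "length 432" in the capture
print(tcp_payload_length(fake_packet(0)))    # 0, so a bare ACK would be filtered out
```

This is why the capture shows only the HTTP request and not the handshake packets: SYN, SYN-ACK and plain ACK segments have no payload.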
  65. Who am I talking to?

    • Where is my manual request?
    • Not even getting standard TCP packets (SYN, SYN-ACK, ACK) for established connections.
    • Who am I connecting to?
  66. Phoned upstream

    • Any reported issues?
    • Nothing unusual.
    • Except for the non-service impacting maintenance.
  67. DDoS Protection

    • Upstream’s upstream was installing new DDoS mitigation hardware.
    • Should not have any impact: configured in passive mode.
    • Can we confirm?
  68. DDoS Protection Should be configured to “passive” mode.

  69. Oops…

    • Turns out “passive” was pretty active.
    • Instead of configuring it to true passive, it was set to active with very high thresholds.
    • Our customer was crossing these thresholds.
  70. More questions?

    • Why did it just hang?
        • Device had no “rules” configured to handle threshold breaches.
    • Why was “passive” mode actually active?
        • A last-minute decision by the network engineering team.
        • Passive mode would completely disable the device.
        • Wanted to observe the activity of the new device to tune.
  71. Lessons and follow-up

    • Turn on monitoring-side timeouts, with high thresholds.
    • Get notified for all maintenance events.
    • No errors are no fun. Keep looking.
    • Timeouts are necessary, but evil. Getting a late page is better than no page.

    No Error < Timeout Errors < Other Errors
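A monitoring-side timeout of the kind recommended above can be sketched with a plain socket probe: connect, send a request, and treat silence as a failure instead of waiting forever. This is a minimal sketch under assumptions: the function name `probe`, the 5-second default and the return strings are all hypothetical, and a real check would more likely live in an existing monitoring tool.

```python
import socket


def probe(host, port, timeout=5.0):
    """Send a bare HTTP request and report whether any byte comes back in time."""
    request = b"GET / HTTP/1.0\r\nHost: " + host.encode("ascii") + b"\r\n\r\n"
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            data = sock.recv(1)
    except socket.timeout:
        # Connected (or tried to) but heard nothing: the "no error" case.
        return "timeout"
    except OSError as exc:
        return "error: " + str(exc)
    return "ok" if data else "empty"


# Example call (hostname taken from the slides, not run here):
# probe("brokensite.net", 80, timeout=10.0)
```

In the incident described, a check like this would have returned "timeout" and paged late, which the talk argues is still better than no page at all.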
  72. Hany Fahim Founder & CEO @iHandroid Thank you.