Tales from the Ops Side - My Favourite Errors

My Favourite Errors
Tales From The Ops Side
Hany Fahim, Founder & CEO, @iHandroid

I love errors

Errors are a part of the job
• Must embrace the fact that things will break.
• Come to appreciate various errors and what they mean.
• When the pager goes off, the first question is “what is the error?”
• Knowing what error messages mean can greatly help speed up an investigation.
• It hints at where to look first to save valuable time.

Standard Stack

Pager goes off…

First action: Does it ping?
$ ping mysite.org
I use the fully qualified domain name (FQDN).
Results may reveal a problem high up in the stack.

Non-existent/expired domain
$ ping mysite.org

Why I like this error
• Reveals the problem right away.
• May need renewal, or DNS may be down.
• Can trick monitoring systems.
• Follow-up: Add domain expiry monitoring.
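
The follow-up above can be sketched as a simple threshold check. This is a minimal sketch, not a full monitor: it assumes the expiry date has already been fetched from somewhere (a registrar API or WHOIS lookup, neither shown here), and the names `expiry_alert` and `warn_days` and the 30-day default are hypothetical.

```python
from datetime import date


def expiry_alert(expiry: date, today: date, warn_days: int = 30) -> str:
    """Classify a domain's expiry date as 'expired', 'warning', or 'ok'.

    `expiry` is assumed to come from a registrar or WHOIS lookup done
    elsewhere; `warn_days` is an illustrative threshold, not a standard.
    """
    remaining = (expiry - today).days
    if remaining < 0:
        return "expired"
    if remaining <= warn_days:
        return "warning"
    return "ok"
```

Passing `today` in explicitly keeps the check easy to test; a real monitor would call it with `date.today()` on a schedule and page on anything other than `"ok"`.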

No route to host

Why I like this error
• Tells you there’s a routing problem somewhere between you and the site.
• Use traceroute to pinpoint the problem.
• Might be my local ISP, my cloud provider, or somewhere in between.
• Time to call upstream.

Time-to-live exceeded

Why I like this error
• Points me to the network provider as the likely culprit.
• Can be caused by misconfiguration, failed hardware, or overloads.
• Problem may lie with your local ISP, your cloud provider, or somewhere in between.
• traceroute should reveal the problem routes.

$ traceroute -n mysite.org
 1  10.0.1.1  1.682 ms  1.665 ms  1.124 ms
 2  * * *
 3  * * *
 4  104.195.128.21  16.295 ms  16.854 ms  12.788 ms
 5  206.248.155.94  15.829 ms  206.248.155.17  14.371 ms  22.065 ms
 6  198.32.181.56  18.305 ms  13.123 ms  19.671 ms
 7  198.32.181.51  19.665 ms  12.031 ms  *
 8  198.32.181.56  18.631 ms  19.121 ms  *
 9  198.32.181.51  12.920 ms  14.301 ms  *
…
64  198.32.181.51  16.259 ms  16.343 ms  *

Notice these hops are bouncing back and forth to each other.

Most systems stop at 64 hops:
$ sysctl net.inet.ip.ttl
net.inet.ip.ttl: 64
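
The bouncing pattern in the traceroute output can also be spotted mechanically. The sketch below assumes the output has already been parsed into one address per hop, with `None` for a `* * *` row and only the first address kept when a hop reports several; the function name `looping_addresses` is hypothetical.

```python
from typing import List, Optional


def looping_addresses(hops: List[Optional[str]]) -> List[str]:
    """Return addresses that answer at more than one hop, in first-repeat order.

    On a healthy path each router should appear once, so any repeat is a
    hint that packets are bouncing between routers.
    """
    seen = set()
    repeated: List[str] = []
    for address in hops:
        if address is None:  # '* * *' row: no reply at this hop
            continue
        if address in seen and address not in repeated:
            repeated.append(address)
        seen.add(address)
    return repeated


# Hops 1-9 from the traceroute above (first address only at hop 5).
path = [
    "10.0.1.1", None, None, "104.195.128.21", "206.248.155.94",
    "198.32.181.56", "198.32.181.51", "198.32.181.56", "198.32.181.51",
]
print(looping_addresses(path))  # ['198.32.181.56', '198.32.181.51']
```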

Connection reset by peer

Why I like this error
• Points me to the proxy as the likely culprit.
• Issue may be transient (like a process restarting).
• Persistent errors may indicate an overload.
• Check the kernel logs on the proxy (if I can).

Connection refused

Why I like this error
• Points to the proxy as the culprit.
• Might be due to a downed or crashed process (out-of-memory).
• Check the proxy!

Connection timed out (TCP)

Why I like this error
• I don’t.

Why I hate this error
• Problem can be anywhere.
• Can indicate an overload, misconfiguration (firewall, proxy, app, DB), rate limiting, network problem, local system problem, etc…
• Hope that a better error message is sitting somewhere.

TCP errors are easy to diagnose, and easy to fix*.
* Usually.
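
The connection-level hints so far can be collected into a small triage helper. This is a sketch only: the hint strings paraphrase the slides, the names `HINTS` and `triage` are hypothetical, and "time-to-live exceeded" is left out because it arrives as an ICMP message rather than a socket error.

```python
import errno
import socket

# Hint text paraphrased from the slides; the mapping itself is illustrative.
HINTS = {
    errno.EHOSTUNREACH: "no route to host: routing problem, run traceroute and call upstream",
    errno.ECONNRESET: "connection reset by peer: check the proxy, may be transient or an overload",
    errno.ECONNREFUSED: "connection refused: check the proxy process, it may be down or crashed",
    errno.ETIMEDOUT: "connection timed out: could be anywhere, look for a better error elsewhere",
}


def triage(exc: OSError) -> str:
    """Map a socket-level exception to a first place to look."""
    # gaierror carries resolver codes, not errno values, so test it first.
    if isinstance(exc, socket.gaierror):
        return "name resolution failed: check domain expiry and DNS"
    # socket.timeout may have no errno set, so match it by type.
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return HINTS[errno.ETIMEDOUT]
    return HINTS.get(exc.errno, "unrecognised error: keep looking")
```

A caller would wrap something like `socket.create_connection((host, 80), timeout=5)` in `try`/`except OSError as exc` and log `triage(exc)` alongside the raw error.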

However, most errors I’ve seen are HTTP errors. HTTP errors are higher up the stack and are more complex to diagnose and resolve.

HTTP 400

Why I like this error
• Indicates the client issued an invalid request (most likely no request at all).
• Could indicate an interrupted connection (connection reset by peer).
• Could also be caused by too large a request (8 KB limit on HAProxy).
• Check to see if there are client-side errors to accompany the HTTP 400.

HTTP 401 HTTP 403

HTTP 402

Why I like this error
• Pay your bills.
• Some providers (like Google Developers API) use 402s to notify you of request limits.
• Shopify uses this to notify of suspended accounts due to non-payment.

HTTP 404

Why I like this error
• If unexpected, may be due to file permissions.
• If requests are all from bots (and are persistent), could be due to broken links.
• If logs are flooded, could be a brute force attack.
• Sites have gone down due to 404s.

HTTP 405

Why I like this error
• Tells you the client is trying to reach the wrong endpoint.
• Seen often on strict API endpoints.
• If this is unexpected, it could indicate a misconfiguration between the application and web server.

HTTP 408

Why I like this error
• Unlike other types of timeouts, this almost always indicates a client-side issue.
• In recent years, this has cropped up more and more due to “TCP pre-connect” optimizations in browsers.
• To avoid clients seeing this error, tune the proxy to send empty responses.

HAProxy “workaround”
errorfile 408 /dev/null
This will send a completely empty response to the client.

HTTP 499
nginx-specific error: nginx detects its client hung up.

Why I like this error
• I don’t really.
• It’s an nginx-specific error, but is likely coupled with a different error elsewhere.
• Could indicate the end-client giving up on the request (user clicked “Stop” in their browser).
• Could indicate a timeout at the proxy.
• Keep looking…

HTTP 502
Log:
upstream prematurely closed connection while reading response header from upstream
kevent() reported that connect() failed (61: Connection refused) while connecting to upstream

Why I like this error
• Could indicate an issue with nginx (unlikely) or the application (likely).
• Often seen when applications hard-crash (out-of-memory, unexpected error).
• Also seen when application servers time out (e.g. gunicorn timeouts).
• Check application and kernel logs.

HTTP 503

Why I like this error
• Usually manually triggered.
• Could also be due to overload (no more slots to serve traffic).
• Most likely transient.

HTTP 504

Why I like this error
• I don’t.

Why I hate this error
• Can be very confusing, especially when app logs are clean.
• Can be due to mismatched timeouts along the stack.
• Like most other “timeouts”, true issue may be hidden.
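
One way timeouts end up mismatched, assumed here for illustration, is an outer layer (such as the proxy) giving up sooner than the layer behind it, so the client gets a 504 while the application finishes normally and logs nothing. The sketch below flags that ordering; the function name, layer names, and values are hypothetical.

```python
from typing import List, Tuple


def mismatched_timeouts(layers: List[Tuple[str, float]]) -> List[str]:
    """Flag adjacent layers where the outer one times out before the inner one.

    `layers` is ordered from the layer closest to the client to the one
    furthest away, each as (name, timeout in seconds).
    """
    warnings = []
    for (outer, outer_t), (inner, inner_t) in zip(layers, layers[1:]):
        if outer_t < inner_t:
            warnings.append(
                f"{outer} ({outer_t}s) gives up before {inner} ({inner_t}s)"
            )
    return warnings


stack = [("proxy", 30.0), ("app", 60.0), ("db", 20.0)]
print(mismatched_timeouts(stack))  # ['proxy (30.0s) gives up before app (60.0s)']
```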

My all-time favourite error…

HTTP 500

Why I LOVE this error
• My all-time favourite error.
• Always comes from the application.
• Because it’s generated from the application, errors are usually more helpful.
• Even generic errors are helpful since you know it’s coming from elsewhere in the stack.

May occur higher up in the stack with custom modules (e.g. Lua processing).

Honourable Mentions
• couldn't obtain random bytes
• operation not permitted
• socket timeout after 10 seconds
• segmentation fault

My all-time least favourite error…

No error…

Client notifies us that their site was down.

All green. No alerts or errors being reported.

Questions…
• Why was this working for some clients and not others?
• Called other team members, it was working for them too.
• Taking a look at the graphs…

[Graph: External Traffic, 21:00 to 19:00]

More questions…
• Works for some, but not others?
• Team member noted site wasn’t rendering on their phone. Confirmed!

Could this indicate a routing problem?
Phone was on wifi.

More questions…
• Same network, different clients, different results?
• What was the difference?
• Remembered laptop auto-connects to our VPN.

The fact that it was working rules out a routing problem.
Where is the error? Why was it hanging?

Try telnet
$ telnet brokensite.net 80
Trying x.x.x.x...
Connected to brokensite.net.
Escape character is '^]'.
GET / HTTP/1.0

• The IP is correct.
• Something is responding!
• No response. Not even a timeout.

Is the request actually making it to the proxy?

On the proxy…
$ tcpdump -A -s 0 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)' -i eth0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
14:02:02.779102 IP y.y.y.y.47553 > x.x.x.x.80: Flags [P.], seq 3426796935:3426797367, ack 834604261, win 4117, options [nop,nop,TS val 230425600 ecr 4106181590], length 432
.............. .....[.GET / HTTP/1.1
Host: brokensite.net
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9,gl;q=0.8
^C
1 packets captured
3 packets received by filter
0 packets dropped by kernel

• Fancy way to sniff traffic and get the full request/response cycle.
• Normal request coming through.
• Absolutely nothing came through for my manual telnet request.
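
The arithmetic in the capture filter is easier to follow written out. The filter keeps only packets whose TCP payload is non-empty: total IP length, minus the IP header length, minus the TCP header length. The sketch below repeats that calculation on a raw IPv4 packet; the function name is hypothetical and it assumes the bytes start at the IP header.

```python
import struct


def tcp_payload_length(packet: bytes) -> int:
    """Compute the TCP payload size the same way the tcpdump filter does."""
    total_length = struct.unpack("!H", packet[2:4])[0]  # ip[2:2]
    ip_header_len = (packet[0] & 0x0F) << 2             # (ip[0]&0xf)<<2
    tcp_segment = packet[ip_header_len:]
    tcp_header_len = (tcp_segment[12] & 0xF0) >> 2      # (tcp[12]&0xf0)>>2
    return total_length - ip_header_len - tcp_header_len


# A made-up packet: 20-byte IP header, 20-byte TCP header, 5-byte payload.
ip_header = bytes([0x45, 0x00]) + struct.pack("!H", 45) + bytes(16)
tcp_header = bytes(12) + bytes([0x50]) + bytes(7)
print(tcp_payload_length(ip_header + tcp_header + b"GET /"))  # 5
```

A result of 0 means a bare SYN, ACK, or FIN, which the filter drops. That is why the capture shows only packets carrying request or response data.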

Who am I talking to?
• Where is my manual request?
• Not even getting standard TCP packets (SYN, SYN-ACK, ACK) for established connections.
• Who am I connecting to?

Phoned upstream
• Any reported issues?
• Nothing unusual.
• Except for the non-service-impacting maintenance.

DDoS Protection
• Upstream’s upstream was installing new DDoS mitigation hardware.
• Should not have any impact - configured in passive mode.
• Can we confirm?

DDoS Protection
Should be configured to “passive” mode.

Oops…
• Turns out “passive” was pretty active.
• Instead of configuring it to true passive, it was set to active with very high thresholds.
• Our customer was crossing these thresholds.

More questions?
• Why did it just hang?
  • Device had no “rules” configured to handle threshold breaches.
• Why was “passive” mode actually active?
  • A last-minute decision by the network engineering team.
  • Passive mode would completely disable the device.
  • Wanted to observe the activity of the new device to tune it.

Lessons and follow-up
• Turn on monitoring-side timeouts, with high thresholds.
• Get notified for all maintenance events.
• No errors are no fun. Keep looking.
• Timeouts are necessary, but evil. Getting a late page is better than no page.

No Error < Timeout Errors < Other Errors
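
The first lesson, a monitoring-side timeout, can be sketched as a probe that treats "connected but silent" as a result in its own right, which is the state the telnet session was stuck in. This is a minimal sketch: `probe_http`, its return strings, and the 5-second default are hypothetical, and a real check would use whatever the monitoring system provides.

```python
import socket


def probe_http(host: str, port: int = 80, timeout: float = 5.0) -> str:
    """Send a plain HTTP/1.0 GET and report 'ok', 'closed', 'timeout', or 'error: ...'."""
    try:
        # The timeout applies to the connect and to every later send/recv.
        with socket.create_connection((host, port), timeout=timeout) as conn:
            request = f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
            conn.sendall(request)
            data = conn.recv(1024)
    except socket.timeout:
        # Connected (or tried to) but nothing came back in time: page on this.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
    return "ok" if data else "closed"
```

Setting the timeout generously (a "high threshold") means the page arrives late, but it does arrive, which fits the ordering No Error < Timeout Errors < Other Errors.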

Thank you.
Hany Fahim, Founder & CEO, @iHandroid
