Slide 85
© 2021 Hound Technology, Inc. All Rights Reserved.
Honeycomb Ingest Outage
● In November, we were working on OTLP and gRPC ingest support
● Let a commit deploy that attempted to bind to a privileged port
● Stopped the deploy in time, but scale-ups were trying to use the new build
● Latency shot up, remediation took more than 10 minutes, and we blew our SLO
Except it was binding to a privileged port and crashing on startup.
We managed to stop the deploy in time, thanks to a previous outage where we had pushed out a bunch of binaries that didn't build; that incident left us with health checks in place that would stop a bad build from rolling out any further. That's the good news. The bad news is that the new workers starting up were getting the new binary, and those workers were failing to serve traffic.
And not only that: because they weren't serving traffic, their CPU usage was zero. So the AWS autoscaler decided to scale the service down, since it looked unused. Latency for our end users spiked, and it took us more than 10 minutes to remediate, which blew our SLO error budget.