Slide 3
chronology:
- in May we had an incident. we knew it had just started because our end-to-end checks, which run every
minute to make sure that writes made into Honeycomb are readable, started failing.
- over the ~24 minutes of the incident, we identified that our MySQL RDS instance had suddenly started throwing "too
many connections!" errors (error 1040), which in turn was causing our API and web app to refuse the majority of
incoming requests.
- i’ll get into details later, but it was one of those incidents where we were able to identify the issue, come up with one
hypothesis to address it, roll out a fix, realize it didn't help, come up with another hypothesis... before the system
seemed to right itself.
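
for anyone who hasn't seen a write-then-read end-to-end check like the one mentioned above, here is a minimal sketch of the idea in Python. it is not our actual check: `run_e2e_check`, `InMemoryStore`, the timeout, and the poll interval are all hypothetical names and values made up for illustration. a real check would write through the public API and read back through the query path, and a scheduler would run it once a minute.

```python
import time
import uuid
from typing import Callable


def run_e2e_check(
    write: Callable[[str], None],
    read: Callable[[str], bool],
    timeout_s: float = 30.0,        # assumed value, not from the talk
    poll_interval_s: float = 1.0,   # assumed value, not from the talk
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Write a unique marker, then poll until it is readable or the timeout passes."""
    marker = f"e2e-{uuid.uuid4()}"
    try:
        write(marker)
    except Exception:
        return False  # the write path itself is failing
    deadline = clock() + timeout_s
    while True:
        try:
            if read(marker):
                return True  # the write became readable: check passes
        except Exception:
            pass  # treat a read error as "not readable yet" and keep polling
        if clock() >= deadline:
            return False  # never became readable in time: check fails
        sleep(poll_interval_s)


class InMemoryStore:
    """Hypothetical stand-in for the real write and read paths."""

    def __init__(self, accept_writes: bool = True) -> None:
        self.accept_writes = accept_writes
        self.events: set[str] = set()

    def write(self, marker: str) -> None:
        if not self.accept_writes:
            # Simulates an API that refuses requests, as during the incident.
            raise ConnectionError("write refused")
        self.events.add(marker)

    def read(self, marker: str) -> bool:
        return marker in self.events


if __name__ == "__main__":
    healthy = InMemoryStore()
    refusing = InMemoryStore(accept_writes=False)
    print("healthy store passes:", run_e2e_check(healthy.write, healthy.read))
    print("refusing store passes:", run_e2e_check(refusing.write, refusing.read))
```

the useful property is that the check exercises the same path a customer does, so it fails when requests are being refused regardless of which component underneath is the cause.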