Chaos is in the name, but it can be controlled.
It doesn’t have to be dangerous. Can be thoughtful and safe.
It’s useful not only in production, but also good in staging and other earlier environments.
It doesn’t have to be a giant cross-team entire company exercise.
You can start small.
Shutdown your stateless hosts. See that they come back up healthy.
What happens when you drop connection to your data store? Let’s verify the retry and timeout behaviors.
Auto-scaling may sound easy, but there are nuances that you have to really experience it in order to know.
Ensure alerts are triggered appropriately, and that the receiver has sufficient information to work with. Take signal to noise into consideration.
Past Incidents, with your team, with a third party dependency. Are good lessons to share with your team. Re-run those scenarios.