Our systems are growing, not only in size but also in complexity. There are more and more relationships between systems, often via fragile network connections. We’re increasingly integrating with systems outside of our control. These systems are also more dynamic: even as we raise our expectations of uptime, we keep increasing the communication entropy in the system, and many systems now change by the hour. And this captures only a portion of the complexity. One question keeps getting asked that we struggle to answer: what is happening?
What does our system actually look like right now? What did it look like an hour ago? What is it going to look like in another hour? And that is just the structure of the system. There are many more dimensions we’re interested in. Is our system healthy? What does healthy even mean? Do status, state, and health mean the same thing to you, your boss, operations, and engineering? And most importantly, what does all of this mean to our users?
These questions have led to a tooling explosion. We will walk through some of these tools and how they can help. We’ll also call out the gaps that appear when these tools are put to practical use. We’ll discuss the perspectives and categories of tooling that we need. We’ll finish by focusing on the foundational actions we can take now to best position ourselves to adapt as our systems and tools change.