When building resilient, fault-tolerant, scalable systems, we tend to focus on the particular technologies involved. Can this component scale horizontally? Is Samza better than Storm? Is this library thread-safe? Those questions matter to the stability of the system, but they matter less than the people building it. Humans choose the stack, write the code, and write the bugs, too. They create the weird edge cases that cause the system to fall over at the worst time.
At New Relic we’ve taken an unusual approach to building software: we draw heavily on biological metaphors like mutation and natural selection, and we define our architecture in a human-centric way. Rather than trust a few armchair architects to make the decisions, we put the power in the hands of the teams wrestling with the code. We have many strategies to ensure cohesiveness across the architecture and scalability for the business, the engineering organization, and the software, but it takes a little leap of faith and a lot of trust to move to a process like ours.
I’ll share how our process works and how we manage growth without going off the rails, all while increasing system stability. In the end, being a good architect is more about working with humans than it is about working with code. You can trust your engineers; you just may not believe it yet.
Delivered at the O'Reilly Software Architecture Conference, April 2016.