• Assist in “Digital Transformation”
• Share Architectural Best Practices
• Share Operational Best Practices
• Facilitate GameDays
Quite possibly the only Solutions Architect who became a Technical Account Manager at AWS.
@HoReaL @GremlinInc
bugs, fix bugs, build new features, find bugs, fix bugs, build new features, P0 hard down, on-call hero saves the day, build new features, P1 incident, fix bugs, P0 hard down, customer complains, fix bugs, new bugs show up, fix new bugs, P2 issue, build new features, P1 incident, fix bugs, product release, P2 issue comes back as P0 hard down… Frustrated customers look for alternatives, churn rate increases, business struggles…
Municipal Transportation Agency is stepping down amid the fallout from a 10-hour meltdown that choked the city on Friday, drawing anger from City Hall.
How do we remove/patch hosts? Can we replace a smaller host with a bigger one? Can we scale out/in to add/remove hosts? Is there an auto-healing mechanism if hosts fail health checks? Is “S.T.O.N.I.T.H.” in our toolbox?
threshold do we trigger scaling? How long does it take to scale? What is the user experience upon encountering failure? How can we improve this user experience?
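One way to make the scaling questions above measurable during a GameDay is to record load and host count over time and compute how long the system took to add capacity after the threshold was crossed. The sketch below is only an illustration: the function name `time_to_scale`, the sample format, and the numbers are all hypothetical, and real data would come from your own monitoring system.

```python
def time_to_scale(samples, threshold):
    """Seconds between the first threshold breach and the first capacity increase.

    `samples` is a time-ordered list of (t_seconds, load, host_count) tuples.
    Returns None if the threshold is never breached or capacity never grows.
    This sample format is an assumption made for illustration only.
    """
    breach = None  # (time of breach, host count at breach)
    for t, load, hosts in samples:
        if breach is None:
            if load >= threshold:
                breach = (t, hosts)
        elif hosts > breach[1]:
            return t - breach[0]
    return None


# Made-up observations: (seconds since start, CPU %, number of hosts)
observations = [(0, 40, 2), (60, 75, 2), (120, 82, 2), (180, 85, 3)]
print(time_to_scale(observations, threshold=70))
```

Comparing this number against what users experienced during the same window is one way to decide whether the threshold or the scaling speed needs to change.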
Control (TC)
$ tc qdisc add dev eth0 root netem delay 1000ms 500ms
iptables
$ iptables -A OUTPUT -p tcp -d 157.240.0.0/16 -j DROP
PF (Mac)
block quick from any to 157.240.0.0/16
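A fault injected by hand is easy to forget to remove. As a minimal sketch, the tc delay above can be wrapped so that the matching cleanup always runs when the experiment ends. The helper names `netem_delay_commands` and `injected_latency` are hypothetical, the sketch defaults to a dry run that only prints the commands, and running it for real assumes a Linux host, root privileges, and an interface named eth0.

```python
import shlex
import subprocess
from contextlib import contextmanager


def netem_delay_commands(dev="eth0", delay_ms=1000, jitter_ms=500):
    """Build the argv lists that add and later remove a netem delay on `dev`."""
    add = ["tc", "qdisc", "add", "dev", dev, "root", "netem",
           "delay", f"{delay_ms}ms", f"{jitter_ms}ms"]
    delete = ["tc", "qdisc", "del", "dev", dev, "root", "netem"]
    return add, delete


@contextmanager
def injected_latency(dev="eth0", delay_ms=1000, jitter_ms=500, dry_run=True):
    """Add latency for the duration of the `with` block, then remove it."""
    add, delete = netem_delay_commands(dev, delay_ms, jitter_ms)

    def run(argv):
        if dry_run:
            # Only show what would be executed; nothing is changed.
            print("DRY RUN:", shlex.join(argv))
        else:
            subprocess.run(argv, check=True)

    run(add)
    try:
        yield
    finally:
        # Runs even if the experiment raises, so the delay is not left behind.
        run(delete)


with injected_latency(dry_run=True):
    print("observe business metrics and user experience here")
```

Passing `dry_run=False` would execute the commands for real, so try that only on a host where a one-second delay is acceptable.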
Network Failures Unknown → Known FOCUS APPROACH DESIRE Business Metrics User Experience Automated Experiments New Manual Experiments Verifiable Resilience!