Stop firefighting and start fireproofing! There are many tools that make on-call easier and increase availability, but we'll focus mostly on a few principles and design patterns that help make your systems more robust.
link failures/day
◎ ~59k packet loss per failure

Google Chubby [2]
Over 700 days of operation:
◎ 4 outages due to network maintenance
◎ 2 outages due to “suspected network connectivity problems”

Jeff Dean [3]
New Google cluster’s first year typically sees:
◎ 40-80 machines have 50% packet loss
◎ 8 network maintenances
◎ 3 router failures

[1] Understanding Network Failures in Data Centers (Gill, Jain, Nagappan)
[2] The Chubby lock service for loosely-coupled distributed systems (Mike Burrows)
[3] Design Lessons and Advice from Building Large Scale Distributed Systems (Jeff Dean)
implement
◎ message loss
◎ UDP, Kafka

At Least Once
◎ no message loss
◎ potential double processing
◎ Kafka, SQS, RabbitMQ

Exactly Once*
◎ what you really want
◎ significant coordination overhead
◎ ?
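One common way to live with the "potential double processing" of at-least-once delivery is to make the consumer idempotent. The following is only a minimal sketch of that idea, not something taken from the slides: `Message`, `message_id`, and `IdempotentConsumer` are hypothetical names, and it assumes the producer attaches a unique ID to every message. It also keeps the set of seen IDs in memory; a real system would need to store that set durably and update it atomically with the side effect.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    message_id: str  # hypothetical unique ID assigned by the producer
    amount: int


@dataclass
class IdempotentConsumer:
    """Applies each message's effect once, even if it is delivered repeatedly."""
    balance: int = 0
    seen_ids: set = field(default_factory=set)

    def handle(self, msg: Message) -> bool:
        # A redelivered message is skipped so its side effect is not repeated.
        if msg.message_id in self.seen_ids:
            return False
        self.balance += msg.amount
        self.seen_ids.add(msg.message_id)
        return True


consumer = IdempotentConsumer()
# "m1" arrives twice, as can happen under at-least-once delivery.
deliveries = [Message("m1", 10), Message("m2", 5), Message("m1", 10)]
results = [consumer.handle(m) for m in deliveries]
```

Here the second delivery of `m1` is acknowledged but ignored, so the balance reflects each message once.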
again. Fail again. Fail better. (Samuel Beckett)

◎ Idempotency: It’s déjà vu all over again.
◎ Circuit Breakers: I promise you’re not my fallback option.
◎ Monitoring: You can’t fix what you can’t see.
◎ Queues: Not just for the Brits.
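To give the circuit breaker item above a concrete shape, here is a rough sketch of the general pattern: after a number of consecutive failures, stop calling the dependency and fail fast until a timeout has passed, then let one trial call through. The class name, the `max_failures` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for illustration; this is not the implementation from the talk or from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so one trial call is allowed through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each call to the remote dependency, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as the signal to use a fallback or return an error immediately.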