Slide 1

Slide 1 text

Policies & Contracts in Distributed Systems
Pune User Group, 11th Feb 2023
Prathamesh Sonpatki, Last9.io, @prathamesh2_

Slide 2

Slide 2 text

As a Developer
- I want to write more code
- I want to fix all the (existing*) bugs
- I want to write bug-free code
- I want to use the latest and greatest tools
- I want to integrate with best-in-class tools
- But …

Slide 3

Slide 3 text

As a DevOps Engineer
- I want to make sure that the infrastructure scales
- I want to make sure that the application utilizes resources efficiently
- I want to make sure that infrastructure cost is not out of control
- I want to make sure …
- But …

Slide 4

Slide 4 text

As a TL/EM/DL
- I want to make sure that my team meets the deadlines
- I want to make sure that features work as expected
- I want to make sure that code/infra is performant enough
- I want to make sure that tech debt is not out of control
- I want to make sure that the team is motivated enough
- I want to make sure …
- But …

Slide 5

Slide 5 text

As a Business/Product leader
- I want to make sure team velocity is not slow
- I want to make sure that external commitments are met
- I want to make sure that the product is getting adopted
- I want to make sure that customer needs and expectations are considered by the product and engineering teams
- But …

Slide 6

Slide 6 text

But!

Slide 7

Slide 7 text

Rasmussen’s model of how accidents happen

Slide 8

Slide 8 text

BOOM!

Slide 9

Slide 9 text

Failures will happen
- Every stakeholder has a boundary and a limit
- One of the stakeholders pushes the others’ boundaries too much! E.g.:
  - Business pushing for feature rollouts, resulting in too much tech debt for engineering
  - Engineering chasing perfection, slowing down delivery and velocity

Slide 10

Slide 10 text

Boundaries
Something that indicates or fixes a limit or extent

Slide 11

Slide 11 text

Boundaries…
- Team constraints
- Quality of work
- Time?
- Perfection?
- Cost?
- Pricing?
- Time to market?

Slide 12

Slide 12 text

Negotiation
- We will be able to release this, but with a few bugs
- We can do this, but with an increased AWS bill 💵
- We will be able to roll this out to new customers if the team works overnight on the weekend
- We will be able to fix the bugs causing that main customer to leave if we de-prioritize the feature pipeline
- We will be able to move faster if we add one more backend developer to the team

Slide 13

Slide 13 text

Negotiation → Contracts
A written or spoken enforceable agreement

Slide 14

Slide 14 text

API Interface
A written agreement about what the endpoint will return

Slide 15

Slide 15 text

Promises of the POST /users endpoint
- ✅ HTTP Status 201
- ❌ HTTP Status 400
- ⛔ HTTP Status 401
- ❓ HTTP Status 404
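One way to read the promises above is as a closed set of status codes that a client may rely on. Below is a minimal sketch in Python, assuming the standard HTTP meanings of these codes (201 Created, 400 Bad Request, 401 Unauthorized, 404 Not Found). `PROMISED_STATUSES` and `honours_contract` are hypothetical names, and the slide itself marks 404 with a question mark, so whether it belongs in the promise is an open call.

```python
from http import HTTPStatus

# Status codes listed on the slide as promises of POST /users.
# Including 404 is an assumption; the slide marks it with a question mark.
PROMISED_STATUSES = {
    HTTPStatus.CREATED,       # 201
    HTTPStatus.BAD_REQUEST,   # 400
    HTTPStatus.UNAUTHORIZED,  # 401
    HTTPStatus.NOT_FOUND,     # 404
}


def honours_contract(status_code: int) -> bool:
    """Return True if the response status is one the contract promises."""
    try:
        return HTTPStatus(status_code) in PROMISED_STATUSES
    except ValueError:
        # Not a status code that Python's HTTPStatus enum knows about.
        return False
```

A response such as 500 falls outside the set, which is the point of the contract: anything not promised is a breach that the consumer can flag.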

Slide 16

Slide 16 text

Runtime Interface
A written agreement about how the endpoint will behave at runtime

Slide 17

Slide 17 text

Promises of POST /users
- Uptime 90%
- 10% of requests are allowed to fail
- Every weekend, 20% of requests are allowed to fail
- During peak hours, latency can vary between 1000 ms and 5000 ms
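The failure-rate promises above can be checked mechanically. Below is a rough sketch, assuming request counts are already collected per day. `allowed_failure_ratio` and `promise_kept` are hypothetical helpers, not part of any tool named in the talk, and the peak-hour latency promise is left out for brevity.

```python
from datetime import date


def allowed_failure_ratio(day: date) -> float:
    """Share of requests allowed to fail: 20% on weekends, 10% otherwise."""
    # date.weekday() returns 5 for Saturday and 6 for Sunday.
    return 0.20 if day.weekday() >= 5 else 0.10


def promise_kept(day: date, total: int, failed: int) -> bool:
    """Return True if the day's failure ratio stays within the promise."""
    if total == 0:
        return True  # no traffic, so nothing was broken
    return failed / total <= allowed_failure_ratio(day)
```

With these numbers, 150 failures out of 1000 requests keeps the promise on a Saturday but breaks it on a Monday.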

Slide 18

Slide 18 text

Runtime Interface
Service Level Objectives

Slide 19

Slide 19 text

Service Level Objectives
- Availability will be > 99.99% over 1 day
- Latency will be < 4000 ms over 7 days
- Uptime will be 98% over 7 days
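Each availability or uptime objective above implies how much failure its window can absorb. Below is a small worked sketch, assuming the target is read as a fraction of wall-clock time in the window. `allowed_downtime` is a hypothetical helper introduced only for illustration.

```python
from datetime import timedelta


def allowed_downtime(target_percent: float, window: timedelta) -> timedelta:
    """Downtime a window can absorb while still meeting the target."""
    return window * (1 - target_percent / 100)


# The two time-based objectives from the slide.
daily = allowed_downtime(99.99, timedelta(days=1))
weekly = allowed_downtime(98, timedelta(days=7))
print(daily.total_seconds(), weekly.total_seconds())
```

Under that reading, 99.99% over 1 day leaves roughly 8.64 seconds of downtime, while 98% over 7 days leaves a little over 3 hours and 20 minutes.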

Slide 20

Slide 20 text

Service Level Objectives
- What is the error rate on this checkout flow?
- Can we promise 99.9% availability to this enterprise customer?
- Should we prioritize tech debt over new features?
- Where should engineering focus for the next sprint?
- What’s the success rate of this payment gateway?

Slide 21

Slide 21 text

Runtime Interface Promises
Service Level Objectives

Slide 22

Slide 22 text

Promises → Policies

Slide 23

Slide 23 text

Policies
- Set the right expectations on what’s possible
- Buy-in from multiple stakeholders
- Framework for communication between stakeholders
- External client communication
- Help in Build vs. Buy decisions

Slide 24

Slide 24 text

Tiered Services
- P0, P1, P2
- Different expectations from different tiers
- Not every service is a priority!

Slide 25

Slide 25 text

Ladder of Reliability
- You can’t improve what you can’t measure
- First, baseline!
- Climb one rung at a time
- 90% -> 95% -> 99% ✅
- 90% -> 99.999% 😭
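The arithmetic behind the ladder shows why small steps are more realistic than one big jump. Below is a minimal sketch, assuming reliability is measured as the share of successful requests. `allowed_failures` is a hypothetical helper and the one-million-request volume is an illustrative figure, not from the talk.

```python
def allowed_failures(target_percent: float, total_requests: int) -> int:
    """Number of requests that may fail while still meeting the target."""
    return round(total_requests * (100 - target_percent) / 100)


# The rungs from the slide, plus the 99.999% jump for comparison.
for target in (90, 95, 99, 99.999):
    budget = allowed_failures(target, 1_000_000)
    print(f"{target}% -> {budget} failures allowed per million requests")
```

Per million requests, 90% leaves room for 100,000 failures, 95% for 50,000 and 99% for 10,000, while 99.999% leaves only 10. Each rung roughly halves or quarters the room for failure; the direct jump shrinks it by a factor of 10,000.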

Slide 26

Slide 26 text

Rasmussen’s model of how accidents happen

Slide 27

Slide 27 text

Boundaries (still) exist
But not broken!
The tension (contracts and policies) keeps them in balance

Slide 28

Slide 28 text

💰🌈👫

Slide 29

Slide 29 text

Thanks 🤝
Prathamesh Sonpatki
9⃣ Last9.io
prathamesh.tech
🐧 twitter.com/prathamesh2_
🐘 hachyderm.io/@Prathamesh
“Last9 of Reliability” Discord