Policies & Contracts
In Distributed Systems
Pune User Group 11th Feb 2023
Prathamesh Sonpatki
Last9.io
@prathamesh2_
As a Developer
- I want to write more code
- I want to fix all the (existing*) bugs
- I want to write bug free code
- I want to use latest and greatest tools
- I want to integrate with best in class tools
- But …
As a DevOps Engineer
- I want to make sure that infrastructure scales
- I want to make sure that application utilizes resources efficiently
- I want to make sure that infrastructure cost is not out of control
- I want to make sure ….
- But …
As a TL/EM/DL
- I want to make sure that my team meets the deadlines
- I want to make sure that features work as expected
- I want to make sure that code/infra is performant enough
- I want to make sure that tech debt is not out of control
- I want to make sure that team is motivated enough
- I want to make sure ….
- But …
As a Business/Product leader
- I want to make sure team velocity is not slow
- I want to make sure that external commitments are met
- I want to make sure that product is getting adopted
- I want to make sure that customer needs and expectations are
considered by the product and engineering team
- But …
But!
Rasmussen’s model of how accidents happen
BOOM!
Failures will happen…
- Every stakeholder has a boundary and a limit
- One stakeholder pushes another’s boundaries too far!
E.g.
- Business pushing for feature rollouts, resulting in too much tech debt
for engineering
- Engineering chasing perfection, slowing down delivery and velocity
Boundaries
Something that indicates or fixes a limit or extent
Boundaries…
- Team constraints
- Quality of work
- Time?
- Perfection?
- Cost?
- Pricing?
- Time to market?
Negotiation
- We will be able to release this, but with a few bugs
- We can do this but with increased AWS bill 💵
- We will be able to roll this out to new customers if the team works
overnight on the weekend
- We will be able to fix the bugs that are driving that main customer away
if we de-prioritize the feature pipeline
- We will be able to move faster if we add one more backend developer to
the team
Negotiation → Contracts
A written or spoken enforceable agreement
API Interface
A written agreement about what the endpoint will return
Promises of POST /users endpoint
- ✅ HTTP Status 201
- ❌ HTTP Status 400
- ⛔ HTTP Status 401
- ❓HTTP Status 404
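A minimal sketch of how the API contract above could be written down as data and checked. The name `honours_contract` and the descriptions attached to each status are illustrative, not from the talk, and treating 404 as outside the contract is only one possible reading of the ❓ marker.

```python
from http import HTTPStatus

# Outcomes that POST /users promises, taken from the slide.
# Assumption: the "?" next to 404 means it is NOT a documented outcome.
POST_USERS_CONTRACT = {
    HTTPStatus.CREATED: "user was created",            # 201
    HTTPStatus.BAD_REQUEST: "request was invalid",     # 400
    HTTPStatus.UNAUTHORIZED: "caller not authenticated",  # 401
}


def honours_contract(status_code: int) -> bool:
    """Return True if the status code is one the endpoint promised."""
    return status_code in POST_USERS_CONTRACT
```

A client or a contract test could call `honours_contract(response_status)` and flag any response, such as a 404, that the interface never promised.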
Runtime interface
A written agreement about how the endpoint
will behave at runtime
Promises of POST /users
- Uptime 90%
- 10% requests are allowed to fail
- Every Weekend, 20% requests are allowed to fail
- During peak hours, latency can vary between 1000 ms and 5000 ms
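The failure allowances above can be sketched as a small check. The 10% and 20% figures come from the slide; the function names, and the choice to treat Saturday and Sunday as "weekend", are assumptions made for illustration.

```python
from datetime import date

# Failure allowances from the slide: 10% normally, 20% on weekends.
WEEKDAY_FAILURE_ALLOWANCE = 0.10
WEEKEND_FAILURE_ALLOWANCE = 0.20


def failure_allowance(day: date) -> float:
    """Fraction of requests allowed to fail on the given day."""
    is_weekend = day.weekday() >= 5  # Monday is 0, so 5 and 6 are Sat/Sun
    return WEEKEND_FAILURE_ALLOWANCE if is_weekend else WEEKDAY_FAILURE_ALLOWANCE


def promise_kept(day: date, total_requests: int, failed_requests: int) -> bool:
    """True if observed failures stay within that day's allowance."""
    if total_requests == 0:
        return True  # assumption: no traffic means no promise was broken
    return failed_requests / total_requests <= failure_allowance(day)
```

With 150 failures out of 1000 requests, the promise holds on a Saturday (15% is within the 20% allowance) but is broken on a Monday (15% exceeds 10%).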
Runtime Interface
Service Level Objectives
Service Level Objectives
- Availability will be > 99.99% over 1 Day
- Latency will be < 4000 ms over 7 Days
- Uptime will be 98% over 7 Days
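One way to make an objective like "Availability will be > 99.99% over 1 Day" checkable is to store the target next to its window and compare it with measured request counts. This is a sketch under assumptions: the class name `AvailabilitySLO` and the idea of measuring availability as good requests over total requests are illustrative choices, not something the slides specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    """An availability objective such as '> 99.99% over 1 Day'."""

    target_percent: float
    window_days: int

    def is_met(self, good_requests: int, total_requests: int) -> bool:
        """Check request counts collected over the window against the target."""
        if total_requests == 0:
            return True  # assumption: no traffic counts as meeting the objective
        availability = 100.0 * good_requests / total_requests
        return availability > self.target_percent


# The first objective from the slide.
daily_availability = AvailabilitySLO(target_percent=99.99, window_days=1)
```

For one million requests in a day, 50 failures keep this objective (99.995%), while 1000 failures break it (99.9%).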
Service Level Objectives
- What is the error rate on this checkout flow?
- Can we promise 99.9% availability to this enterprise customer?
- Should we prioritize tech-debt over new features?
- Where should engineering focus for the next sprint?
- What’s the success rate of this payment gateway?
Runtime Interface
Promises
Service Level Objectives
Promises → Policies
Policies
- Set right expectations on what’s possible
- Buy in from multiple stakeholders
- Framework for communication between stakeholders
- External client communication
- Helps in build vs. buy decisions
Tiered Services
- P0, P1, P2
- Different expectations from different tiers
- Not every service is a priority!
Ladder of Reliability
- You can’t improve what you can’t measure
- First Baseline!
- Climb one rung at a time
- 90% -> 95% -> 99% ✅
- 90% -> 99.999% 😭
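The gap between the rungs is easier to see as allowed downtime. The arithmetic below is a sketch; the 30-day window and the function name are assumptions chosen for illustration, not figures from the talk.

```python
def allowed_downtime_minutes(target_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability target permits over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_percent) / 100.0


# The rungs named on the slide.
ladder = {target: allowed_downtime_minutes(target) for target in (90, 95, 99, 99.999)}
```

Over 30 days, 90% allows 4320 minutes of downtime (three full days), 95% allows 2160 minutes, and 99% allows 432 minutes. A target of 99.999% allows roughly 26 seconds, which is why jumping straight from 90% to 99.999% is so much harder than climbing one rung at a time.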
Rasmussen’s model of how accidents happen
Boundaries (still) exist
But not broken! The tension (contracts and policies) keeps them in balance