Reliability of Distributed Systems
Piyush Verma
June 22, 2019
Technology
Transcript
Reliability of Distributed Systems - Piyush Verma
2 Every product either dies a hero or lives long enough to hit Reliability issues.
3 Take it and Go: Customer Empathy; No Chooran; Cost to Everything; Architectures adapt to $ Priority
4 Sample Product: User sends SMS "Remind me to buy milk at 6:30 PM" to 53308; Service receives SMS; Cron gets activated when the time is right; Call the User
5 Sample Product: Inbound: User sends SMS "Remind me to buy milk at 6:30 PM" to 53308; Service receives SMS; Cron gets activated when the time is right; Call the User
6 Sample Product: Outbound: User sends SMS "Remind me to buy milk at 6:30 PM" to 53308; Service receives SMS; Cron gets activated when the time is right; Call the User
7 Inbound Connection
8 “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable” (Leslie Lamport, https://www.microsoft.com/en-us/research/uploads/prod/2016/12/Distribution.pdf)
9 Four Flavors of Failure: Disk, Network, CPU, Memory
10 Assumptions that do not hold: Network is Reliable; Intra-LAN latency is ~Zero; Network is Homogeneous; Network cost is Zero
11 Scope of Failures: Again
12 At least one server is online; All servers are below 100%; All servers are responding within x ms; All of the above
13 #1 Server is Unavailable
14 Replication Available
15 Replication Available
16 Available != Load Balanced
17 Load Balanced
18 Architecture of a Balancer
19 Who uses What? GCP, AWS, On-Prem, Azure
20 Trilemma
21 Trilemma: Available, Economical, Endurable
22 Available + Load Balanced
23 Load Balancing
24 Monty Hall Problem: Was Marilyn vos Savant right?
25 Server-side Load Balancing Example: Fabio
26 Look-aside Load Balancing Example: Consul/ DNS
27 Client-side Load Balancing Example: Ribbon
28 Client-side Load Balancing Example: Ribbon + Curator
29 Problems
30 Load Shedding
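The slide names load shedding without showing a mechanism. One common form, sketched here under assumptions, rejects new work outright once too many requests are already in flight, instead of queueing it. The names `LoadShedder`, `Overloaded`, and `max_in_flight` are hypothetical and do not come from the talk.

```python
import threading


class Overloaded(Exception):
    """Hypothetical error returned to callers when a request is shed."""


class LoadShedder:
    """Admit at most max_in_flight concurrent requests; reject the rest."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self._lock = threading.Lock()

    def handle(self, operation):
        # Admission check: fail fast rather than queue when saturated.
        with self._lock:
            if self.in_flight >= self.max_in_flight:
                raise Overloaded("too many requests in flight")
            self.in_flight += 1
        try:
            # The work itself runs outside the lock.
            return operation()
        finally:
            with self._lock:
                self.in_flight -= 1
```

A rejected caller gets a fast error it can retry later, which tends to be cheaper for the server than letting a queue grow without bound.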
31 “Load Balancing is almost Impossible” (Tyler McMullen, https://www.infoq.com/presentations/load-balancing/)
32 Alternate Reliability
33 Asynchronous Architectures
34 Asynchronous Architectures Example: RabbitMQ, Kafka, Kinesis, SQS
35 Sample Product: Outbound, Part 1: User sends SMS "Remind me to buy milk at 6:30 PM" to 53308; Service receives SMS; Cron gets activated when the time is right; Call the User
36 Outbound
37 Scope of Failure: Outbound
38 Retries
39 Retries: Transient Failures
40 Exponential Backoff: Short-term Transient Failures
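As a rough illustration of the exponential backoff slide, the sketch below retries an operation with waits that double on each attempt, up to a cap. The random jitter is an addition that the slide does not mention. `TransientError`, `backoff_delays`, `call_with_backoff`, and all default values are hypothetical.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delays(retries, base=0.5, cap=30.0):
    """Yield one capped exponential delay (in seconds) per retry."""
    for attempt in range(retries):
        yield min(cap, base * (2 ** attempt))


def call_with_backoff(operation, retries=5, base=0.5, cap=30.0,
                      sleep=time.sleep, jitter=random.uniform):
    """Call operation(); on TransientError wait, then try again.

    Each wait is drawn from [0, delay] so that many clients do not
    retry in lockstep. sleep and jitter are injectable for testing.
    """
    for delay in backoff_delays(retries, base, cap):
        try:
            return operation()
        except TransientError:
            sleep(jitter(0.0, delay))
    # Final attempt: any error now propagates to the caller.
    return operation()
```

In the sample product, `operation` could be the outbound call to the user. Backoff of this kind suits failures expected to clear within seconds.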
41 Circuit Breaking: Long Term Transient Failures
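When a dependency stays down for longer, retrying every request only adds load. A circuit breaker stops calling it for a while. The sketch below is a minimal version, assuming a consecutive-failure threshold and a fixed reset timeout. `CircuitBreaker`, `CircuitOpenError`, and the parameter names are hypothetical and do not come from the talk or from any library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency marked unhealthy")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # open or re-open
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail immediately and can fall back, for example by rescheduling the reminder call, rather than waiting on a timeout.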
42 Revisited
43 Dilemma: At-least Once, Exactly-Once, At-most Once
44 At-most once delivery
45 At-least once delivery
46 Exactly once delivery
47 Exactly once delivery = At-least-once Delivery + Exactly-once Processing
48 Keys to Only-Once delivery: Atomic, Window, Idempotent
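One way to read "at-least-once delivery plus exactly-once processing" is a consumer that remembers recent message ids and ignores redeliveries. The sketch below assumes every message carries a unique id and keeps only a bounded window of ids in memory. `IdempotentConsumer`, `deliver`, and `window_size` are hypothetical names. A real system would need a durable store, and recording the id would have to be atomic with the side effect.

```python
from collections import OrderedDict


class IdempotentConsumer:
    """Run handler once per message id within a bounded dedup window."""

    def __init__(self, handler, window_size=1000):
        self.handler = handler
        self.window_size = window_size
        self.seen = OrderedDict()  # message id -> None, oldest first

    def deliver(self, message_id, payload):
        """Return True if the handler ran, False for a duplicate."""
        if message_id in self.seen:
            return False  # redelivery: acknowledge and do nothing
        self.handler(payload)
        # Record only after the handler succeeds, so a crash leads to
        # a redelivery rather than a lost message.
        self.seen[message_id] = None
        if len(self.seen) > self.window_size:
            self.seen.popitem(last=False)  # forget the oldest id
        return True
```

The window bounds memory but also bounds the guarantee: a duplicate that arrives after its id has been evicted is processed again.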
49 Out-of-Order delivery
50 Revisit
51 Sample Product: User sends SMS "Remind me to buy milk at 6:30 PM" to 53308; Service receives SMS; Cron gets activated when the time is right; Call the User
52 Problems of State
53 Problems of State
54 Locked/Serialization
55 Master/ Master/ Slave
56 Clustering
57 Scalability: Data Replication, Reduced Communication, Logic/Data Decentralization
58 CAP Theorem [Sab topi pehna rahe: Hindi idiom, roughly "everyone is fooling you", with a pun on "topi" (cap)]
59 Trilemma: Available, Partition, Consistent
60 PACELC Theorem
61 Dilemma: Consistency, Latency
62 Revised Flow
63 What about Spanner? What about Calvin?
64 Reliable System: Scalable, Correct, Transparent
65 Access Transparency; Location Transparency; Concurrency Transparency; Failure Transparency
66 Size Scalability; Geographical Scalability
67 Summary: Consistent, Available, Economical, Low Latency
68 All Put Together
69 Embrace your Bugs; No Silver Bullet; Cost to Everything; Product First
70 Thanks. Does anyone have any questions? [email protected]