There's no Clusterf*ck without a Cluster
Dan Hopkins
April 19, 2014
Transcript
There's No Clusterf*ck without a Cluster
How @GoVictorOps went from unicorns and broken to boring and working

Premature availabilization?
• Connect you with your monitors
• Harass you when stuff breaks
@boulderDanH

Availability is our DNA
• Scala
• Akka
• Kafka
• Shard key

What is clustering?

An online encyclopedia says
• Computers working together
(appeals to authority)

A dictionary says
• clus·ter noun \ˈkləs-tər\ a number of similar things that occur together
(includes pronunciation for legitimacy)

Our definition
• Who is currently in the cluster?
• Tell me when nodes are coming and going
• High Availability / scaling
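The definition above comes down to two operations: ask who is in the cluster right now, and get told when a node joins or leaves. A minimal in-memory sketch of that contract follows. The names `Membership`, `join`, `leave`, and `subscribe` are hypothetical and are not taken from the talk or from any library.

```python
from typing import Callable, List, Set, Tuple

Event = Tuple[str, str]  # ("joined" or "left", node name)


class Membership:
    """In-memory stand-in for a cluster membership service (hypothetical)."""

    def __init__(self) -> None:
        self._members: Set[str] = set()
        self._listeners: List[Callable[[Event], None]] = []

    def members(self) -> Set[str]:
        # "Who is currently in the cluster?" Returns a copy of the member set.
        return set(self._members)

    def subscribe(self, listener: Callable[[Event], None]) -> None:
        # "Tell me when nodes are coming and going."
        self._listeners.append(listener)

    def join(self, node: str) -> None:
        if node not in self._members:
            self._members.add(node)
            self._notify(("joined", node))

    def leave(self, node: str) -> None:
        if node in self._members:
            self._members.remove(node)
            self._notify(("left", node))

    def _notify(self, event: Event) -> None:
        for listener in self._listeners:
            listener(event)


# Usage: record every event and inspect the final member set.
cluster = Membership()
events: List[Event] = []
cluster.subscribe(events.append)
cluster.join("node-a")
cluster.join("node-b")
cluster.leave("node-a")
```

Everything here lives in one process, so every caller sees the same member set. The hard part the rest of the talk deals with is keeping that view consistent across machines.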

Requirements 1.0
1. Logical actor tree
2. Service discovery
3. Lead me to success

Logical actor tree
• Failover
• Hand off

Service discovery
• “cluster://user/victorops/broadcaster” ! “hello”
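The slide sends "hello" to a logical path without naming a machine. One way to read that is a lookup table from logical actor paths to whichever node currently hosts the actor. The sketch below assumes that reading. `ClusterRouter`, `register`, `resolve`, and `tell` are hypothetical names, and the node addresses are made up for illustration.

```python
from typing import Dict, List, Tuple


class ClusterRouter:
    """Maps logical actor paths to the node currently hosting them (hypothetical)."""

    PREFIX = "cluster://"

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}
        self.delivered: List[Tuple[str, str]] = []  # (physical target, message)

    def register(self, logical_path: str, node_address: str) -> None:
        # Registering the same path again models failover or hand off:
        # the logical path stays the same, only the hosting node changes.
        self._locations[logical_path] = node_address

    def resolve(self, uri: str) -> str:
        if not uri.startswith(self.PREFIX):
            raise ValueError(f"not a cluster URI: {uri}")
        logical_path = "/" + uri[len(self.PREFIX):]
        # Raises KeyError when nothing is registered under that path.
        return self._locations[logical_path] + logical_path

    def tell(self, uri: str, message: str) -> str:
        # Stand-in for a remote send: record where the message would go.
        target = self.resolve(uri)
        self.delivered.append((target, message))
        return target


# Usage: the sender only ever names the logical path.
router = ClusterRouter()
router.register("/user/victorops/broadcaster", "akka.tcp://app@10.0.0.1:2552")
first = router.tell("cluster://user/victorops/broadcaster", "hello")

# Hand off to another node; callers do not change.
router.register("/user/victorops/broadcaster", "akka.tcp://app@10.0.0.2:2552")
second = router.tell("cluster://user/victorops/broadcaster", "hello")
```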

Tradeoffs are everywhere
Vector clocks are totally cool
Async consensus?

Implementation
• Routers / Patterns
• Native = Truth

Actor state
• Easy and Tempting
• Painful to unwind

What could go wrong?
• Partitions are permanent
• Want some config? How about six!
  ◦ failure-detector.threshold x 2
  ◦ failure-detector.min-std-deviation x 2
  ◦ failure-detector.acceptable-heartbeat-pause x 2
• Hazelcast uses hazelcast.max.no.heartbeat.seconds
• ZooKeeper uses “session timeout”
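The three Akka settings listed above are the knobs of a phi accrual failure detector: a suspicion score (phi) grows the longer a heartbeat is overdue, and the node is marked unreachable once phi crosses the threshold. The sketch below is a simplified illustration of that calculation using a logistic approximation of the normal distribution. It is not Akka's code, and the function name, parameter names, and sample values are assumptions chosen for the example.

```python
import math
from statistics import mean, pstdev
from typing import Sequence


def phi(time_since_last_ms: float,
        intervals_ms: Sequence[float],
        min_std_deviation_ms: float,
        acceptable_pause_ms: float) -> float:
    """Suspicion score: -log10 of the estimated chance the heartbeat is merely late."""
    # acceptable-heartbeat-pause shifts the expected arrival time later.
    mu = mean(intervals_ms) + acceptable_pause_ms
    # min-std-deviation keeps very regular heartbeats from making the
    # detector oversensitive to small delays.
    sigma = max(pstdev(intervals_ms), min_std_deviation_ms)
    y = (time_since_last_ms - mu) / sigma
    # Logistic approximation: P(heartbeat arrives later than now) ~ 1 / (1 + e^x).
    x = y * (1.5976 + 0.070566 * y * y)
    if x > 700:  # avoid overflow in exp(); log10(1 + e^x) ~ x / ln(10) here
        return x / math.log(10)
    return math.log1p(math.exp(x)) / math.log(10)


# Usage with made-up numbers: heartbeats have arrived every 1000 ms.
history = [1000, 1000, 1000, 1000, 1000]
THRESHOLD = 8.0  # illustrative value for failure-detector.threshold

on_time = phi(1000, history, min_std_deviation_ms=100, acceptable_pause_ms=0)
late = phi(1500, history, min_std_deviation_ms=100, acceptable_pause_ms=0)
very_late = phi(2000, history, min_std_deviation_ms=100, acceptable_pause_ms=0)
tolerated = phi(1500, history, min_std_deviation_ms=100, acceptable_pause_ms=3000)

suspected = very_late > THRESHOLD
```

With these inputs a heartbeat that is 500 ms overdue is suspicious but still under the threshold, one that is 1000 ms overdue crosses it, and a 3000 ms acceptable pause makes the same 500 ms delay a non-event. That sensitivity to three interacting numbers, configured in two places, is the slide's complaint.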

More picking on Akka
• Logging during failures is sparse
• Remoting / Failure detection weren’t bulkheaded

Recap
1. Logical actor tree
2. Service discovery
3. Lead me to success

Requirements 2.0
1. Member lists
2. Easy to configure, ability to add machines w/o config
3. Pass remoting address around

What is Hazelcast?
• Distributed maps & locks
• Multicast (IGMP)

Implementation
akka.remote.quarantine-systems-for = "off"
akka.remote.gate-invalid-addresses-for = 0 s
src: akka-devel

• Publish Akka address using a map
• Detect nodes joining / leaving cluster
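The two bullets above describe a pattern: each node writes its Akka remoting address into a shared map, and every other node listens for map changes to learn who has joined or left. The sketch below imitates that pattern in a single process. `SharedMap` is a stand-in for a distributed map with entry listeners and is not Hazelcast's API. `Node` and the addresses are likewise hypothetical.

```python
from typing import Callable, Dict, List

Listener = Callable[[str, str, str], None]  # (kind, node_id, address)


class SharedMap:
    """Single-process stand-in for a distributed map with entry listeners."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._listeners: List[Listener] = []

    def add_listener(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value
        for listener in self._listeners:
            listener("added", key, value)

    def remove(self, key: str) -> None:
        value = self._data.pop(key, None)
        if value is not None:
            for listener in self._listeners:
                listener("removed", key, value)

    def items(self) -> Dict[str, str]:
        return dict(self._data)


class Node:
    """Publishes its own address and tracks peers through map events."""

    def __init__(self, node_id: str, akka_address: str, shared: SharedMap) -> None:
        self.node_id = node_id
        self.akka_address = akka_address
        self.peers: Dict[str, str] = {}
        self._shared = shared

    def start(self) -> None:
        # Learn about nodes that published before this one, then announce.
        for key, value in self._shared.items().items():
            if key != self.node_id:
                self.peers[key] = value
        self._shared.add_listener(self._on_event)
        self._shared.put(self.node_id, self.akka_address)

    def stop(self) -> None:
        self._shared.remove(self.node_id)

    def _on_event(self, kind: str, key: str, value: str) -> None:
        if key == self.node_id:
            return  # ignore events about this node itself
        if kind == "added":
            self.peers[key] = value
        else:
            self.peers.pop(key, None)


# Usage: two nodes find each other with no per-node configuration.
shared = SharedMap()
a = Node("a", "akka.tcp://app@10.0.0.1:2552", shared)
b = Node("b", "akka.tcp://app@10.0.0.2:2552", shared)
a.start()
b.start()
peers_of_a_while_b_up = dict(a.peers)
b.stop()
```

In this sketch every listener fires synchronously, so all nodes agree. In a real deployment each node's view is updated independently, which leaves room for the views to drift apart.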

What went wrong?
• Multicast
• In memory
• Cluster Client
• Member list isn’t consistent across cluster

Recap on requirements 2.0
1. Member lists
2. Easy to configure
3. Pass remoting address around

Requirements 3.0
1. Member list is consistent
2. Cluster clients are first class

Cluster Membership
• Consistent - Zk
• Probably consistent - Gossip
• YOLO consistency - Hazelcast

… no seriously, this is the logo

What is ZooKeeper?
• Clustered, consistent file system
• API is focused on building distributed concepts

Implementation
• Cluster Membership = EPHEMERAL
• Leader Election = SEQUENTIAL
• “Cluster” = EPHEMERAL_SEQUENTIAL
• Store akka addresses in ephemeral nodes
• Curator project
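The mapping above relies on two ZooKeeper node properties: an ephemeral node disappears when the session that created it ends, and a sequential node gets a monotonically increasing, zero-padded counter appended to its name. Membership is then the list of children, and the leader is the child with the lowest sequence number. The sketch below imitates that behaviour in memory. `FakeZk` and its methods are hypothetical stand-ins, not the ZooKeeper or Curator API, and the `/cluster` path and addresses are made up.

```python
from typing import Dict, List, Optional, Tuple


class FakeZk:
    """In-memory imitation of ephemeral sequential znodes (hypothetical)."""

    def __init__(self) -> None:
        self._nodes: Dict[str, Tuple[str, str]] = {}  # path -> (data, session)
        self._counters: Dict[str, int] = {}           # parent -> next sequence

    def create_ephemeral_sequential(self, parent: str, prefix: str,
                                    data: str, session: str) -> str:
        seq = self._counters.get(parent, 0)
        self._counters[parent] = seq + 1
        # Sequence numbers are appended as a 10-digit zero-padded counter.
        path = f"{parent}/{prefix}{seq:010d}"
        self._nodes[path] = (data, session)
        return path

    def children(self, parent: str) -> List[str]:
        # Sorted by path; with a shared prefix this is sequence order.
        return sorted(p for p in self._nodes if p.startswith(parent + "/"))

    def data(self, path: str) -> str:
        return self._nodes[path][0]

    def close_session(self, session: str) -> None:
        # Ephemeral nodes vanish with the session that created them.
        doomed = [p for p, (_, s) in self._nodes.items() if s == session]
        for path in doomed:
            del self._nodes[path]


def members(zk: FakeZk, parent: str) -> List[str]:
    """Akka addresses of all live members, in join order."""
    return [zk.data(p) for p in zk.children(parent)]


def leader(zk: FakeZk, parent: str) -> Optional[str]:
    """The member holding the lowest sequence number, if any."""
    kids = zk.children(parent)
    return zk.data(kids[0]) if kids else None


# Usage: three nodes register; the first one's session later expires.
zk = FakeZk()
first_path = zk.create_ephemeral_sequential(
    "/cluster", "member-", "akka.tcp://app@10.0.0.1:2552", session="s1")
zk.create_ephemeral_sequential(
    "/cluster", "member-", "akka.tcp://app@10.0.0.2:2552", session="s2")
zk.create_ephemeral_sequential(
    "/cluster", "member-", "akka.tcp://app@10.0.0.3:2552", session="s3")
leader_before = leader(zk, "/cluster")
zk.close_session("s1")
leader_after = leader(zk, "/cluster")
```

Because every client reads the same ordered child list from the ensemble, all nodes derive the same member list and the same leader, which is the consistency that Requirements 3.0 asks for.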

The Good
• Reputation
• Strong Consistency
• Cluster clients / Service Discovery

What was / is hard?
• Twitter’s Zk library
• External Cluster Manager

The final tally
• Solid concept of membership
• Keep things simple
• Log / Graph / Monitor everything

Questions?