Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
There's no Clusterf*ck without a Cluster
Search
Dan Hopkins
April 19, 2014
Programming
190
1
Share
There's no Clusterf*ck without a Cluster
Dan Hopkins
April 19, 2014
More Decks by Dan Hopkins
See All by Dan Hopkins
Actors: not just for movies anymore
danielhopkins
1
160
Other Decks in Programming
See All in Programming
AlarmKitで明後日起きれるアラームアプリを作る
trickart
0
160
Why Laravel apps break—Mastering the fundamentals to keep them maintainable
kentaroutakeda
1
300
不変条件と整合性境界—ビジネスが決める設計判断と実現パターン / Invariants and Consistency Boundaries
nrslib
10
2.8k
Spec-Driven Development with AI-Agents: From High-Level Requirements to Working Software
antonarhipov
2
380
AIエージェントと協働するCLI開発 — BunとOpenClawで学んだこと
yoshikouki
1
220
AIチームを指揮するOSS「TAKT」活用術 / How to Use “TAKT,” an OSS Tool for Orchestrating AI Teams
nrslib
6
720
Lemonade + Foundry Toolkit でお手軽アプリ開発
seosoft
1
200
AI時代になぜ書くのか
mutsumix
0
470
LLM Plugin for Node-REDの利用方法と開発について
404background
0
130
ECR拡張スキャンでSBOMを収集して サプライチェーン攻撃の影響調査を 爆速で終わらせてみた
akihisaikeda
2
200
横断組織出身のQAEがインプロセスQAEでつまずいたこと・活かせたこと
ty89
0
430
Moments When Things Go Wrong
aurimas
3
120
Featured
See All Featured
Building Adaptive Systems
keathley
44
3k
Technical Leadership for Architectural Decision Making
baasie
3
380
Why Mistakes Are the Best Teachers: Turning Failure into a Pathway for Growth
auna
0
150
brightonSEO & MeasureFest 2025 - Christian Goodrich - Winning strategies for Black Friday CRO & PPC
cargoodrich
3
710
Everyday Curiosity
cassininazir
0
210
Six Lessons from altMBA
skipperchong
29
4.3k
JavaScript: Past, Present, and Future - NDC Porto 2020
reverentgeek
52
5.9k
VelocityConf: Rendering Performance Case Studies
addyosmani
333
25k
How to Align SEO within the Product Triangle To Get Buy-In & Support - #RIMC
aleyda
2
1.5k
Data-driven link building: lessons from a $708K investment (BrightonSEO talk)
szymonslowik
1
1.1k
Sam Torres - BigQuery for SEOs
techseoconnect
PRO
0
270
Measuring Dark Social's Impact On Conversion and Attribution
stephenakadiri
2
200
Transcript
There's No Clusterf*ck without a Cluster How @GoVictorOps went from
unicorns and broken to boring and working
Premature availabilization? • Connect you with your monitors • Harass
you when stuff breaks @boulderDanH
Availability is our DNA • Scala • Akka • Kafka
• Shard key
What is clustering?
An online encyclopedia says • Computers working together (appeals to
authority)
A dictionary says • clus·ter noun \ˈkləs-tər\ a number of
similar things that occur together (includes pronunciation for legitimacy)
Our definition • Who is currently in the cluster? •
Tell me when nodes are coming and going • High Availability / scaling
Requirements 1.0 1. Logical actor tree 2. Service discovery 3.
Lead me to success
Logical actor tree • Failover • Hand off
Service discovery • “cluster://user/victorops/broadcaster” ! “hello”
Tradeoffs are everywhere Vector clocks are totally cool Async consensus?
None
Implementation • Routers / Patterns • Native = Truth
Actor state • Easy and Tempting • Painful to unwind
None
None
What could go wrong? • Partitions are permanent • Want
some config? How about six! ◦ failure-detector.threshold x 2 ◦ failure-detector.min-std-deviation x 2 ◦ failure-detector.acceptable-heartbeat-pause x 2 • Hazelcast uses hazelcast.max.no.heartbeat.seconds • ZooKeeper uses “session timeout”
More picking on Akka • Logging during failures is sparse
• Remoting / Failure detection weren’t bulkheaded
Recap 1. Logical actor tree 2. Service discovery 3. Lead
me to success
Requirements 2.0 1. Member lists 2. Easy to configure, ability
to add machines w/o config 3. Pass remoting address around
None
What is Hazelcast? • Distributed maps & locks • Multicast
(IGMP)
Implementation akka.remote.quarantine-systems-for = "off" akka.remote.gate-invalid-addresses-for = 0 s src: akka-devel
• Publish Akka address using a map • Detect nodes joining / leaving cluster
• Multicast • In memory • Cluster Client • Member
list isn’t consistent across cluster What went wrong?
Recap on requirements 2.0 1. Member lists 2. Easy to
configure 3. Pass remoting address around
Requirements 3.0 1. Member list is consistent 2. Cluster clients
are first class
Cluster Membership • Consistent - Zk • Probably consistent -
Gossip • YOLO consistency - Hazelcast
… no seriously, this is the logo
What is ZooKeeper? • Clustered, consistent file system • API
is focused on building distributed concepts
Implementation • Cluster Membership = EPHEMERAL • Leader Election =
SEQUENTIAL • “Cluster” = EPHEMERAL_SEQUENTIAL • Store akka addresses in ephemeral nodes • Curator project
The Good • Reputation • Strong Consistency • Cluster clients
/ Service Discovery
What was / is hard? • Twitter’s Zk library •
External Cluster Manager
The final tally • Solid concept of membership • Keep
things simple • Log / Graph / Monitor everything
Questions?