9/27/15
@evan2645
Resilient Infrastructure
Orchestration with Serf
EVAN GILMAN
About Me
RESILIENT INFRASTRUCTURE ORCHESTRATION WITH SERF
What We Needed
Why Serf
How Serf
Use Cases
Summary
About PagerDuty
What We Needed
Maintenance Tasks
• Cassandra Repairs
• Firewall Updates
• Chef-client Run Coordination
• and more…
Maintenance Tasks
• Some of it is done in Cron w/ Chef
• Some of it is done with SSH
What We Needed
• Must be FOSS
• Must be lightweight
• Must be easy to deploy
• Must be resilient against network failures
Must Be Resilient Against Network Failures
• WAN introduces ‘exotic’ failure modes
• Route failures not infrequent
• Can last for a long time
[Diagram: data centers DC-A, DC-B, and DC-C, with a failed route marked X]
Must Be Resilient Against Network Failures
[Chart: inter-DC round-trip time (RTT) in ms]
Task execution should not bring down your overall resiliency profile
Why Serf?
Existing Options
• Ansible, MCollective, Pushy, Fabric, etc…
• None can handle reachability issues
• Serf not affected
Serf Not Affected
• Gossip for communication
• SWIM for membership
• Direct/Indirect probe + Suspicion
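The direct/indirect probe idea can be illustrated with a toy model. This is a simplified sketch of SWIM-style probing, not Serf's actual implementation; the `link` and `probe` helpers and the three-site topology are assumptions made up for the example.

```python
from enum import Enum


class State(Enum):
    ALIVE = "alive"
    SUSPECT = "suspect"


def link(a, b):
    """An undirected network link between two nodes."""
    return frozenset((a, b))


def probe(prober, target, helpers, healthy_links):
    """Classify target after one probe round, as seen from prober."""
    # Direct probe: ping the target over the direct route.
    if link(prober, target) in healthy_links:
        return State.ALIVE
    # Indirect probe: ask each helper to ping the target on our behalf.
    for helper in helpers:
        if (link(prober, helper) in healthy_links
                and link(helper, target) in healthy_links):
            return State.ALIVE
    # Nobody could reach it: suspect it rather than declaring it dead,
    # which leaves the target a chance to refute the suspicion.
    return State.SUSPECT


# Hypothetical topology: the DC-A <-> DC-C route has failed, DC-B still sees both.
links = {link("dc-a", "dc-b"), link("dc-b", "dc-c")}
print(probe("dc-a", "dc-c", ["dc-b"], links).name)
```

With the DC-A to DC-C route down, the probe still succeeds through DC-B, which is why a single failed route does not change membership.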
[Diagram sequence: probe results between nodes marked ✓ (reached) and X (failed); when probes fail, a Serf event raises Suspicion, and the suspected node can Veto it: “I’m still here…”]
Serf is Easy
• Easily deployed
• Easily extended (RPC + Event Handlers)
• STDERR/STDOUT not needed w/ proper tooling
How Serf?
Serf for Orchestration
• Event handlers for various tasks
• Agents tagged w/ Chef role, DC, more
• Queries to pass data and get acks
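A handler along these lines might look like the sketch below. Serf passes the event type in `SERF_EVENT`, the query name in `SERF_QUERY_NAME`, agent tags as `SERF_TAG_*` variables, and the payload on stdin; for queries, whatever the handler writes to stdout is returned as the response. The `chef-run` query name, the `role` tag, and the reply format are assumptions for illustration, not the handlers used at PagerDuty.

```python
import os
import sys


def handle(env, payload):
    """Return the text to send back as the query response, or None."""
    if env.get("SERF_EVENT") != "query":
        return None  # this sketch ignores membership and user events
    name = env.get("SERF_QUERY_NAME", "")
    role = env.get("SERF_TAG_ROLE", "unknown")
    if name == "chef-run":  # hypothetical query name
        # A real handler would kick off the job here; this one only acknowledges.
        return f"ack role={role} payload={payload.strip()}"
    return None


# Only act when invoked by a Serf agent, which always sets SERF_EVENT.
if __name__ == "__main__" and "SERF_EVENT" in os.environ:
    reply = handle(os.environ, sys.stdin.read())
    if reply is not None:
        print(reply)
```

Keeping the decision logic in a pure function makes the handler testable without a running agent.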
Serf for Orchestration
• Serfx
• AsyncJob for Job Management
• Serf only provides transport/execution…
Blender for Orchestration
• Ruby DSL
• Pluggable discovery drivers
• Serf used where SSH not required
• SSH still used for cloud-neutral bootstrap
• Can mix/match drivers
Some Use Cases
Chef Runs!
• Runs polluted with external calls
• External resources are constrained
• Runs must be evenly spread out
• Cron hacks only go so far…
Chef Runs!
• Concurrency and splay must be tunable
• Job management using AsyncJob
• Blender emits Serf queries in batches
• Success measured indirectly
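To make tunable concurrency and splay concrete, here is a minimal sketch of how a scheduler could plan batched runs. It is not Blender's code; `plan_batches` and its parameters are hypothetical names chosen for the example.

```python
import random


def plan_batches(hosts, concurrency, splay_seconds, rng=random):
    """Split hosts into batches of at most `concurrency` hosts.

    Each host gets a random delay in [0, splay_seconds] so that the runs
    inside a batch do not all hit shared resources at the same instant.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    batches = []
    for start in range(0, len(hosts), concurrency):
        batch = [(host, rng.uniform(0, splay_seconds))
                 for host in hosts[start:start + concurrency]]
        batches.append(batch)
    return batches


# Hypothetical host names; one query would be emitted per batch.
for batch in plan_batches(["web1", "web2", "web3", "db1", "db2"], 2, 30):
    print([(host, round(delay, 1)) for host, delay in batch])
```

Raising `concurrency` shortens the overall rollout, while raising `splay_seconds` spreads load on external resources within each batch.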
Cassandra!
• Repair one node/range at a time
• Repairs must not overlap
• Cron calculation is unwieldy and error-prone
Cassandra!
• Blender discovers Cassandra hosts via Serf
• Blender emits Serf repair queries serially
• AsyncJob allows us to block until completion
• Cron pain and danger gone!
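The serial pattern above can be sketched as a loop that starts one repair, blocks until it completes, and only then moves on. This is an illustration under assumptions: `start_job` and `job_state` stand in for whatever emits the Serf query and checks job status, and the state names `"running"` and `"finished"` are hypothetical rather than AsyncJob's actual vocabulary.

```python
import time


def run_serially(hosts, start_job, job_state, poll_seconds=5.0, sleep=time.sleep):
    """Run one job per host, never letting two jobs overlap."""
    finished = []
    for host in hosts:
        start_job(host)
        # Block until this host's job leaves the running state.
        state = job_state(host)
        while state == "running":
            sleep(poll_seconds)
            state = job_state(host)
        if state != "finished":
            # Stop here so a failed repair is never followed by another one.
            raise RuntimeError(f"job on {host} ended in state {state!r}")
        finished.append(host)
    return finished
```

Because the next host is not touched until the previous one reports completion, no start-time arithmetic is needed to keep repairs from overlapping.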
Looking Forward: Network Bootstrap!
• PD network resembles full-mesh (p2p policies)
• All immediate host dependencies need updates
• Automation inside Chef
• Currently being extracted into Go binaries
Looking Forward: Network Bootstrap!
• Blender-based provisioning
• Handler ‘subscribes’ to various roles
• Blender fires node join event with role payload
• Serf can help provisioner orchestration, too!
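One way to picture the role subscription is a handler that inspects the event payload and reacts only when the joining node's role is one this host depends on. The slides say only that the event carries a role payload; the JSON shape with `role` and `address` fields, the function name, and the sample roles are assumptions for this sketch.

```python
import json


def address_to_allow(payload, subscribed_roles):
    """Return the new node's address if this host depends on its role, else None."""
    event = json.loads(payload)
    if event["role"] in subscribed_roles:
        return event["address"]
    return None


# Hypothetical subscription list for a host that talks to Cassandra and ZooKeeper.
subscriptions = {"cassandra", "zookeeper"}
payload = '{"role": "cassandra", "address": "10.0.0.5"}'
print(address_to_allow(payload, subscriptions))
```

A host that does not subscribe to the role gets `None` back and leaves its network policy untouched.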
What Did it Buy Us?
Serfx + Blender
Job orchestration that is resilient to network failure!
All of that?? Yea!
Plus, it’s just kind of cool :)
https://github.com/ranjib/serfx
https://github.com/PagerDuty/blender
Thank You!
Q&A