Slide 1

Slide 1 text

9/27/15 @evan2645 Resilient Infrastructure Orchestration with Serf EVAN GILMAN

Slide 2

Slide 2 text

About Me

Slide 3

Slide 3 text

• What We Needed
• Why Serf
• How Serf
• Use Cases
• Summary

Slide 4

Slide 4 text

About PagerDuty

Slide 5

Slide 5 text

What We Needed

Slide 6

Slide 6 text

Maintenance Tasks
• Cassandra Repairs
• Firewall Updates
• Chef-client Run Coordination
• and more…

Slide 7

Slide 7 text

Maintenance Tasks
• Some of it is done in Cron w/ Chef

Slide 8

Slide 8 text

Maintenance Tasks
• Some of it is done in Cron w/ Chef

Slide 9

Slide 9 text

Maintenance Tasks
• Some of it is done in Cron w/ Chef
• Some of it is done with SSH

Slide 10

Slide 10 text

What We Needed
• Must be FOSS
• Must be lightweight
• Must be easy to deploy
• Must be resilient against network failures

Slide 11

Slide 11 text

Must Be Resilient Against Network Failures
[Diagram: DC-A, DC-B, DC-C]

Slide 12

Slide 12 text

Must Be Resilient Against Network Failures
• WAN introduces ‘exotic’ failure modes
[Diagram: DC-A, DC-B, DC-C]

Slide 13

Slide 13 text

Must Be Resilient Against Network Failures
• WAN introduces ‘exotic’ failure modes
• Route failures not infrequent
[Diagram: DC-A, DC-B, DC-C, with an X mark]

Slide 14

Slide 14 text

Must Be Resilient Against Network Failures
• WAN introduces ‘exotic’ failure modes
• Route failures not infrequent
• Can last for a long time
[Diagram: DC-A, DC-B, DC-C, with an X mark]

Slide 15

Slide 15 text

Must Be Resilient Against Network Failures
[Chart: Inter-DC RTT, time in ms]

Slide 16

Slide 16 text

Task execution should not bring down your overall resiliency profile

Slide 17

Slide 17 text

Why Serf?

Slide 18

Slide 18 text

Existing Options

Slide 19

Slide 19 text

Existing Options
• Ansible, MCollective, Pushy, Fabric, etc…

Slide 20

Slide 20 text

Existing Options
• Ansible, MCollective, Pushy, Fabric, etc…
• None can handle reachability issues

Slide 21

Slide 21 text

Existing Options
• Ansible, MCollective, Pushy, Fabric, etc…
• None can handle reachability issues
• Serf not affected

Slide 22

Slide 22 text

Serf Not Affected
• Gossip for communication
• SWIM for membership
• Direct/Indirect probe + Suspicion
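The direct/indirect probe idea can be sketched roughly as follows. This is a simplified illustration of SWIM-style probing, not Serf's actual implementation: the `ping` callable, the `probe` function, and the fan-out `k` are hypothetical names introduced here, and real probes are asynchronous with timeouts.

```python
import random
from typing import Callable, Sequence

# Hypothetical transport for this sketch: returns True if `src` gets an ack from `dst`.
Ping = Callable[[str, str], bool]


def probe(me: str, target: str, members: Sequence[str], ping: Ping, k: int = 3) -> str:
    """Return "alive" or "suspect" for `target`, SWIM-style."""
    # Direct probe: ping the target ourselves.
    if ping(me, target):
        return "alive"
    # Indirect probe: ask up to k other members to ping the target for us.
    helpers = [m for m in members if m not in (me, target)]
    for helper in random.sample(helpers, min(k, len(helpers))):
        # The helper must be reachable from us and must be able to reach the target.
        if ping(me, helper) and ping(helper, target):
            return "alive"
    # No ack over any path: the target becomes a suspect, not yet dead.
    return "suspect"
```

With a single broken route between two members, a third member can still vouch for the target, so the probe reports "alive" even though the direct path is down.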

Slide 23

Slide 23 text

Serf Not Affected

Slide 24

Slide 24 text

Serf Not Affected
[Diagram marks: ✓]

Slide 25

Slide 25 text

Serf Not Affected
[Diagram marks: X]

Slide 26

Slide 26 text

Serf Not Affected
[Diagram marks: X X ✓]

Slide 27

Slide 27 text

Serf Not Affected
[Diagram marks: X X]

Slide 28

Slide 28 text

Serf Not Affected
[Diagram: Serf Event, Suspicion]

Slide 29

Slide 29 text

Serf Not Affected
[Diagram: Serf Event, Veto, "I’m still here…"]
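The suspicion and veto steps can be pictured as a small state machine. The sketch below is an assumption-laden illustration in the spirit of SWIM: the `Member` class, the function names, and the use of an incarnation counter to decide whether a veto is fresh are all introduced here for illustration and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    state: str = "alive"        # "alive", "suspect", or "dead"
    incarnation: int = 0        # bumped by the member itself when it vetoes
    suspected_at: float = 0.0   # time at which suspicion started


def suspect(m: Member, now: float) -> None:
    """A failed probe moves an alive member to suspect, not straight to dead."""
    if m.state == "alive":
        m.state = "suspect"
        m.suspected_at = now


def refute(m: Member, incarnation: int) -> None:
    """The "I'm still here" veto: accepted only if it is newer than what we know."""
    if incarnation > m.incarnation:
        m.incarnation = incarnation
        m.state = "alive"


def expire(m: Member, now: float, timeout: float) -> None:
    """A suspect that never vetoes within the timeout is declared dead."""
    if m.state == "suspect" and now - m.suspected_at >= timeout:
        m.state = "dead"
```

The design point is that a member cut off from one peer gets a window in which to contradict the suspicion before the rest of the cluster gives up on it.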

Slide 30

Slide 30 text

Serf is Easy
• Easily deployed
• Easily extended (RPC + Event Handlers)
• STDERR/STDOUT not needed w/ proper tooling

Slide 31

Slide 31 text

How Serf?

Slide 32

Slide 32 text

Serf for Orchestration
• Event handlers for various tasks
• Agents tagged w/ Chef role, DC, more
• Queries to pass data and get acks
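As a rough illustration of how handlers, tags, and queries fit together: Serf runs a handler as an ordinary executable, describes the event in environment variables such as `SERF_EVENT` and `SERF_QUERY_NAME`, passes the payload on stdin, and treats the handler's stdout as the query response. The query names `whoami` and `run-task`, the `role` and `dc` tag names, and the payload format below are hypothetical, and the sketch assumes tags are exposed as upper-cased `SERF_TAG_*` variables.

```python
import os
import sys
from typing import Mapping


def handle(env: Mapping[str, str], payload: str) -> str:
    """Return the text to send back as the query response ("" for no answer)."""
    if env.get("SERF_EVENT") != "query":
        return ""  # this sketch ignores membership and user events
    role = env.get("SERF_TAG_ROLE", "")  # assumed tag: Chef role
    dc = env.get("SERF_TAG_DC", "")      # assumed tag: datacenter
    name = env.get("SERF_QUERY_NAME", "")
    if name == "whoami":  # hypothetical query name
        return f"{env.get('SERF_SELF_NAME', '')} role={role} dc={dc}"
    if name == "run-task":  # hypothetical query name
        # Hypothetical payload: the role that should act; other roles stay silent.
        if payload.strip() == role:
            return "ack"
    return ""


def main() -> None:
    """Entry point when installed as a handler: payload on stdin, response on stdout."""
    sys.stdout.write(handle(os.environ, sys.stdin.read()))
```

Keeping the logic in `handle` and the I/O in `main` lets the handler be exercised with a plain dictionary instead of a running agent.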

Slide 33

Slide 33 text

Serf for Orchestration
• Serfx

Slide 34

Slide 34 text

Serf for Orchestration
• Serfx
• AsyncJob for Job Management

Slide 35

Slide 35 text

Serf for Orchestration
• Serfx
• AsyncJob for Job Management
• Serf only provides transport/execution…

Slide 36

Slide 36 text

Blender for Orchestration

Slide 37

Slide 37 text

Blender for Orchestration
• Ruby DSL

Slide 38

Slide 38 text

Blender for Orchestration
• Ruby DSL
• Pluggable discovery drivers

Slide 39

Slide 39 text

Blender for Orchestration
• Ruby DSL
• Pluggable discovery drivers
• Pluggable execution drivers

Slide 40

Slide 40 text

Blender for Orchestration
• Serf used where SSH not required
• SSH still used for cloud-neutral bootstrap
• Can mix/match drivers

Slide 42

Slide 42 text

Some Use Cases

Slide 43

Slide 43 text

Chef Runs!
• Runs polluted with external calls
• External resources are constrained
• Runs must be evenly spread out
• Cron hacks only go so far…

Slide 44

Slide 44 text

Chef Runs!
• Concurrency and splay must be tunable

Slide 45

Slide 45 text

Chef Runs!
• Concurrency and splay must be tunable
• Job management using AsyncJob

Slide 46

Slide 46 text

Chef Runs!
• Concurrency and splay must be tunable
• Job management using AsyncJob
• Blender emits serf queries in batches

Slide 47

Slide 47 text

Chef Runs!
• Concurrency and splay must be tunable
• Job management using AsyncJob
• Blender emits serf queries in batches
• Success measured indirectly
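Tunable concurrency and splay might look something like the sketch below. It is not Blender's scheduler: `batches`, `schedule`, and their parameters are hypothetical names, with `concurrency` taken to mean the batch size and `splay_seconds` the upper bound on a random per-host start delay.

```python
import random
from typing import Iterator, List, Optional, Sequence, Tuple


def batches(hosts: Sequence[str], concurrency: int) -> Iterator[List[str]]:
    """Split hosts into consecutive groups of at most `concurrency` members."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    for i in range(0, len(hosts), concurrency):
        yield list(hosts[i:i + concurrency])


def schedule(
    hosts: Sequence[str],
    concurrency: int,
    splay_seconds: float,
    rng: Optional[random.Random] = None,
) -> List[List[Tuple[str, float]]]:
    """Pair every host with a random delay in [0, splay_seconds], batch by batch."""
    rng = rng or random.Random()
    return [
        [(host, rng.uniform(0.0, splay_seconds)) for host in batch]
        for batch in batches(hosts, concurrency)
    ]
```

An orchestrator would then issue one query per batch and wait for that batch to finish before moving on, which is what keeps load on constrained external resources bounded.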

Slide 48

Slide 48 text

Chef Runs!

Slide 49

Slide 49 text

Cassandra!

Slide 50

Slide 50 text

Cassandra!
• Repair one node/range at a time
• Repairs must not overlap
• Cron calculation is unwieldy and error-prone

Slide 51

Slide 51 text

Cassandra!
• Blender discovers cass hosts via Serf
• Blender emits Serf repair queries serially
• AsyncJob allows us to block until completion
• Cron pain and danger gone!
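The serial, block-until-done pattern described above can be sketched as a simple loop. The `start` and `status` callables stand in for whatever starts a job and reports on it; they and the state names `"running"`, `"finished"`, and `"failed"` are assumptions for this sketch, not AsyncJob's actual interface.

```python
import time
from typing import Callable, Iterable, List


def repair_serially(
    hosts: Iterable[str],
    start: Callable[[str], None],
    status: Callable[[str], str],
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Repair one host at a time, never starting the next before the last finishes."""
    done: List[str] = []
    for host in hosts:
        start(host)  # kick off the repair job on this host
        while True:
            state = status(host)  # poll for the job's state
            if state == "finished":
                break
            if state == "failed":
                # Stop the whole run so a later repair cannot overlap a broken one.
                raise RuntimeError(f"repair failed on {host}")
            sleep(poll_seconds)
        done.append(host)
    return done
```

Because the loop only advances after a host reports completion, repairs cannot overlap, which is the property the cron-based schedule had to approximate with time arithmetic.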

Slide 52

Slide 52 text

Looking Forward: Network Bootstrap!
• PD network resembles full-mesh (p2p policies)
• All immediate host dependencies need updates
• Automation inside Chef
• Currently being extracted into Go binaries

Slide 53

Slide 53 text

Looking Forward: Network Bootstrap!
• Blender-based provisioning

Slide 54

Slide 54 text

Looking Forward: Network Bootstrap!
• Blender-based provisioning
• Handler ‘subscribes’ to various roles

Slide 55

Slide 55 text

Looking Forward: Network Bootstrap!
• Blender-based provisioning
• Handler ‘subscribes’ to various roles
• Blender fires node join event with role payload

Slide 56

Slide 56 text

Looking Forward: Network Bootstrap!
• Blender-based provisioning
• Handler ‘subscribes’ to various roles
• Blender fires node join event with role payload
• Serf can help provisioner orchestration, too!
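One way to picture a handler that 'subscribes' to roles is a small dispatch table keyed by role. Everything in this sketch is hypothetical, including the `RoleSubscriptions` class and the `"<role> <hostname>"` payload format, since the slides do not specify either.

```python
from typing import Callable, Dict, List


class RoleSubscriptions:
    """Maps a role name to the callbacks that care about nodes of that role."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[str], None]]] = {}

    def subscribe(self, role: str, callback: Callable[[str], None]) -> None:
        """Register interest in joins of nodes carrying `role`."""
        self._handlers.setdefault(role, []).append(callback)

    def on_join(self, payload: str) -> int:
        """Dispatch a join event; returns how many callbacks were invoked."""
        # Assumed payload format for this sketch: "<role> <hostname>".
        role, _, host = payload.partition(" ")
        callbacks = self._handlers.get(role, [])
        for callback in callbacks:
            callback(host)
        return len(callbacks)
```

A host that depends on Cassandra nodes would subscribe to that role and update its own policies whenever a matching join event arrives, while ignoring joins for roles it does not depend on.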

Slide 57

Slide 57 text

What Did it Buy Us?

Slide 58

Slide 58 text

Serfx + Blender

Slide 59

Slide 59 text

Job Orchestration which is resilient to network failure!

Slide 60

Slide 60 text

All of that?? Yea!

Slide 61

Slide 61 text

All of that?? Yea!
Plus, it’s just kind of cool :)

Slide 62

Slide 62 text

All of that?? Yea!
Plus, it’s just kind of cool :)
https://github.com/ranjib/serfx
https://github.com/PagerDuty/blender

Slide 63

Slide 63 text

Thank You!
Q&A

Slide 64

Slide 64 text

Resilient Infrastructure Orchestration with Serf
EVAN GILMAN