Resilient Infrastructure Automation with Serf

Evan Gilman
September 28, 2015

Infrastructure orchestration systems are a family of tools that dispatch commands to a set of remote hosts in a controlled (often ordered) fashion. MCollective, Fabric, and Ansible are a few examples.

In this talk we'll discuss Serf and Blender as another system orchestration toolset. It was born out of our need for a similar tool that tolerates network failures. We maintain large, distributed clusters over the WAN, and they need to withstand network outages. Serf's masterless, gossip-style event dispatch and its ability to execute handlers when events arrive let us build our own tools, in which Serf acts as the message dispatch mechanism and we implement the "what to do if this event is received" part. We currently use this to automate fleet-wide Chef runs, periodic Cassandra operations (such as restores, compactions, and repairs), and more.
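
Serf runs a handler as an ordinary executable: it sets environment variables such as `SERF_EVENT`, `SERF_USER_EVENT`, and `SERF_QUERY_NAME`, passes the payload on stdin, and (for queries) sends the handler's stdout back to the caller. A minimal sketch of the "what to do if this event is received" part could look like this. The task names and the dispatch table are hypothetical and are not taken from the talk.

```python
import os
import sys

# Hypothetical tasks; a real handler would start the actual work here.
def start_chef_run(payload):
    return "chef-run accepted"

def start_repair(payload):
    return "repair accepted for range " + payload.strip()

# Hypothetical table: query or user event name -> function taking the payload.
TASKS = {
    "chef-run": start_chef_run,
    "cassandra-repair": start_repair,
}

def handle(env, payload):
    """Dispatch one Serf event; return the text to send back (queries only)."""
    event = env.get("SERF_EVENT", "")
    if event == "query":
        name = env.get("SERF_QUERY_NAME", "")
    elif event == "user":
        name = env.get("SERF_USER_EVENT", "")
    else:
        # Membership events (member-join, member-leave, ...) are ignored here.
        return ""
    task = TASKS.get(name)
    if task is None:
        return "unknown task: " + name
    return task(payload)

# Only act when Serf has invoked this script (it always sets SERF_EVENT).
if __name__ == "__main__" and "SERF_EVENT" in os.environ:
    sys.stdout.write(handle(os.environ, sys.stdin.read()))
```

Keeping the dispatch in a plain function makes the handler testable without a running Serf agent.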

Transcript

  1. 9/27/15 @evan2645 Resilient Infrastructure Orchestration with Serf EVAN GILMAN

  2. About Me

  3. What We Needed, Why Serf, How Serf, Use Cases, Summary

  4. About PagerDuty

  5. What We Needed

  6. Maintenance Tasks

    • Cassandra Repairs
    • Firewall Updates
    • Chef-client Run Coordination
    • and more…
  7. Maintenance Tasks

    • Some of it is done in Cron w/ Chef
  8. Maintenance Tasks

    • Some of it is done in Cron w/ Chef
  9. Maintenance Tasks

    • Some of it is done in Cron w/ Chef
    • Some of it is done with SSH
  10. What We Needed

    • Must be FOSS
    • Must be lightweight
    • Must be easy to deploy
    • Must be resilient against network failures
  11. Must Be Resilient Against Network Failures

    [Diagram: DC-A, DC-B, DC-C]
  12. Must Be Resilient Against Network Failures

    • WAN introduces ‘exotic’ failure modes
    [Diagram: DC-A, DC-B, DC-C]
  13. Must Be Resilient Against Network Failures

    • WAN introduces ‘exotic’ failure modes
    • Route failures not infrequent
    [Diagram: DC-A, DC-B, DC-C, with an X mark]
  14. Must Be Resilient Against Network Failures

    • WAN introduces ‘exotic’ failure modes
    • Route failures not infrequent
    • Can last for a long time
    [Diagram: DC-A, DC-B, DC-C, with an X mark]
  15. Must Be Resilient Against Network Failures

    [Chart: Inter-DC RTT, time in ms]
  16. Task execution should not bring down your overall resiliency profile
  17. Why Serf?

  18. Existing Options

  19. Existing Options

    • Ansible, MCollective, Pushy, Fabric, etc…
  20. Existing Options

    • Ansible, MCollective, Pushy, Fabric, etc…
    • None can handle reachability issues
  21. Existing Options

    • Ansible, MCollective, Pushy, Fabric, etc…
    • None can handle reachability issues
    • Serf not affected
  22. Serf Not Affected

    • Gossip for communication
    • SWIM for membership
    • Direct/Indirect probe + Suspicion
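
In SWIM-style failure detection, a node that gets no ack from a direct probe asks other members to probe the target on its behalf, and only suspects the target if those indirect probes fail too. The sketch below models reachability as a set of working links. It is an illustration of the idea with hypothetical names, not Serf's actual implementation.

```python
def make_links(pairs):
    """Build a set of undirected links from (a, b) pairs."""
    return {frozenset(p) for p in pairs}

def can_reach(links, a, b):
    """True if the link between a and b is currently up."""
    return frozenset((a, b)) in links

def probe(links, prober, target, helpers):
    """Return 'alive' or 'suspect' for target, as seen by prober."""
    # Direct probe: prober pings target and waits for an ack.
    if can_reach(links, prober, target):
        return "alive"
    # Indirect probe: ask each helper to ping target on our behalf.
    for helper in helpers:
        if can_reach(links, prober, helper) and can_reach(links, helper, target):
            return "alive"
    # No ack by any path: the target becomes suspect, not yet dead.
    return "suspect"

# Three data centers; the route between dc-a and dc-b is down.
links = make_links([("dc-a", "dc-c"), ("dc-b", "dc-c")])
print(probe(links, "dc-a", "dc-b", ["dc-c"]))  # reachable via dc-c
```

With the A-to-B route down, the probe still succeeds through `dc-c`, which is why a single failed route does not cause a false failure report.
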
  23. Serf Not Affected

  24. Serf Not Affected

  25. Serf Not Affected

    [Diagram: X]
  26. Serf Not Affected

    [Diagram: X X ✓]
  27. Serf Not Affected

    [Diagram: X X]
  28. Serf Not Affected

    Serf Event: Suspicion
  29. Serf Not Affected

    Serf Event: Veto (I’m still here…)
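
The suspicion and veto steps can be sketched with SWIM's incarnation numbers: a suspicion rumor marks a member as suspect, the member itself can refute it by announcing a higher incarnation, and an unrefuted suspicion eventually becomes a death declaration. The class and function names here are hypothetical and the timeout is reduced to an explicit `expire` call.

```python
class Member:
    """One node's view of a peer: state plus the peer's incarnation number."""
    def __init__(self, name):
        self.name = name
        self.state = "alive"
        self.incarnation = 0

def suspect(member, incarnation):
    """Apply a 'suspect' rumor; rumors about an older incarnation are ignored."""
    if member.state == "alive" and incarnation >= member.incarnation:
        member.state = "suspect"
        member.incarnation = incarnation

def refute(member, incarnation):
    """Apply an 'alive' message from the node itself ("I'm still here")."""
    # Only a strictly newer incarnation overrides a suspicion.
    if incarnation > member.incarnation:
        member.state = "alive"
        member.incarnation = incarnation

def expire(member):
    """Suspicion timeout elapsed with no refutation: declare the node dead."""
    if member.state == "suspect":
        member.state = "dead"
```

A suspected node that is still reachable by gossip bumps its incarnation and vetoes the suspicion; one that stays silent is declared dead when the timeout expires.
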
  30. Serf is Easy

    • Easily deployed
    • Easily extended (RPC + Event Handlers)
    • STDERR/STDOUT not needed w/ proper tooling
  31. How Serf?

  32. Serf for Orchestration

    • Event handlers for various tasks
    • Agents tagged w/ Chef role, DC, more
    • Queries to pass data and get ack’s
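
Tagging agents with role and data center makes it possible to aim a query at a subset of the fleet. The sketch below shows the selection step only, assuming each tag filter is a regular expression that must match the whole tag value. The member names and tag values are made up for illustration.

```python
import re

def select(members, tag_filters):
    """Return the names of members whose tags match every filter.

    tag_filters maps a tag name to a regular expression; a member matches
    when each named tag exists and its whole value matches the expression.
    """
    chosen = []
    for name, tags in members.items():
        if all(
            tag in tags and re.fullmatch(pattern, tags[tag])
            for tag, pattern in tag_filters.items()
        ):
            chosen.append(name)
    return sorted(chosen)

# Hypothetical fleet: node name -> tags set on its Serf agent.
members = {
    "cass-1": {"role": "cassandra", "dc": "dc-a"},
    "cass-2": {"role": "cassandra", "dc": "dc-b"},
    "web-1": {"role": "app", "dc": "dc-a"},
}
print(select(members, {"role": "cassandra", "dc": "dc-(a|b)"}))
```
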
  33. Serf for Orchestration

    • Serfx
  34. Serf for Orchestration

    • Serfx
    • AsyncJob for Job Management
  35. Serf for Orchestration

    • Serfx
    • AsyncJob for Job Management
    • Serf only provides transport/execution…
  36. Blender for Orchestration

  37. Blender for Orchestration

    • Ruby DSL
  38. Blender for Orchestration

    • Ruby DSL
    • Pluggable discovery drivers
  39. Blender for Orchestration

    • Ruby DSL
    • Pluggable discovery drivers
    • Pluggable execution drivers
  40. Blender for Orchestration

    • Serf used where SSH not required
    • SSH still used for cloud-neutral bootstrap
    • Can mix/match drivers
  41. (no text)

  42. Some Use Cases

  43. Chef Runs!

    • Runs polluted with external calls
    • External resources are constrained
    • Runs must be evenly spread out
    • Cron hacks only go so far…
  44. Chef Runs!

    • Concurrency and splay must be tunable
  45. Chef Runs!

    • Concurrency and splay must be tunable
    • Job management using AsyncJob
  46. Chef Runs!

    • Concurrency and splay must be tunable
    • Job management using AsyncJob
    • Blender emits serf queries in batches
  47. Chef Runs!

    • Concurrency and splay must be tunable
    • Job management using AsyncJob
    • Blender emits serf queries in batches
    • Success measured indirectly
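
Tunable concurrency and splay can be pictured as a planning step: hosts are split into batches of a fixed size, and each host in a batch gets a random start delay so the runs do not hit shared resources at the same instant. This is a sketch of that idea only; the function name and parameters are hypothetical and do not reflect Blender's API.

```python
import random

def plan_batches(hosts, concurrency, max_splay, rng=None):
    """Split hosts into batches and give each host a start delay in seconds."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    rng = rng or random.Random()
    plan = []
    for start in range(0, len(hosts), concurrency):
        batch = hosts[start:start + concurrency]
        # Each host waits a random time in [0, max_splay] before starting.
        plan.append([(host, rng.uniform(0, max_splay)) for host in batch])
    return plan

# Two hosts at a time, each delayed by up to 30 seconds.
for batch in plan_batches(["a", "b", "c", "d", "e"], 2, 30):
    print([host for host, _ in batch])
```
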
  48. Chef Runs!

  49. Cassandra!

  50. Cassandra!

    • Repair one node/range at a time
    • Repairs must not overlap
    • Cron calculation is unwieldy and error-prone
  51. Cassandra!

    • Blender discovers cass hosts via Serf
    • Blender emits Serf repair queries serially
    • AsyncJob allows us to block until completion
    • Cron pain and danger gone!
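
Running repairs serially and blocking until each one completes amounts to a start-then-poll loop per host. The sketch below assumes two caller-supplied functions, one that starts the job and one that reports its state; both names and the state strings are hypothetical stand-ins for whatever the job manager exposes.

```python
import time

def run_serially(hosts, start_job, job_state, poll_interval=5.0, sleep=time.sleep):
    """Run a job on each host in order, waiting for each to finish.

    start_job(host) kicks off the job (for example via a Serf query) and
    job_state(host) returns 'running', 'finished' or 'failed'.
    Raises RuntimeError at the first failure, so later hosts are untouched.
    """
    done = []
    for host in hosts:
        start_job(host)
        state = job_state(host)
        # Block here so the next repair never overlaps with this one.
        while state == "running":
            sleep(poll_interval)
            state = job_state(host)
        if state != "finished":
            raise RuntimeError("job failed on " + host)
        done.append(host)
    return done
```

Because the loop does not move on until a host reports completion, overlap is ruled out by construction instead of by carefully spaced cron entries.
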
  52. Looking Forward: Network Bootstrap!

    • PD network resembles full-mesh (p2p policies)
    • All immediate host dependencies need updates
    • Automation inside Chef
    • Currently being extracted into Go binaries
  53. Looking Forward: Network Bootstrap!

    • Blender-based provisioning
  54. Looking Forward: Network Bootstrap!

    • Blender-based provisioning
    • Handler ‘subscribes’ to various roles
  55. Looking Forward: Network Bootstrap!

    • Blender-based provisioning
    • Handler ‘subscribes’ to various roles
    • Blender fires node join event with role payload
  56. Looking Forward: Network Bootstrap!

    • Blender-based provisioning
    • Handler ‘subscribes’ to various roles
    • Blender fires node join event with role payload
    • Serf can help provisioner orchestration, too!
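
One way to read the "subscribes" idea: each host knows which roles it depends on, and when a join event arrives carrying the new node's role, only subscribed hosts update their network policy. The sketch below is speculative. The subscription table, the "role address" payload format, and the returned rule text are all assumptions made for illustration.

```python
# Hypothetical subscriptions: role of the joining node -> roles that must react.
SUBSCRIPTIONS = {
    "cassandra": {"app", "cassandra"},
    "app": {"loadbalancer"},
}

def should_update(my_role, joined_role):
    """True if a node with my_role must react when joined_role joins."""
    return my_role in SUBSCRIPTIONS.get(joined_role, set())

def on_join(my_role, payload):
    """Handle a join event whose payload is assumed to be 'role address'."""
    joined_role, address = payload.split()
    if should_update(my_role, joined_role):
        # A real handler would update the host's policy for this peer here.
        return "allow " + address
    return None

print(on_join("app", "cassandra 10.0.0.5"))
```
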
  57. What Did it Buy Us?

  58. Serfx + Blender

  59. Job Orchestration which is resilient to network failure!

  60. All of that??

    Yea!
  61. All of that??

    Yea! Plus, it’s just kind of cool :)
  62. All of that??

    Yea! Plus, it’s just kind of cool :)
    https://github.com/ranjib/serfx
    https://github.com/PagerDuty/blender
  63. Thank You! Q&A

  64. Resilient Infrastructure Orchestration with Serf, Evan Gilman