Resilient Infrastructure Automation with Serf

Evan Gilman
September 28, 2015

Infrastructure orchestration systems are a family of tools that dispatch commands against a set of remote hosts in a controlled (often ordered) fashion. MCollective, Fabric, and Ansible are a few of them.

In this talk we'll discuss Serf and Blender as another system orchestration tool. This tooling was born out of our need for a similar but network-tolerant tool. We maintain large, distributed clusters over the WAN, which need to withstand network outages. Serf's masterless, gossip-style event dispatch mechanism and its ability to execute handlers upon receiving events helped us build our own tools, in which Serf acts as the message dispatching mechanism and we implement the "what to do if this event is received" part. Currently we use this to automate fleet-wide Chef runs, periodic Cassandra operations (such as restores and compactions/repairs), and more.
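
As a rough illustration of the "what to do if this event is received" part, here is a minimal sketch of a Serf event handler script in Python. It relies on Serf's handler interface: the agent runs the script with `SERF_EVENT` set, with `SERF_USER_EVENT` or `SERF_QUERY_NAME` carrying the event or query name, and with the payload on stdin. The event names `chef-run` and `cassandra-repair` and the two placeholder functions are assumptions made up for this example; they are not taken from the talk or from Blender.

```python
#!/usr/bin/env python3
"""Sketch of a Serf event handler; event names here are hypothetical."""
import os
import sys


def run_chef(payload):
    # Hypothetical placeholder: a real handler would start chef-client here.
    return "chef run accepted: " + payload.strip()


def repair(payload):
    # Hypothetical placeholder: a real handler would start a Cassandra repair.
    return "repair accepted: " + payload.strip()


# Hypothetical mapping from event/query name to the action to take.
HANDLERS = {"chef-run": run_chef, "cassandra-repair": repair}


def dispatch(env, payload, handlers=HANDLERS):
    """Return the response text for one invocation, or None if ignored."""
    event = env.get("SERF_EVENT")
    if event == "user":
        name = env.get("SERF_USER_EVENT")
    elif event == "query":
        name = env.get("SERF_QUERY_NAME")
    else:
        return None  # membership events such as member-join are ignored here
    handler = handlers.get(name)
    return handler(payload) if handler else None


# Only act when invoked by a Serf agent, which always sets SERF_EVENT.
if __name__ == "__main__" and "SERF_EVENT" in os.environ:
    response = dispatch(os.environ, sys.stdin.read())
    if response is not None:
        # For queries, Serf returns the handler's stdout to the requester.
        print(response)
```

A script like this would typically be registered with the agent through its event handler option, so that the agent invokes it for each matching event or query.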

Transcript

  1. 9/27/15 @evan2645 RESILIENT INFRASTRUCTURE ORCHESTRATION WITH SERF Maintenance Tasks

    • Cassandra Repairs • Firewall Updates • Chef-client Run Coordination • and more…
  2. Maintenance Tasks

    • Some of it is done in Cron w/ Chef • Some of it is done with SSH
  3. What We Needed

    • Must be FOSS • Must be lightweight • Must be easy to deploy • Must be resilient against network failures
  4. Must Be Resilient Against Network Failures

    • WAN introduces ‘exotic’ failure modes [Diagram labels: DC-A, DC-B, DC-C]
  5. Must Be Resilient Against Network Failures

    • WAN introduces ‘exotic’ failure modes • Route failures not infrequent [Diagram labels: DC-A, DC-B, DC-C, X]
  6. Must Be Resilient Against Network Failures

    • WAN introduces ‘exotic’ failure modes • Route failures not infrequent • Can last for a long time [Diagram labels: DC-A, DC-B, DC-C, X]
  7. Existing Options

    • Ansible, MCollective, Pushy, Fabric, etc… • None can handle reachability issues
  8. Existing Options

    • Ansible, MCollective, Pushy, Fabric, etc… • None can handle reachability issues • Serf not affected
  9. Serf Not Affected

    • Gossip for communication • SWIM for membership • Direct/Indirect probe + Suspicion
  10. Serf is Easy

    • Easily deployed • Easily extended (RPC + Event Handlers) • STDERR/STDOUT not needed w/ proper tooling
  11. Serf for Orchestration

    • Event handlers for various tasks • Agents tagged w/ Chef role, DC, more • Queries to pass data and get acks
  12. Serf for Orchestration

    • Serfx • AsyncJob for Job Management • Serf only provides transport/execution…
  13. Blender for Orchestration

    • Ruby DSL • Pluggable discovery drivers • Pluggable execution drivers
  14. Blender for Orchestration

    • Serf used where SSH not required • SSH still used for cloud-neutral bootstrap • Can mix/match drivers
  15. Chef Runs!

    • Runs polluted with external calls • External resources are constrained • Runs must be evenly spread out • Cron hacks only go so far…
  16. Chef Runs!

    • Concurrency and splay must be tunable
  17. Chef Runs!

    • Concurrency and splay must be tunable • Job management using AsyncJob
  18. Chef Runs!

    • Concurrency and splay must be tunable • Job management using AsyncJob • Blender emits Serf queries in batches
  19. Chef Runs!

    • Concurrency and splay must be tunable • Job management using AsyncJob • Blender emits Serf queries in batches • Success measured indirectly
  20. Cassandra!

    • Repair one node/range at a time • Repairs must not overlap • Cron calculation is unwieldy and error-prone
  21. Cassandra!

    • Blender discovers Cassandra hosts via Serf • Blender emits Serf repair queries serially • AsyncJob allows us to block until completion • Cron pain and danger gone!
  22. Looking Forward: Network Bootstrap!

    • PD network resembles full-mesh (p2p policies) • All immediate host dependencies need updates • Automation inside Chef • Currently being extracted into Go binaries
  23. Looking Forward: Network Bootstrap!

    • Blender-based provisioning • Handler ‘subscribes’ to various roles
  24. Looking Forward: Network Bootstrap!

    • Blender-based provisioning • Handler ‘subscribes’ to various roles • Blender fires node join event with role payload
  25. Looking Forward: Network Bootstrap!

    • Blender-based provisioning • Handler ‘subscribes’ to various roles • Blender fires node join event with role payload • Serf can help provisioner orchestration, too!
  26. All of that??

    Yea! Plus, it’s just kind of cool :) • https://github.com/ranjib/serfx • https://github.com/PagerDuty/blender
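
The "Chef Runs!" and "Cassandra!" slides describe dispatching queries in batches with tunable concurrency and splay, and serially for repairs. The scheduling idea can be sketched in plain Python as below. This is not Blender's implementation or API: the function names, the `send_query` callback, and the choice of a uniform random splay are all assumptions made for illustration.

```python
import random


def batches(hosts, concurrency):
    """Split hosts into consecutive groups of at most `concurrency` hosts."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return [hosts[i:i + concurrency] for i in range(0, len(hosts), concurrency)]


def plan_run(hosts, concurrency, splay_seconds, rng=random):
    """Return a list of batches, each a list of (host, delay) pairs.

    The delay is a random splay in [0, splay_seconds] meant to spread the
    load that each host in a batch puts on shared external resources.
    """
    return [
        [(host, rng.uniform(0, splay_seconds)) for host in batch]
        for batch in batches(hosts, concurrency)
    ]


def run_plan(plan, send_query):
    """Dispatch one batch after another and collect each host's result.

    `send_query(host, delay)` is a hypothetical callback standing in for
    whatever emits the query and waits for the job to finish.
    """
    results = {}
    for batch in plan:
        for host, delay in batch:
            results[host] = send_query(host, delay)
    return results
```

With `concurrency=1` and `splay_seconds=0` the plan degenerates to one host per batch, which corresponds to the serial, non-overlapping repairs described for Cassandra.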