PyCon Ireland 2013 Saturday Morning Keynote - Stephan Gorget (Facebook)

(via http://python.ie/pycon/2013/talks/datacenter_and_cluster_management__facebook/)
Datacenter and Cluster Management @ Facebook

Facebook is a company that moves fast, which means our infrastructure has to evolve quickly and adapt to changing needs. We add, drain, return to production, and decommission datacenters and clusters almost daily. As Facebook grows, managing all those moving pieces becomes more and more complex. Facebook addresses this by writing software that automates the steps that make up the lifecycle of a cluster.

This talk is about Facebook's journey through that automation process and how Python is used to build such a flexible tool.

PyCon Ireland

October 12, 2013

Transcript

  1. Tools for cluster management • Provisioning (Kobold) • Automatic remediation (FBAR) • Configuration (Configerator) • Monitoring (VipMonitor and others)
  2. How was provisioning? (2010) • One person knows it all! • So we decided to write it down • No formal testing • And extremely manual
  3. What is Kobold? • Components are Python objects • Support multiple operations • Populate directory service • Define cluster-level processes • Anything you can code!
  4. What is Kobold? • With those components, you create services: • Set up core services • Load balancer configuration • Memcache clusters • Configurable scope • World, Region, Datacenter, Cluster, POP
  5. # Service, Component, ComponentList and all_cluster_types are Kobold names not defined on this slide.
     import random

     class Test(Service):
         requires = ['Init', 'Test2']

         def __init__(self, options):
             Service.__init__(self, options)
             self.config = ComponentList()
             self.config += [
                 RandomComponent(
                     name='my_random_component',
                     targets=all_cluster_types,
                     args={'foo': 'test.{c}'},
                 )
             ]

     class RandomComponent(Component):
         def check(self, foo):
             # Fails whenever the random digit (0-9) is odd.
             value = int(random.random() * 10)
             if value % 2:
                 self.err("Magic 8-ball says no.")
                 return False
             else:
                 self.ok("Magic 8-ball says yay!")
                 return True

         def turnup(self, foo):
             self.ok("I just turned up '%s'." % foo)
             return True

         def drain(self, foo):
             self.ok("I just drained '%s'." % foo)
             return True
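
The slide depends on Kobold base classes that the deck does not show. As a rough sketch only, this is one way a component base class could dispatch an operation by name and expand its arguments per cluster. `MiniComponent`, `Greeter`, `run`, and the `log` list are hypothetical names, and the idea that `{c}` is replaced by the cluster name is an assumption drawn from the `'test.{c}'` argument, not something the talk confirms.

```python
class MiniComponent:
    """Hypothetical stand-in for Kobold's Component base class."""

    def __init__(self, name, targets, args):
        self.name = name
        self.targets = targets
        self.args = args
        self.log = []  # (level, message) pairs recorded by ok() and err()

    def ok(self, message):
        self.log.append(("ok", message))

    def err(self, message):
        self.log.append(("err", message))

    def run(self, operation, cluster):
        """Call the method named `operation` with args expanded for `cluster`."""
        # Assumption: '{c}' in an argument value stands for the cluster name.
        expanded = {key: value.format(c=cluster) for key, value in self.args.items()}
        method = getattr(self, operation)
        return method(**expanded)


class Greeter(MiniComponent):
    def turnup(self, foo):
        self.ok("I just turned up '%s'." % foo)
        return True

    def drain(self, foo):
        self.ok("I just drained '%s'." % foo)
        return True


component = Greeter(name="my_component", targets=["frontend"], args={"foo": "test.{c}"})
result = component.run("turnup", "dc1c01")
```

Looking the operation up with `getattr` keeps the base class unaware of which operations exist, which fits the "anything you can code" idea on the earlier slide.
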
  6. ➜ $ kobold check dc1c01 test
     [dc1c01] 1 service(s) lined-up: test.
     [dc1c01] Running 'check' for service 'test'...
     ! [dc1c01/check/test/my_random_component] Magic 8-ball says no.
     ! [dc1c01] Some components failed. Service 'test' failed.
     [dc1c01] 1 service(s) failed: test.
  7. ➜ $ kobold turnup dc1c01 test
     [dc1c01] 1 service(s) lined-up: test.
     Ready to rock and roll? [yn] y
     [dc1c01] Running 'turnup' for service 'test'...
     ! [dc1c01/turnup/test/my_random_component] Magic 8-ball says no.
     * [dc1c01/turnup/test/my_random_component] I just turned up 'test.prn1c01'.
     [dc1c01] Service 'test' passed!
     [dc1c01] 1 service(s) passed: test.
  8. ➜ $ kobold turnup dc1c01 test
     [dc1c01] 1 service(s) lined-up: test.
     Ready to rock and roll? [yn] y
     [dc1c01] Running 'turnup' for service 'test'...
     * [dc1c01/turnup/test/my_random_component] Magic 8-ball says yay!
     [dc1c01] Service 'test' passed!
     [dc1c01] 1 service(s) passed: test.
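
The two `turnup` runs above suggest that the component's `check` runs first and the turn-up step is only performed when the check fails. That reading is inferred from the output and is not stated in the deck. A minimal sketch of that control flow, with the hypothetical helper `turnup_if_needed`:

```python
def turnup_if_needed(check, turnup):
    """Run `check`; call `turnup` only when the check reports a problem.

    Both arguments are zero-argument callables that return True on success.
    Returns a (passed, actions) tuple, where `actions` lists what was run.
    """
    actions = ["check"]
    if check():
        # Already healthy: nothing to do, so repeated runs are harmless.
        return True, actions
    actions.append("turnup")
    return turnup(), actions


healthy = turnup_if_needed(check=lambda: True, turnup=lambda: True)
broken = turnup_if_needed(check=lambda: False, turnup=lambda: True)
```

Under this assumption, running the same turn-up twice is safe, because the second run finds the check passing and skips the work.
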
  9. Results • Turn-up phases reduced • LBs configured in < 10 mins vs multiple days • Cluster turn-up • From 7~8 weeks in 2010 to under a few days
  10. Results • Service dependencies checked in seconds, instead of humans iterating on data • No longer rely on tribal knowledge • Code as documentation • Some numbers: • 20k lines of code • 75+ components • 90+ services
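
One way service dependencies like the `requires` list on the code slide could be checked automatically is a topological sort. The sketch below uses the standard-library `graphlib` module (Python 3.9 or later); the service graph and the helpers `turnup_order` and `has_cycle` are hypothetical, and the deck does not say how Kobold resolves dependencies.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical services, each mapped to the services it requires.
requires = {
    "Init": [],
    "Test2": ["Init"],
    "Test": ["Init", "Test2"],
}


def turnup_order(requires):
    """Return the services in an order where every dependency comes first."""
    return list(TopologicalSorter(requires).static_order())


def has_cycle(requires):
    """Return True when the dependency graph cannot be ordered."""
    try:
        turnup_order(requires)
    except CycleError:
        return True
    return False


order = turnup_order(requires)
```

A cycle check of this kind is the sort of validation that takes seconds in code but is easy to miss when people trace dependencies by hand.
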
  11. SSH in a FOR Loop
      for i in $webserver_list
      do
          ssh $i /etc/init.d/webserver restart
      done
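
For comparison, the same loop can be written as a small Python function that also records which hosts failed. This is only an illustrative sketch: `restart_webservers` and its injectable `run` parameter are hypothetical, and nothing here is taken from Facebook's tooling.

```python
import subprocess


def restart_webservers(hosts, run=subprocess.run):
    """Restart the webserver on each host and return the hosts that failed."""
    failed = []
    for host in hosts:
        # Same command as the shell loop, passed as a list to avoid quoting issues.
        result = run(["ssh", host, "/etc/init.d/webserver", "restart"])
        if result.returncode != 0:
            failed.append(host)
    return failed
```

Accepting `run` as a parameter lets the loop be tested with a fake runner instead of real SSH connections.
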
  12. Server Break/Fix • Deploy updated app versions • Restart services • Check for hardware failures • Swap systems • Clean up log spew
  13. Automation • Use mental model -> translate to code • Humans are bad at repetitive work • They forget • They need sleep or • Don’t sleep at all and make mistakes • Machines don’t need hugs or feeding
  14. Abstractions for automation • Build abstractions for human work • Extensible libraries breed bigger tools • Knobs & levers • Turn individual systems on & off • Shed load • Test things in production or in a subset of systems
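
As one possible illustration of testing on "a subset of systems", the sketch below picks a stable percentage of hosts by hashing their names. The function `canary_subset` and the bucketing scheme are assumptions made for this example and are not described in the talk.

```python
import hashlib


def canary_subset(hosts, percent):
    """Return the hosts whose stable hash falls in the first `percent` buckets."""
    chosen = []
    for host in hosts:
        digest = hashlib.sha256(host.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
        if bucket < percent:
            chosen.append(host)
    return chosen
```

Because the bucket depends only on the host name, raising `percent` grows the subset without dropping hosts that were already selected.
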
  15. Conclusion • Automate as much as you can • Write code that is an abstraction • Test your code • Create libraries and APIs to control everything