PyCon Ireland 2013 Saturday Morning Keynote - Stephan Gorget (Facebook)

Data Center and Cluster Management at Facebook Stéphan Gorget

Servers Software Data Centers Network

The Site

Let’s talk about clusters

Cluster roles?
Service Cluster, Back-End Cluster, Front-End Cluster
Web, Ads, Multifeed, Other Services, Cache Services

Tools for cluster management
• Provisioning (Kobold)
• Automatic remediation (FBAR)
• Configuration (Configerator)
• Monitoring (VipMonitor and others)

Going to automation

How was provisioning? (2010)
• One person knows it all!
• So we decided to write it down
• No formal testing
• And extremely manual

2010 - Manual configuration

2011 - And then Kobold

What does Kobold do?

What is Kobold?
• Components are Python objects
• Support multiple operations
  • Populate directory service
  • Define cluster-level processes
  • Anything you can code!

What is Kobold?
• With those components, you create services:
  • Set up core services
  • Load balancer configuration
  • Memcache clusters
• Configurable scope
  • World, Region, Datacenter, Cluster, POP

Where?

Show me the code!

    import random

    # Service, Component, ComponentList and all_cluster_types come from
    # Kobold; the slide does not show where they are imported from.

    class Test(Service):
        # Services that must be handled before this one.
        requires = ['Init', 'Test2']

        def __init__(self, options):
            Service.__init__(self, options)
            self.config = ComponentList()
            self.config += [
                RandomComponent(
                    name='my_random_component',
                    targets=all_cluster_types,
                    args={'foo': 'test.{c}'},
                )
            ]


    class RandomComponent(Component):
        def check(self, foo):
            # Fails at random, roughly half of the time.
            value = int(random.random() * 10)
            if value % 2:
                self.err("Magic 8-ball says no.")
                return False
            else:
                self.ok("Magic 8-ball says yay!")
                return True

        def turnup(self, foo):
            self.ok("I just turned up '%s'." % foo)
            return True

        def drain(self, foo):
            self.ok("I just drained '%s'." % foo)
            return True

    ➜ $ kobold check dc1c01 test
    [dc1c01] 1 service(s) lined-up: test.
    [dc1c01] Running 'check' for service 'test'...
    ! [dc1c01/check/test/my_random_component] Magic 8-ball says no.
    ! [dc1c01] Some components failed. Service 'test' failed.
    [dc1c01] 1 service(s) failed: test.

    ➜ $ kobold turnup dc1c01 test
    [dc1c01] 1 service(s) lined-up: test.
    Ready to rock and roll? [yn] y
    [dc1c01] Running 'turnup' for service 'test'...
    ! [dc1c01/turnup/test/my_random_component] Magic 8-ball says no.
    * [dc1c01/turnup/test/my_random_component] I just turned up 'test.prn1c01'.
    [dc1c01] Service 'test' passed!
    [dc1c01] 1 service(s) passed: test.

    ➜ $ kobold turnup dc1c01 test
    [dc1c01] 1 service(s) lined-up: test.
    Ready to rock and roll? [yn] y
    [dc1c01] Running 'turnup' for service 'test'...
    * [dc1c01/turnup/test/my_random_component] Magic 8-ball says yay!
    [dc1c01] Service 'test' passed!
    [dc1c01] 1 service(s) passed: test.
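The two turnup runs above suggest that a component's check is run first and that turnup is called only when the check fails. That reading is inferred from the sample output, not confirmed by the slides. A minimal sketch of the pattern follows; the Component base class, run_turnup and FlagComponent are hypothetical stand-ins rather than Kobold's real API, and the '{c}' placeholder is assumed to expand to the cluster name.

```python
class Component:
    """Hypothetical stand-in for a Kobold-style component base class."""

    def __init__(self, name, args=None):
        self.name = name
        self.args = args or {}
        self.log = []  # collected (level, message) pairs

    def ok(self, message):
        self.log.append(("ok", message))

    def err(self, message):
        self.log.append(("err", message))

    def check(self, **kwargs):
        raise NotImplementedError

    def turnup(self, **kwargs):
        raise NotImplementedError


def run_turnup(component, cluster):
    """Run check first; call turnup only when the check fails."""
    # Assumption: '{c}' in an argument stands for the cluster name.
    args = {key: value.format(c=cluster) for key, value in component.args.items()}
    if component.check(**args):
        return True  # already in the desired state, nothing to do
    return component.turnup(**args)


class FlagComponent(Component):
    """Toy component: 'turned up' means the name is in a shared set."""

    live = set()

    def check(self, foo):
        if foo in self.live:
            self.ok("'%s' is already up." % foo)
            return True
        self.err("'%s' is not up." % foo)
        return False

    def turnup(self, foo):
        self.live.add(foo)
        self.ok("I just turned up '%s'." % foo)
        return True
```

Because turnup is skipped when check already passes, running the same command twice is safe, which matches the second run in the output above.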

Did you say audit?

Kobold UI

How bold is Kobold?

Results
• Turn-up phases reduced
  • LBs configured in < 10 mins vs multiple days
• Cluster turn-up
  • From 7~8 weeks in 2010 to less than a few days

Results
• Service dependencies checked in seconds, instead of a human iterating on data
• No longer rely on tribal knowledge
• Code as documentation
• Some numbers:
  • 20k lines of code
  • 75+ components
  • 90+ services
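Checking service dependencies "in seconds" is the kind of job a topological sort handles well. The sketch below is an assumption about how a requires list like the one on the Test service could be validated and ordered; it is not Kobold's actual implementation. The function name turnup_order and the dict-based input are hypothetical, and graphlib needs Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter


def turnup_order(requires):
    """Return service names ordered so that dependencies come first.

    `requires` maps each service name to the list of services it needs.
    """
    known = set(requires)
    for service, deps in requires.items():
        missing = set(deps) - known
        if missing:
            raise ValueError(
                "%s requires unknown service(s): %s"
                % (service, ", ".join(sorted(missing)))
            )
    try:
        # static_order() yields every node after all of its predecessors.
        return list(TopologicalSorter(requires).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes forming the cycle.
        raise ValueError("dependency cycle: %s" % (exc.args[1],)) from exc
```

For example, turnup_order({"Init": [], "Test2": ["Init"], "Test": ["Init", "Test2"]}) puts Init before Test2 and Test2 before Test, while an unknown or circular dependency is rejected up front instead of being discovered by hand.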

More Automation

SSH in a FOR Loop

    for i in $webserver_list
    do
        ssh "$i" /etc/init.d/webserver restart
    done
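The shell loop above keeps going when a host fails and leaves no record of which ones did. As a first step toward the tooling described next, the same loop can be written in Python so that failures are collected. This is only a sketch: restart_all and its injectable run parameter are hypothetical names, and the command is the one from the slide.

```python
import subprocess


def restart_all(hosts, run=subprocess.run):
    """Restart the web server on each host; return the hosts that failed.

    `run` is injectable so the loop can be exercised without real SSH.
    """
    failed = []
    for host in hosts:
        result = run(["ssh", host, "/etc/init.d/webserver", "restart"])
        # A non-zero exit status means ssh or the restart itself failed.
        if result.returncode != 0:
            failed.append(host)
    return failed
```

The returned list gives a later step something to act on, such as a retry or an escalation to a human.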

Server Break/Fix
• Deploy updated app versions
• Restart services
• Check for hardware failures
• Swap systems
• Clean up log spew

FBAR
• Retrieve Alerts
• Insert Jobs into Queue
• Calculate Jobs

FBAR
• Remediation Plugins
• FBAR API
• Repair API
• HW API
• Config API
• Mon API

FBAR
• Plugin1
• Plugin2
• Repair
• Resolve
• Escalate
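The FBAR slides only name the stages: alerts are retrieved, jobs are queued, plugins attempt a repair, and the job is either resolved or escalated. The sketch below shows one way such a pipeline could be wired together. It is an illustration under those assumptions, not FBAR's real design; the plugin functions, the alert dictionaries and the remediate function are all hypothetical.

```python
from collections import deque


def restart_service(alert):
    """Hypothetical plugin: claims to fix 'service_down' alerts."""
    return alert["kind"] == "service_down"


def clean_logs(alert):
    """Hypothetical plugin: claims to fix 'disk_full' alerts."""
    return alert["kind"] == "disk_full"


def remediate(alerts, plugins):
    """Queue one job per alert, then try the plugins in order on each job."""
    queue = deque(alerts)  # insert jobs into the queue
    outcomes = []
    while queue:
        alert = queue.popleft()
        # any() stops at the first plugin that reports a successful repair.
        fixed = any(plugin(alert) for plugin in plugins)
        # Anything no plugin can repair is escalated to a human.
        outcomes.append((alert["host"], "resolved" if fixed else "escalated"))
    return outcomes
```

With alerts for a downed service, a bad disk and a full disk, the two plugins here would resolve the first and third and escalate the second.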

Keys for automation

Automation
• Use mental model -> translate to code
• Humans are bad at repetitive work
  • They forget
  • They need sleep, or
  • Don’t sleep at all and make mistakes
• Machines don’t need hugs or feeding

Abstractions for automation
• Build abstractions for human work
• Extensible libraries breed bigger tools
• Knobs & levers
  • Turn individual systems on & off
  • Shed load
  • Test things in production or in a subset of systems

Automation Pitfalls
• Can mask systemic problems
• Cascading failures
• Unknown actors
• Cultural fear

Conclusion
• Automate as much as you can
• Write code that is an abstraction
• Test your code
• Create libraries and APIs to control everything

Thank you for your attention
