Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Data Center and Cluster Management at Facebook
Stéphan Gorget

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Servers, Software, Data Centers, Network

Slide 5

Slide 5 text

The Site

Slide 6

Slide 6 text

Let’s talk about clusters

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Cluster roles?
Service Cluster, Back-End Cluster, Front-End Cluster
Web Ads Multifeed Other Services Cache Services

Slide 12

Slide 12 text

Tools for cluster management
• Provisioning (Kobold)
• Automatic remediation (FBAR)
• Configuration (Configerator)
• Monitoring (VipMonitor and others)

Slide 13

Slide 13 text

Moving toward automation

Slide 14

Slide 14 text

How was provisioning done? (2010)
• One person knew it all!
• So we decided to write it down
• No formal testing
• Extremely manual

Slide 15

Slide 15 text

2010 - Manual configuration

Slide 16

Slide 16 text

2011 - And then Kobold

Slide 17

Slide 17 text

What does Kobold do?

Slide 18

Slide 18 text

What is Kobold?
• Components are Python objects
• Support multiple operations
• Populate directory service
• Define cluster-level processes
• Anything you can code!

Slide 19

Slide 19 text

What is Kobold?
• With those components, you create services:
  • Set up core services
  • Load balancer configuration
  • Memcache clusters
• Configurable scope
  • World, Region, Datacenter, Cluster, POP

Slide 20

Slide 20 text

Where?

Slide 21

Slide 21 text

Show me the code!

Slide 22

Slide 22 text

import random

# Service, Component, ComponentList and all_cluster_types come from Kobold itself.

class Test(Service):
    # Services that must be in place before this one.
    requires = ['Init', 'Test2']

    def __init__(self, options):
        Service.__init__(self, options)
        self.config = ComponentList()
        self.config += [
            RandomComponent(
                name='my_random_component',
                targets=all_cluster_types,
                args={'foo': 'test.{c}'},
            )
        ]

class RandomComponent(Component):
    def check(self, foo):
        # Fails at random, to demonstrate both outcomes.
        value = int(random.random() * 10)
        if value % 2:
            self.err("Magic 8-ball says no.")
            return False
        else:
            self.ok("Magic 8-ball says yay!")
            return True

    def turnup(self, foo):
        self.ok("I just turned up '%s'." % foo)
        return True

    def drain(self, foo):
        self.ok("I just drained '%s'." % foo)
        return True
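The Service and Component base classes are Kobold internals that the slides do not show. The following is only a minimal sketch of how a framework of this shape could dispatch an operation to each component. The Component and Service stand-ins, the run_operation helper, the EchoComponent and Demo classes, and the reading of '{c}' as a cluster-name placeholder are all assumptions made for illustration.

```python
# Hypothetical stand-ins for Kobold's base classes; the real ones are not shown in the talk.
class Component:
    def __init__(self, name, args):
        self.name = name
        self.args = args
        self.messages = []

    def ok(self, message):
        # Record a success message, marked '*' as in the CLI output.
        self.messages.append(("*", message))

    def err(self, message):
        # Record a failure message, marked '!' as in the CLI output.
        self.messages.append(("!", message))


class Service:
    requires = []

    def __init__(self, options=None):
        self.options = options or {}
        self.config = []


def run_operation(service, operation, cluster):
    """Call `operation` (e.g. 'check', 'turnup', 'drain') on every component."""
    passed = True
    for component in service.config:
        # Assumption: '{c}' in an argument is replaced by the cluster name.
        kwargs = {k: v.format(c=cluster) for k, v in component.args.items()}
        method = getattr(component, operation)
        if not method(**kwargs):
            passed = False
    return passed


class EchoComponent(Component):
    def check(self, foo):
        self.ok("Checked '%s'." % foo)
        return True

    def turnup(self, foo):
        self.ok("I just turned up '%s'." % foo)
        return True


class Demo(Service):
    def __init__(self, options=None):
        Service.__init__(self, options)
        self.config = [EchoComponent(name="echo", args={"foo": "test.{c}"})]


demo = Demo()
result = run_operation(demo, "turnup", "dc1c01")
print(result, demo.config[0].messages)
```

Looking up the operation by name with getattr means a new operation only needs a new method on the component, which fits the "anything you can code" idea.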

Slide 23

Slide 23 text

➜ $ kobold check dc1c01 test
[dc1c01] 1 service(s) lined-up: test.
[dc1c01] Running 'check' for service 'test'...
! [dc1c01/check/test/my_random_component] Magic 8-ball says no.
! [dc1c01] Some components failed. Service 'test' failed.
[dc1c01] 1 service(s) failed: test.

Slide 24

Slide 24 text

➜ $ kobold turnup dc1c01 test
[dc1c01] 1 service(s) lined-up: test.
Ready to rock and roll? [yn] y
[dc1c01] Running 'turnup' for service 'test'...
! [dc1c01/turnup/test/my_random_component] Magic 8-ball says no.
* [dc1c01/turnup/test/my_random_component] I just turned up 'test.prn1c01'.
[dc1c01] Service 'test' passed!
[dc1c01] 1 service(s) passed: test.

Slide 25

Slide 25 text

➜ $ kobold turnup dc1c01 test
[dc1c01] 1 service(s) lined-up: test.
Ready to rock and roll? [yn] y
[dc1c01] Running 'turnup' for service 'test'...
* [dc1c01/turnup/test/my_random_component] Magic 8-ball says yay!
[dc1c01] Service 'test' passed!
[dc1c01] 1 service(s) passed: test.

Slide 26

Slide 26 text

Did you say audit?

Slide 27

Slide 27 text

Kobold UI

Slide 28

Slide 28 text

How bold is Kobold?

Slide 29

Slide 29 text

Results
• Turn-up phases reduced
  • LBs configured in under 10 minutes, versus multiple days before
• Cluster turn-up
  • From 7 to 8 weeks in 2010 to less than a few days

Slide 30

Slide 30 text

Results
• Service dependencies checked in seconds, instead of humans iterating over data
• No more reliance on tribal knowledge
• Code as documentation
• Some numbers:
  • 20k lines of code
  • 75+ components
  • 90+ services
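The talk does not say how Kobold checks service dependencies. One common way to do it in seconds is a topological sort over the `requires` lists, sketched here with the standard-library graphlib module (Python 3.9+). The REQUIRES map and the turnup_order helper are hypothetical; only the 'Init', 'Test2' and 'Test' names echo the earlier example.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical service-to-dependencies map, mirroring the `requires` lists.
REQUIRES = {
    "Init": [],
    "Test2": ["Init"],
    "Test": ["Init", "Test2"],
}


def turnup_order(requires):
    """Return services in an order where every dependency comes first."""
    # Reject dependencies that name a service nobody defined.
    missing = {dep for deps in requires.values() for dep in deps} - set(requires)
    if missing:
        raise ValueError("Unknown services required: %s" % sorted(missing))
    try:
        return list(TopologicalSorter(requires).static_order())
    except CycleError as exc:
        # The second element of args holds the nodes that form the cycle.
        raise ValueError("Dependency cycle: %s" % (exc.args[1],)) from exc


order = turnup_order(REQUIRES)
print(order)
```

A missing or circular dependency surfaces as an error before any component runs, instead of partway through a turn-up.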

Slide 31

Slide 31 text

More Automation

Slide 32

Slide 32 text

SSH in a FOR Loop
for i in $webserver_list
do
    ssh $i /etc/init.d/webserver restart
done

Slide 33

Slide 33 text

Server Break/Fix
• Deploy updated app versions
• Restart services
• Check for hardware failures
• Swap systems
• Clean up log spew

Slide 34

Slide 34 text

FBAR: Retrieve Alerts, Insert Jobs into Queue, Calculate Jobs

Slide 35

Slide 35 text

FBAR: Remediation Plugins, FBAR API, Repair API, HW API, Config API, Mon API

Slide 36

Slide 36 text

FBAR: Plugin1, Plugin2, Repair, Resolve, Escalate
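The FBAR slides only show diagram labels, so the flow below is a rough sketch and not FBAR's actual design. It assumes alerts are collapsed into jobs, queued, and handed to plugins in turn, with each job ending in repair, resolve, or escalate. Every function name, the alert fields, and the de-duplication step are hypothetical.

```python
from collections import deque

# All names below are hypothetical; the talk does not show FBAR's code.


def restart_service(alert):
    """Toy plugin: resolves 'service_down' alerts, ignores everything else."""
    if alert["type"] == "service_down":
        return "resolve"
    return None


def hardware_check(alert):
    """Toy plugin: sends 'disk_error' alerts to repair, ignores everything else."""
    if alert["type"] == "disk_error":
        return "repair"
    return None


PLUGINS = [restart_service, hardware_check]


def calculate_jobs(alerts):
    """Assumed step: collapse duplicate alerts into one job per (host, type)."""
    seen = set()
    jobs = []
    for alert in alerts:
        key = (alert["host"], alert["type"])
        if key not in seen:
            seen.add(key)
            jobs.append(alert)
    return jobs


def process(alerts):
    """Queue the jobs, try each plugin in order, escalate what no plugin handles."""
    queue = deque(calculate_jobs(alerts))
    outcomes = {}
    while queue:
        job = queue.popleft()
        outcome = None
        for plugin in PLUGINS:
            outcome = plugin(job)
            if outcome is not None:
                break
        outcomes[(job["host"], job["type"])] = outcome or "escalate"
    return outcomes


alerts = [
    {"host": "web001", "type": "service_down"},
    {"host": "web001", "type": "service_down"},  # duplicate alert
    {"host": "web002", "type": "disk_error"},
    {"host": "web003", "type": "unknown"},
]
outcomes = process(alerts)
print(outcomes)
```

Escalating by default keeps a human in the loop for anything the plugins do not recognise.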

Slide 37

Slide 37 text

Keys for automation

Slide 38

Slide 38 text

Automation
• Take your mental model and translate it to code
• Humans are bad at repetitive work
  • They forget
  • They need sleep, or don’t sleep at all and make mistakes
• Machines don’t need hugs or feeding

Slide 39

Slide 39 text

Abstractions for automation
• Build abstractions for human work
• Extensible libraries breed bigger tools
• Knobs & levers
  • Turn individual systems on & off
  • Shed load
  • Test things in production or in a subset of systems
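One way to build a knob that tests a change on a subset of systems is to hash each host name into a stable bucket. This is a generic sketch and not a tool from the talk; the in_subset function, the knob name, and the host names are all hypothetical.

```python
import hashlib


def in_subset(host, knob, percent):
    """Deterministically place `host` in one of 100 buckets for this knob."""
    digest = hashlib.sha256(("%s:%s" % (knob, host)).encode("utf-8")).hexdigest()
    # Hosts whose bucket number is below `percent` have the knob turned on.
    return int(digest, 16) % 100 < percent


# Hypothetical host names and knob name.
hosts = ["web%03d" % i for i in range(1000)]
enabled = [h for h in hosts if in_subset(h, "new_cache_policy", 5)]
print(len(enabled), "of", len(hosts), "hosts have the knob on")
```

Because the bucket depends only on the knob and host names, raising the percentage keeps every host that was already enabled, so a rollout can grow without flapping.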

Slide 40

Slide 40 text

Automation Pitfalls
• Can mask systemic problems
• Cascading failures
• Unknown actors
• Cultural fear
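A common guard against automation causing cascading failures is to cap how much of a cluster it may touch at once and hand off to a human beyond that. The talk does not describe such a mechanism; the allow_remediation function and its 5 percent default are assumptions for illustration.

```python
def allow_remediation(in_progress, cluster_size, max_percent=5):
    """Return True if one more automated repair stays within the cap.

    `in_progress` is the number of hosts already being remediated.
    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    if cluster_size <= 0:
        raise ValueError("cluster_size must be positive")
    return (in_progress + 1) * 100 <= max_percent * cluster_size


# With 100 hosts and a 5% cap, the fifth repair is allowed, the sixth is not.
print(allow_remediation(4, 100), allow_remediation(5, 100))
```

When the check returns False, a sensible next step is to stop and escalate, since many simultaneous failures often point to a systemic problem.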

Slide 41

Slide 41 text

Conclusion
• Automate as much as you can
• Write code that is an abstraction
• Test your code
• Create libraries and APIs to control everything

Slide 42

Slide 42 text

Thank you for your attention

Slide 43

Slide 43 text

No content