Slide 1

Slide 1 text

How We Moved Hosted Chef to Erlang + MySQL and Why You Didn’t Notice
Seth Falcon, Development Lead, @sfalcon
Thursday, May 24, 12

Slide 2

Slide 2 text

Server

Slide 3

Slide 3 text

Hosted Chef, Private Chef, Open Chef
http://www.flickr.com/photos/emzee/

Slide 4

Slide 4 text

Setup: Chef Server API
• Merb, Ruby, Unicorn, Nginx
• Stateless, horizontally scalable
• Talks to
  • CouchDB
  • authorization service (Erlang)
  • Solr

Slide 5

Slide 5 text

Typical Chef Server API Request
1. Public key for authentication
2. Node data from CouchDB (median 22K, 3rd quartile 44K)
3. Authorization check
4. POST, GET, PUT, DELETE

Slide 6

Slide 6 text


Slide 7

Slide 7 text


Slide 8

Slide 8 text

Average Chef Server API Response Times 500 ms

Slide 9

Slide 9 text

CouchDB Uptime

Slide 10

Slide 10 text

Slow, Irregular, and Out of Control

Slide 11

Slide 11 text

Heavy on system resources

Slide 12

Slide 12 text

Heavy on system resources. Why?

Slide 13

Slide 13 text

How much RAM should it use?

Slide 14

Slide 14 text

60 req/sec × 44K = 2.7MB

Slide 15

Slide 15 text

2.7MB data + code + copies... 27MB?

Slide 16

Slide 16 text

100MB at rest, after startup

Slide 17

Slide 17 text

204 MB per unicorn worker under load

Slide 18

Slide 18 text

Concurrency? One request per worker.

Slide 19

Slide 19 text

12 workers per server

Slide 20

Slide 20 text

8 servers

Slide 21

Slide 21 text

12 × 204 MB = 2.4 GB
8 × 2.4 GB = 19.2 GB
for pulling JSON out of a database and returning it
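
The multiplication on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the figures from the preceding slides (204 MB per worker, 12 workers per server, 8 servers); the variable names are illustrative. The slide rounds the per-server figure to 2.4 GB before multiplying, so its fleet total of 19.2 GB is slightly below the unrounded product.

```python
# Figures taken from the slides; names are illustrative only.
MB_PER_WORKER = 204       # one unicorn worker under load
WORKERS_PER_SERVER = 12
SERVERS = 8

per_server_mb = MB_PER_WORKER * WORKERS_PER_SERVER
fleet_mb = per_server_mb * SERVERS

# Report in decimal GB (1 GB = 1000 MB), matching the slide's rounding style.
print(f"per server: {per_server_mb} MB (~{per_server_mb / 1000:.1f} GB)")
print(f"fleet:      {fleet_mb} MB (~{fleet_mb / 1000:.1f} GB)")
```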

Slide 22

Slide 22 text

Unicorns Eat RAM

Slide 23

Slide 23 text

Why do unicorns eat RAM?

Slide 24

Slide 24 text

Ruby MRI Garbage Collector
Slab allocator

Slide 25

Slide 25 text

Ruby MRI Garbage Collector
Mark and sweep

Slide 26

Slide 26 text

Ruby MRI Garbage Collector
Mark and sweep
All objects touched on GC, defeats COW
Code == objects

Slide 27

Slide 27 text

Ruby MRI Garbage Collector
One large request bloats the worker
More CPU needed for GC

Slide 28

Slide 28 text


Slide 29

Slide 29 text

Lightweight, share-nothing processes

Slide 30

Slide 30 text

2.6 KB per process
7 microseconds to spawn a process

Slide 31

Slide 31 text

VM handles concurrency
Efficient use of multi-core

Slide 32

Slide 32 text

per-process GC

Slide 33

Slide 33 text


Slide 34

Slide 34 text


Slide 35

Slide 35 text

Webmachine: a state machine for HTTP
You provide the callbacks, it does the REST
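
To make the "you provide the callbacks" idea concrete, here is a minimal sketch in Python rather than Erlang. It is not Webmachine's API or its full decision flow: the `Resource` base class, the `handle` function, and the request dictionary are hypothetical, and only three decisions are modeled. The callback names are borrowed from Webmachine for flavor.

```python
class Resource:
    """Default callbacks; a concrete resource overrides only what it needs."""

    def allowed_methods(self):
        return ["GET"]

    def is_authorized(self, request):
        return True

    def resource_exists(self, request):
        return True

    def to_json(self, request):
        return "{}"


def handle(resource, request):
    """Walk a fixed sequence of decisions and pick the HTTP status code."""
    if request["method"] not in resource.allowed_methods():
        return 405, ""
    if not resource.is_authorized(request):
        return 401, ""
    if not resource.resource_exists(request):
        return 404, ""
    return 200, resource.to_json(request)


class RoleResource(Resource):
    """Hypothetical read-only roles endpoint backed by an in-memory dict."""

    ROLES = {"a_role": '{"name": "a_role", "json_class": "Chef::Role"}'}

    def is_authorized(self, request):
        return request.get("user") is not None

    def resource_exists(self, request):
        return request["name"] in self.ROLES

    def to_json(self, request):
        return self.ROLES[request["name"]]


status, body = handle(RoleResource(), {"method": "GET", "name": "a_role", "user": "alice"})
print(status, body)
```

The resource code never chooses a status code itself; it only answers questions, and the fixed decision sequence turns the answers into a response.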

Slide 36

Slide 36 text

Some hacking happens

Slide 37

Slide 37 text

{"name": "a_role", "json_class": "Chef::Role"}

Slide 38

Slide 38 text

How did we do?

Slide 39

Slide 39 text

idle: Erlang 19MB, Ruby 100MB
loaded: Erlang 75MB, Ruby 204MB

Slide 40

Slide 40 text

Erlang: 600MB, Ruby: 19.2GB
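
One way to read these totals, as a back-of-the-envelope check: the Ruby figure is the 19.2 GB fleet total from the earlier slides, and the Erlang figure matches 8 servers × 75 MB if one loaded Erlang VM per server is assumed. That one-VM-per-server reading is an assumption, not something the slides state.

```python
SERVERS = 8
ERLANG_LOADED_MB = 75     # per VM under load, from the idle/loaded comparison
RUBY_FLEET_MB = 19200     # 19.2 GB, as stated on the slides

# Assumption: one Erlang VM per server.
erlang_fleet_mb = SERVERS * ERLANG_LOADED_MB
ratio = RUBY_FLEET_MB / erlang_fleet_mb

print(f"Erlang fleet: {erlang_fleet_mb} MB")
print(f"Ruby uses {ratio:.0f}x the memory")
```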

Slide 41

Slide 41 text

But wait! There’s more.

Slide 42

Slide 42 text

Where is the Ruby API spending time?

Slide 43

Slide 43 text

DB calls?

Slide 44

Slide 44 text

JSON parsing/rendering?

Slide 45

Slide 45 text

Crypto?

Slide 46

Slide 46 text

Garbage Collection?

Slide 47

Slide 47 text

Garbage Collection!

Slide 48

Slide 48 text

>40% CPU in GC

Slide 49

Slide 49 text

CPU Usage on Chef Server

Slide 50

Slide 50 text


Slide 51

Slide 51 text

Frequent GET/PUT of node JSON

Slide 52

Slide 52 text

compaction

Slide 53

Slide 53 text

No concurrency accessing a single database (until recently)

Slide 54

Slide 54 text

Motivation: Why not CouchDB?
Database replication unreliable for 1000s of databases.

Slide 55

Slide 55 text

Motivation: Why not CouchDB?
File handle and memory resource leaks

Slide 56

Slide 56 text

Motivation: Why not CouchDB?
It became an operations “thing”

Slide 57

Slide 57 text

What we need in a data store
• Happy with write-heavy load
• Support for sophisticated queries
• Able to run HA

Slide 58

Slide 58 text

Did you consider NoSQL database X?

Slide 59

Slide 59 text

Yes, but we also asked: Why not SQL?

Slide 60

Slide 60 text

Measure!
basho_bench

Slide 61

Slide 61 text

So we replaced CouchDB with MySQL

Slide 62

Slide 62 text

while the system was running

Slide 63

Slide 63 text

Live Migration: Starts out easy!

Slide 64

Slide 64 text

Live Migration: Starts out easy!

Slide 65

Slide 65 text

Live Migration in 3 Easy Steps
1. Put org into read-only mode
2. Copy from CouchDB to MySQL
3. Route org to Erchef

Slide 66

Slide 66 text

It Gets Harder

Slide 67

Slide 67 text

Migration Tool
1. Coordinate feature flippers and load balancer config
2. Move batches of orgs through migration
3. Track status of migration and individual orgs
4. Resume after crash
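
The batch, status-tracking, and resume requirements above can be sketched as follows. This is an illustrative Python sketch, not the talk's Erlang tool: the `MigrationLog` class, the `run_batch` function, the state names, and the JSON status file are all hypothetical stand-ins for whatever local store the real tool used.

```python
import json
import os
import tempfile


class MigrationLog:
    """Per-org migration status, persisted so a restart can pick up where it left off."""

    def __init__(self, path):
        self.path = path
        self.status = {}
        if os.path.exists(path):
            with open(path) as f:
                self.status = json.load(f)

    def mark(self, org, state):
        self.status[org] = state
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.status, f)
        os.replace(tmp, self.path)  # atomic rename: never leaves a half-written file

    def pending(self, orgs):
        return [o for o in orgs if self.status.get(o) != "done"]


def run_batch(log, orgs, migrate_one, batch_size=2):
    """Migrate up to batch_size orgs that are not yet done."""
    for org in log.pending(orgs)[:batch_size]:
        log.mark(org, "in_progress")
        migrate_one(org)
        log.mark(org, "done")


# Demo: a simulated crash on org "b", then a fresh process resuming.
path = os.path.join(tempfile.mkdtemp(), "migration_status.json")
orgs = ["a", "b", "c"]


def flaky(org):
    if org == "b":
        raise RuntimeError("simulated crash")


try:
    run_batch(MigrationLog(path), orgs, flaky, batch_size=3)
except RuntimeError:
    pass

resumed = MigrationLog(path)  # reload state from disk, as after a restart
print(resumed.status)
print(resumed.pending(orgs))
```

An org left in the `in_progress` state after a crash is treated as not done, so the resumed run picks it up again. That only works if migrating an org is safe to repeat.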

Slide 68

Slide 68 text

Real World Hard

Slide 69

Slide 69 text

Migration Tool
1. Track inflight write requests
2. Put org into read-only mode
3. Wait for inflight write requests to complete
4. Migrate org data
5. Reconfig/HUP load balancer
6. Handle errors
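
The per-org sequence above might look like the following in outline. This is a Python sketch under stated assumptions, not the actual tool: `OrgGate`, `migrate_org`, and the step labels are hypothetical names, and the data copy and load balancer reconfiguration are passed in as plain callables.

```python
import threading


class OrgGate:
    """Tracks in-flight writes for one org and can switch it to read-only."""

    def __init__(self):
        self._cond = threading.Condition()
        self._inflight = 0
        self._read_only = False

    def begin_write(self):
        """Return True and count the write, or False if the org is read-only."""
        with self._cond:
            if self._read_only:
                return False
            self._inflight += 1
            return True

    def end_write(self):
        with self._cond:
            self._inflight -= 1
            self._cond.notify_all()

    def set_read_only(self):
        with self._cond:
            self._read_only = True

    def wait_for_drain(self, timeout):
        """Block until no writes are in flight; False if the timeout expires."""
        with self._cond:
            return self._cond.wait_for(lambda: self._inflight == 0, timeout=timeout)


def migrate_org(gate, copy_data, reroute, timeout=5.0):
    """Run the steps in order and return the list of steps reached."""
    steps = []
    gate.set_read_only()
    steps.append("read_only")
    if not gate.wait_for_drain(timeout):
        steps.append("error:drain_timeout")
        return steps
    steps.append("drained")
    try:
        copy_data()
        steps.append("migrated")
        reroute()
        steps.append("rerouted")
    except Exception as exc:
        steps.append("error:" + type(exc).__name__)
    return steps


gate = OrgGate()
print(migrate_org(gate, copy_data=lambda: None, reroute=lambda: None))
```

The sketch leaves the org read-only after an error. A real tool would need to decide whether to roll back or retry at that point.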

Slide 70

Slide 70 text

OTP + gen_fsm =:= Happy Migration Tool
Columns: Organization, Robustness
• state functions ✔
• state record ✔ ✔
• manager/worker processes ✔ ✔
• supervision tree ✔
• DETS local store ✔
FREE REPL

Slide 71

Slide 71 text

No migration plan survives contact with production
http://en.wikiquote.org/wiki/Helmuth_von_Moltke_the_Elder

Slide 72

Slide 72 text

Database CPU: CouchDB vs. MySQL

Slide 73

Slide 73 text

Database Load Average: CouchDB vs. MySQL

Slide 74

Slide 74 text

API Average Latency

Slide 75

Slide 75 text

Chef Server Roles Endpoint 90th Percentile Latency

Slide 76

Slide 76 text

Chef Server Roles Endpoint 90th Percentile Latency

Slide 77

Slide 77 text

Database Memory: CouchDB vs. MySQL

Slide 78

Slide 78 text

CouchDB Write Requests

Slide 79

Slide 79 text

CouchDB Network Traffic

Slide 80

Slide 80 text

Network traffic on Chef Server

Slide 81

Slide 81 text

Progress Report (Endpoint: Status)
• Nodes: Deployed
• Search: Deployed
• Roles: Coded
• Data Bags: Coded
• Environments: in-progress
• Cookbooks: in-progress
• Sandboxes: in-progress

Slide 82

Slide 82 text
