How We Moved Hosted Chef to Erlang + MySQL and Why You Didn’t Notice
Seth Falcon
Development Lead
@sfalcon
Thursday, May 24, 12
Slide 2
Server
Slide 3
http://www.flickr.com/photos/emzee/
Hosted Chef
Private Chef
Open Chef
Slide 4
Setup: Chef Server API
• Merb, Ruby, Unicorn, Nginx
• Stateless, horizontally scalable
• Talks to:
  • CouchDB
  • authorization service (Erlang)
  • Solr
Slide 5
Typical Chef Server API Request
1. Public key for authentication
2. Node data from CouchDB (median 22K, 3rd quartile 44K)
3. Authorization check
4. POST, GET, PUT, DELETE
Slide 6
Slide 7
Slide 8
Average Chef Server API Response Times
500 ms
Slide 9
CouchDB Uptime
Slide 10
Slow, Irregular, and Out of Control
Slide 11
Heavy on system resources
Slide 12
Heavy on system resources
Why?
Slide 13
How much RAM should it use?
Slide 14
60 req/sec × 44K = 2.7MB
Slide 15
2.7MB data + code + copies...
27MB?
Slide 16
100MB
at rest, after startup
Slide 17
204 MB
per unicorn worker
under load
Slide 18
Concurrency?
One request per worker.
Slide 19
12 workers per server
Slide 20
8 servers
Slide 21
12 × 204 MB ≈ 2.4 GB
8 × 2.4 GB = 19.2 GB
for pulling JSON out of a database and returning it
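The back-of-envelope arithmetic from these slides can be replayed in a few lines. This is only a sketch using the figures quoted above (60 req/sec, 44K third-quartile node size, 204 MB per worker, 12 workers, 8 servers); the variable names are illustrative and decimal units (1 MB = 1000 KB) are an assumption. Without intermediate rounding the payload comes out nearer 2.6 MB and the fleet total nearer 19.6 GB, so the slide figures are rounded estimates.

```python
# Figures quoted on the slides; the names are illustrative, not from the talk.
REQ_PER_SEC = 60
NODE_KB = 44              # 3rd-quartile node JSON size
WORKER_MB = 204           # observed per unicorn worker under load
WORKERS_PER_SERVER = 12
SERVERS = 8

# Payload handled per second if every request carried a 44K node document.
# Assumes decimal units: 1 MB = 1000 KB.
payload_mb = REQ_PER_SEC * NODE_KB / 1000

# Memory actually committed by the Ruby API workers.
per_server_gb = WORKERS_PER_SERVER * WORKER_MB / 1000
fleet_gb = SERVERS * per_server_gb

print(f"payload per second: {payload_mb:.2f} MB")
print(f"per server:         {per_server_gb:.2f} GB")
print(f"whole fleet:        {fleet_gb:.1f} GB")
print(f"fleet / payload:    {fleet_gb * 1000 / payload_mb:.0f}x")
```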
Where is the Ruby API spending time?
Slide 43
DB calls?
Slide 44
JSON parsing/rendering?
Slide 45
Crypto?
Slide 46
Garbage Collection?
Slide 47
Garbage Collection!
Slide 48
>40% CPU in GC
Slide 49
CPU Usage on Chef Server
Slide 50
Slide 51
Frequent GET/PUT of node JSON
Slide 52
compaction
Slide 53
No concurrency accessing a single database (until recently)
Slide 54
Motivation: Why not CouchDB?
Database replication unreliable for 1000s of databases.
Slide 55
Motivation: Why not CouchDB?
File handle and memory resource leaks
Slide 56
Motivation: Why not CouchDB?
It became an operations “thing”
Slide 57
What we need in a data store
• Happy with write-heavy load
• Support for sophisticated queries
• Able to run HA
Slide 58
Did you consider NoSQL database X?
Slide 59
Yes, but we also asked:
Why not SQL?
Slide 60
Measure!
basho_bench
Slide 61
So we replaced CouchDB with MySQL
Slide 62
while the system was running
Slide 63
Live Migration:
Starts out easy!
Slide 64
Live Migration:
Starts out easy!
Slide 65
Live Migration in 3 Easy Steps
1. Put org into read-only mode
2. Copy from CouchDB to MySQL
3. Route org to Erchef
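The three steps above can be sketched as a single function. This is a minimal illustration, not the actual migration tool: the dictionaries standing in for CouchDB, MySQL, and the routing table, and names such as `migrate_org` and `read_only_orgs`, are all hypothetical. Lifting read-only mode afterwards is also an assumption that the slide does not state.

```python
# In-memory stand-ins; none of these names come from the talk.
couchdb = {"acme": {"node1": {"run_list": ["role[web]"]}}}
mysql = {}
read_only_orgs = set()
routes = {"acme": "ruby-api"}   # which backend currently serves each org


def migrate_org(org):
    # 1. Put org into read-only mode so the source cannot change mid-copy.
    read_only_orgs.add(org)
    try:
        # 2. Copy from CouchDB to MySQL.
        mysql[org] = dict(couchdb[org])
        # 3. Route org to Erchef.
        routes[org] = "erchef"
    finally:
        # Assumption: writes are re-enabled whether or not the copy worked.
        # On failure the org is still routed to the old backend.
        read_only_orgs.discard(org)


migrate_org("acme")
print(routes["acme"], sorted(mysql["acme"]))
```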
Slide 66
It Gets Harder
Slide 67
Migration Tool
1. Coordinate feature flippers and load balancer config
2. Move batches of orgs through migration
3. Track status of migration and individual orgs
4. Resume after crash
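Items 2 to 4 of this list (batches, status tracking, resuming after a crash) can be illustrated with a small status file that is rewritten after every org. This is a hedged sketch only: the real tool kept its state in a DETS local store according to a later slide, whereas this example uses a JSON file, and `run_batches`, `load_status`, `save_status`, and the status strings are hypothetical names chosen for illustration.

```python
import json
import os
import tempfile


def load_status(path):
    """Return the saved {org: state} map, or an empty one on first run."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def save_status(path, status):
    # Write to a temp file and rename so a crash never leaves a torn file.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(status, f)
    os.replace(tmp, path)


def run_batches(orgs, path, migrate, batch_size=2):
    status = load_status(path)
    # Anything not marked "done" (new or previously failed) is retried.
    pending = [o for o in orgs if status.get(o) != "done"]
    for i in range(0, len(pending), batch_size):
        for org in pending[i:i + batch_size]:
            try:
                migrate(org)
                status[org] = "done"
            except Exception as exc:   # record the failure and keep going
                status[org] = f"failed: {exc}"
            save_status(path, status)
    return status


path = os.path.join(tempfile.mkdtemp(), "migration_status.json")
migrated = []
run_batches(["a", "b", "c"], path, migrated.append)
# A second run, e.g. after a restart, skips orgs already marked done.
run_batches(["a", "b", "c", "d"], path, migrated.append)
```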
Slide 68
Real World Hard
Slide 69
Migration Tool
1. Track inflight write requests
2. Put org into read-only mode
3. Wait for inflight write requests to complete
4. Migrate org data
5. Reconfig/HUP load balancer
6. Handle errors
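Steps 1 to 3 of this list amount to draining writes before the copy starts. One way to picture that, as a sketch under assumptions rather than the tool's real design, is a per-org gate that counts in-flight writes, rejects new ones once read-only mode is set, and blocks until the count reaches zero. The class `OrgGate` and its method names are hypothetical.

```python
import threading


class OrgGate:
    """Tracks in-flight writes for one org and can drain them."""

    def __init__(self):
        self._cond = threading.Condition()
        self._inflight = 0
        self._read_only = False

    def begin_write(self):
        """Return True if the write may proceed, False if org is read-only."""
        with self._cond:
            if self._read_only:
                return False
            self._inflight += 1
            return True

    def end_write(self):
        with self._cond:
            self._inflight -= 1
            self._cond.notify_all()

    def set_read_only_and_drain(self, timeout=5.0):
        """Reject new writes, then wait for in-flight ones to finish.

        Returns False if the timeout expires with writes still pending.
        """
        with self._cond:
            self._read_only = True
            return self._cond.wait_for(lambda: self._inflight == 0, timeout)


gate = OrgGate()
started = gate.begin_write()                  # a write is in flight
threading.Timer(0.1, gate.end_write).start()  # it finishes shortly after
drained = gate.set_read_only_and_drain()      # blocks until it does
accepted_after = gate.begin_write()           # rejected: org is read-only
```

Only once the drain succeeds would the data copy and the load balancer reconfiguration go ahead; a timeout is one of the errors step 6 would have to handle.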
Slide 70
OTP + gen_fsm =:= Happy Migration Tool
Organization Robustness
state functions ✔
state record ✔ ✔
manager/worker processes ✔ ✔
supervision tree ✔
DETS local store ✔
FREE
REPL
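The "state functions" and "state record" rows describe a pattern that can be shown outside Erlang. The sketch below is a Python analogy only, not gen_fsm and not the migration tool's code: each state is a function that receives a state record and returns the next state function. The state names and the fields of `OrgState` are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OrgState:
    """State record carried between state functions (fields are made up)."""
    org: str
    log: list = field(default_factory=list)


def read_only(s):
    s.log.append("read_only")
    return copy_data


def copy_data(s):
    s.log.append("copy_data")
    return reroute


def reroute(s):
    s.log.append("reroute")
    return None   # terminal state


def run(state_fn, s):
    # Each state function does its work and returns the next state function.
    while state_fn is not None:
        state_fn = state_fn(s)
    return s


result = run(read_only, OrgState("acme"))
print(result.log)
```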
Slide 71
No migration plan survives contact with production
http://en.wikiquote.org/wiki/Helmuth_von_Moltke_the_Elder
Slide 72
Database CPU
CouchDB MySQL
Slide 73
Database Load Average
CouchDB MySQL
Slide 74
API Average Latency
Slide 75
Chef Server Roles Endpoint 90th Percentile Latency
Slide 76
Chef Server Roles Endpoint 90th Percentile Latency
Slide 77
Database Memory
CouchDB MySQL
Slide 78
CouchDB Write Requests
Slide 79
CouchDB Network Traffic
Slide 80
Network traffic on Chef Server
Slide 81
Progress Report
Endpoint       Status
Nodes          Deployed
Search         Deployed
Roles          Coded
Data Bags      Coded
Environments   in-progress
Cookbooks      in-progress
Sandboxes      in-progress