How to Migrate a Web App to Erlang, Change Databases, and Not Have Your Customers Notice
Seth Falcon
Development Lead
@sfalcon

Setup: Chef
• Infrastructure as code
• Describe server config using Ruby DSL
• Client/Server. Your servers run chef-client, talk to Chef server
• It's awesome.

Setup: Chef Server API
• Merb, Ruby, Unicorn, Nginx
• Stateless, horizontally scalable
• Talks to
  • CouchDB
  • authorization service (Erlang)
  • Solr

Typical Chef Server API Request
1. User public key for authentication
2. Node data from CouchDB (median 22K, 3rd Qu. 44K)
3. Authorization check
4. POST, GET, PUT, DELETE

Average Chef Server API Response Times

Slow, Irregular, and Out of Control

CouchDB Uptime

Heavy on system resources

How much RAM should it use?

60 req/sec × 44K = 2.7MB

2.7MB data + code + copies... 27MB?

100MB at rest, after startup

Concurrency? One request per worker.

204 MB per unicorn worker under load

12 workers per server

8 servers

12 × 204 MB = 2.4 GB
8 × 2.4 GB = 19.2 GB
for pulling JSON out of a database and returning it
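The arithmetic on the preceding slides can be checked with a few lines of Erlang. This is only a sketch: the module name, macro names, and function names are made up for illustration, and the inputs are the figures quoted above.

```erlang
-module(ram_estimate).
-export([payload_per_sec_kb/0, per_server_mb/0, fleet_mb/0]).

%% Figures quoted on the slides.
-define(REQ_PER_SEC, 60).
-define(NODE_KB_3RD_QU, 44).       %% 3rd-quartile node JSON size, in K
-define(WORKER_MB, 204).           %% one unicorn worker under load
-define(WORKERS_PER_SERVER, 12).
-define(SERVERS, 8).

%% Node data handled per second: 60 * 44K = 2640K,
%% which the slide rounds to 2.7MB.
payload_per_sec_kb() ->
    ?REQ_PER_SEC * ?NODE_KB_3RD_QU.

%% 12 * 204MB = 2448MB, about 2.4GB per server.
per_server_mb() ->
    ?WORKERS_PER_SERVER * ?WORKER_MB.

%% 8 * 2448MB = 19584MB. The slide rounds the per-server
%% figure to 2.4GB first, which gives 19.2GB.
fleet_mb() ->
    ?SERVERS * per_server_mb().
```

The point of the comparison is the gap between the first number (a few MB of actual data per second) and the last (close to 20GB of resident memory across the fleet).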

Unicorns Eat RAM

Webmachine Tips
1. Don't force application logic into resource module callbacks
2. Sharing resource functions is simple
3. finish_request for logging, metrics, and error cleanup.
4. Use dispatch args for common resource config

Webmachine tip #1 1. Don't force application logic to map to resource module callbacks

1. Parse body in malformed_request 2. halt 404 in forbidden

Webmachine tip #1: Don't force it

forbidden(Req, State) ->
    try
        validate_headers(wrq:req_headers(Req)),
        {false, Req, State}
    catch
        throw:{org_not_found, _Org} ->
            Msg = <<"organization not found">>,
            Req2 = wrq:set_resp_body(Msg, Req),
            {{halt, 404}, Req2, State};
        throw:{json_too_large, _Msg} ->
            Req2 = wrq:set_resp_body(<<"ETOOBIG">>, Req),
            {{halt, 413}, Req2, State};
        throw:Why ->
            Msg = malformed_msg(Why, Req, State),
            NewReq = wrq:set_resp_body(Msg, Req),
            {true, NewReq, State}
    end.

Webmachine tip #2 2. Sharing resource functions is simple (if you share a common state record)

Webmachine tip #2: shared state record

-record(base_state, {
          reqid,
          resource_state
         }).

Webmachine tip #2: helper macro

-export([service_available/2,
         is_authorized/2,
         finish_request/2]).

?gen_wm_function(chef_rest_wm, service_available).
?gen_wm_function(chef_rest_wm, is_authorized).
?gen_wm_function(chef_rest_wm, finish_request).

Webmachine tip #2: helper macro

-define(gen_wm_function(Module, Function),
        Function(Req, #base_state{}=State) ->
            Module:Function(Req, State)).

Webmachine tip #3 3. finish_request for logging, metrics, and error cleanup.

Webmachine tip #3: finish_request

finish_request(Req, #base_state{reqid = ReqId}=State) ->
    try
        Code = wrq:response_code(Req),
        log_request(Req, State),
        stats_hero:report_metrics(ReqId, Code),
        stats_hero:stop_worker(ReqId),
        case Code of
            500 ->
                %% sanitize response body
                Msg = <<"internal service error">>,
                Json = ejson:encode({[{<<"error">>, [Msg]}]}),
                Req1 = wrq:set_resp_header("Content-Type", "application/json", Req),
                {true, wrq:set_resp_body(Json, Req1), State};
            _ ->
                {true, Req, State}
        end
    catch
        X:Y ->
            %% On OTP 21 and later, bind the stacktrace in the pattern (X:Y:Stack) instead.
            error_logger:error_report({X, Y, erlang:get_stacktrace()}),
            %% Still return a valid result so a logging failure cannot break the response.
            {true, Req, State}
    end.

Webmachine tip #4 4. Use dispatch args for common resource config

Webmachine tip #4: config via dispatch

init([]) ->
    {ok, Ip} = application:get_env(chef_rest, ip),
    {ok, Port} = application:get_env(chef_rest, port),
    {ok, Dispatch} = file:consult(filename:join(
                         [filename:dirname(code:which(?MODULE)),
                          "..", "priv", "dispatch.conf"])),
    WebConfig = [{ip, Ip},
                 {port, Port},
                 {log_dir, "priv/log"},
                 {dispatch, add_resource_init(Dispatch)}],
    Web = {webmachine_mochiweb,
           {webmachine_mochiweb, start, [WebConfig]},
           permanent, 5000, worker, dynamic},
    {ok, { {one_for_one, 10, 10}, [Web]} }.

Webmachine tip #4

add_resource_init(Dispatch) ->
    Defaults = default_resource_init(),
    add_resource_init(Dispatch, Defaults, []).

add_resource_init([Rule | Rest], Defaults, Acc) ->
    add_resource_init(Rest, Defaults, [add_init(Rule, Defaults) | Acc]);
add_resource_init([], _Defaults, Acc) ->
    lists:reverse(Acc).

add_init({Route, Guard, Module, Init}, Defaults) ->
    InitParams = Init ++ fetch_custom_init_params(Module, Defaults),
    {Route, Guard, Module, InitParams};
add_init({Route, Module, Init}, Defaults) ->
    InitParams = Init ++ fetch_custom_init_params(Module, Defaults),
    {Route, Module, InitParams}.

fetch_custom_init_params(Module, Defaults) ->
    Exports = proplists:get_value(exports, Module:module_info()),
    case lists:member({fetch_init_params, 1}, Exports) of
        true -> Module:fetch_init_params(Defaults);
        false -> Defaults
    end.
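For the other side of this hook, here is a rough sketch of a resource module that opts in by exporting fetch_init_params/1. The module name and the max_body_size setting are hypothetical and only illustrate the shape; the one part taken from the slide is that the function receives the shared defaults and returns the list that gets appended to the rule's dispatch args.

```erlang
-module(example_resource).   %% hypothetical resource module
-export([fetch_init_params/1, init/1]).

%% Called once per dispatch rule while the dispatch list is built.
%% Receives the shared defaults and returns the params this
%% resource wants appended to its dispatch args.
fetch_init_params(Defaults) ->
    %% max_body_size is a made-up, resource-specific setting.
    [{max_body_size, 1000000} | Defaults].

%% Webmachine calls init/1 with the (augmented) dispatch args.
%% The returned term becomes the resource's state.
init(Config) ->
    MaxSize = proplists:get_value(max_body_size, Config),
    {ok, [{max_body_size, MaxSize}]}.
```

A resource that does not export fetch_init_params/1 simply gets the defaults, so common configuration stays in one place and only the exceptions carry extra code.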

How did we do?

          Erlang   Ruby
idle      19MB     100MB
loaded    75MB     204MB

Erlang   Ruby
600MB    19.2GB

But wait! There's more.

Where is Ruby API spending time?

DB calls?

JSON parsing/rendering?

Crypto?

Garbage Collection?

Garbage Collection!

>40% CPU in GC

CPU Usage on Chef Server

Frequent GET/PUT of node JSON

Compaction

No concurrency accessing a single database (until recently)

Motivation: Why not CouchDB?
Database replication unreliable for 1000s of databases.

Motivation: Why not CouchDB?
File handle and memory resource leaks


What we need in a data store
• Happy with write-heavy load
• Support for sophisticated queries
• Able to run HA
Friday, March 30, 12

Did you consider NoSQL database X?

Yes, but we also asked: Why not SQL?

Measure! basho_bench

So we replaced CouchDB with MySQL

while the system was running

Live Migration: Starts out easy!

Live Migration in 3 Easy Steps
1. Put org into read-only mode
2. Copy from CouchDB to MySQL
3. Route org to Erchef

It Gets Harder

Migration Tool
1. Coordinate feature flippers and load balancer config
2. Move batches of orgs through migration
3. Track status of migration and individual orgs
4. Resume after crash

Real World Hard

Migration Tool
1. Track inflight write requests
2. Put org into read-only mode
3. Wait for inflight write requests to complete
4. Migrate org data
5. Reconfig/HUP load balancer
6. Handle errors
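Steps 1 to 3 amount to a per-org write gate: count writes as they start and finish, stop admitting new ones once the org is read-only, and wait for the count to reach zero. The sketch below shows that logic as pure functions. The module and function names are hypothetical, and in a real system this state would have to live in a process or shared store that the API servers consult, which is not shown here.

```erlang
-module(org_gate).   %% hypothetical name, illustration only
-export([new/0, begin_write/1, end_write/1, set_read_only/1, drained/1]).

%% Gate state is {Mode, InflightWrites}.
new() ->
    {read_write, 0}.

%% Step 1: track a write when it starts, unless the org is read-only.
begin_write({read_write, N}) ->
    {ok, {read_write, N + 1}};
begin_write({read_only, _N} = Gate) ->
    {rejected, Gate}.

%% A write that was admitted earlier has finished.
end_write({Mode, N}) when N > 0 ->
    {Mode, N - 1}.

%% Step 2: stop admitting new writes; inflight ones keep running.
set_read_only({_Mode, N}) ->
    {read_only, N}.

%% Step 3: safe to copy data once read-only and nothing is inflight.
drained({read_only, 0}) -> true;
drained(_Gate) -> false.
```

Counting before flipping to read-only matters: a write admitted just before the flip is still visible in the counter, so the copy cannot start underneath it.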

Scripting with gen_fsm
• Helper methods → states
• Server state and supervision tree make crash recovery easier
• Free REPL
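To make "helper methods → states" concrete, here is a minimal sketch of a per-org migration written as a state machine. It swaps in gen_statem, the successor to gen_fsm in current OTP releases, in state_functions mode, so each step is still one function per state. The state names, the status/1 call, and the do_step/2 placeholder are assumptions for illustration, not the actual migration tool.

```erlang
-module(org_migrator).   %% hypothetical name, illustration only
-behaviour(gen_statem).

-export([start_link/1, status/1]).
-export([init/1, callback_mode/0]).
-export([read_only/3, draining/3, copying/3, routing/3, done/3]).

start_link(Org) ->
    gen_statem:start_link(?MODULE, Org, []).

%% Returns {StateName, CompletedSteps}; handy from the shell.
status(Pid) ->
    gen_statem:call(Pid, status).

callback_mode() ->
    state_functions.

init(Org) ->
    Data = #{org => Org, steps => []},
    {ok, read_only, Data, [{next_event, internal, run}]}.

%% One function per state; each runs its step, then moves on.
read_only(internal, run, Data) -> advance(read_only, draining, Data);
read_only(Type, Msg, Data) -> common(read_only, Type, Msg, Data).

draining(internal, run, Data) -> advance(draining, copying, Data);
draining(Type, Msg, Data) -> common(draining, Type, Msg, Data).

copying(internal, run, Data) -> advance(copying, routing, Data);
copying(Type, Msg, Data) -> common(copying, Type, Msg, Data).

routing(internal, run, Data) -> advance(routing, done, Data);
routing(Type, Msg, Data) -> common(routing, Type, Msg, Data).

done(internal, run, _Data) -> keep_state_and_data;
done(Type, Msg, Data) -> common(done, Type, Msg, Data).

%% Run the step for the current state and schedule the next one.
advance(Step, Next, #{org := Org, steps := Steps} = Data) ->
    ok = do_step(Step, Org),
    {next_state, Next, Data#{steps := [Step | Steps]},
     [{next_event, internal, run}]}.

%% Events handled the same way in every state.
common(State, {call, From}, status, #{steps := Steps}) ->
    {keep_state_and_data, [{reply, From, {State, lists:reverse(Steps)}}]};
common(_State, _Type, _Msg, _Data) ->
    keep_state_and_data.

%% Placeholder: the real work (flip flags, copy data, HUP the
%% load balancer) would go here and could fail or crash.
do_step(_Step, _Org) ->
    ok.
```

From an Erlang shell, `{ok, Pid} = org_migrator:start_link(<<"acme">>)` followed by `org_migrator:status(Pid)` shows how far the org has progressed, which is the "free REPL" benefit in practice. Because a crash in do_step/2 takes the process down, a supervisor can restart it, which is where the supervision tree helps with recovery.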

OTP + gen_fsm =:= Happy Migration Tool

                            Organization   Robustness
state functions                  ✔
state record                     ✔              ✔
manager/worker processes         ✔              ✔
supervision tree                                ✔
DETS local store                                ✔
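The DETS row is what makes "resume after crash" possible: per-org status is written to disk, so a restarted tool can skip orgs that already finished. A minimal sketch using the standard dets module follows; the module name, table name, and status atoms are assumptions for illustration.

```erlang
-module(migration_store).   %% hypothetical name, illustration only
-export([open/1, close/0, mark/2, status/1, pending/1]).

-define(TABLE, migration_status).

%% Open (or create) the on-disk table at the given file path.
open(File) ->
    dets:open_file(?TABLE, [{file, File}, {type, set}]).

close() ->
    dets:close(?TABLE).

%% Record the latest status for an org and flush it to disk,
%% so the record survives a crash right after this call.
mark(Org, Status) ->
    ok = dets:insert(?TABLE, {Org, Status}),
    dets:sync(?TABLE).

status(Org) ->
    case dets:lookup(?TABLE, Org) of
        [{Org, Status}] -> Status;
        [] -> not_started
    end.

%% Orgs that still need work after a restart.
pending(Orgs) ->
    [Org || Org <- Orgs, status(Org) =/= done].
```

After a restart, calling `migration_store:pending/1` with the full org list gives the batch that still has to be migrated, including orgs that were interrupted mid-copy.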

No migration plan survives contact with production

Database CPU: CouchDB vs. MySQL

Database Load Average: CouchDB vs. MySQL

API Average Latency

Chef Server Roles Endpoint 90th Percentile Latency

Chef Server Roles Endpoint 90th Percentile Latency

Database Memory: CouchDB vs. MySQL

CouchDB Write Requests

CouchDB Network Traffic

Network traffic on Chef Server
