Slide 1

Slide 1 text

How to Migrate a Web App to Erlang, Change Databases, and Not Have Your Customers Notice
Seth Falcon, Development Lead, @sfalcon
Friday, March 30, 12

Slide 2

Slide 2 text

Setup: Chef
• Infrastructure as code
• Describe server config using Ruby DSL
• Client/Server. Your servers run chef-client, talk to Chef server
• It’s awesome.

Slide 3

Slide 3 text

Setup: Chef Server API
• Merb, Ruby, Unicorn, Nginx
• Stateless, horizontally scalable
• Talks to
  • CouchDB,
  • authorization service (Erlang),
  • Solr

Slide 4

Slide 4 text

Typical Chef Server API Request
1. User public key for authentication
2. Node data from CouchDB (median 22K, 3rd Qu. 44K)
3. Authorization check
4. POST, GET, PUT, DELETE

Slide 5

Slide 5 text

Average Chef Server API Response Times

Slide 6

Slide 6 text

Slow, Irregular, and Out of Control

Slide 7

Slide 7 text

CouchDB Uptime

Slide 8

Slide 8 text

Heavy on system resources

Slide 9

Slide 9 text

How much RAM should it use?

Slide 10

Slide 10 text

60 req/sec × 44K = 2.7MB

Slide 11

Slide 11 text

2.7MB data + code + copies... 27MB?

Slide 12

Slide 12 text

100MB at rest, after startup

Slide 13

Slide 13 text

Concurrency? One request per worker.

Slide 14

Slide 14 text

204 MB per unicorn worker under load

Slide 15

Slide 15 text

12 workers per server

Slide 16

Slide 16 text

8 servers

Slide 17

Slide 17 text

12 × 204 MB = 2.4 GB
8 × 2.4 GB = 19.2 GB
for pulling JSON out of a database and returning it
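The arithmetic on the preceding slides can be checked with a small sketch. The module and function names are hypothetical and not from the talk. The inputs (60 req/sec, 44K node documents, 204 MB per worker, 12 workers, 8 servers) are the slide figures, and 44K is assumed to mean 44 × 1024 bytes.

```erlang
-module(ram_estimate).
-export([in_flight_bytes/2, fleet_mb/3]).

%% Bytes of node JSON handled during one second of traffic,
%% assuming every request carries one document of DocKiB kibibytes.
in_flight_bytes(ReqPerSec, DocKiB) ->
    ReqPerSec * DocKiB * 1024.

%% Total worker memory in MB for a fleet of identical servers.
fleet_mb(Servers, WorkersPerServer, MbPerWorker) ->
    Servers * WorkersPerServer * MbPerWorker.
```

In an Erlang shell, `ram_estimate:in_flight_bytes(60, 44)` returns 2703360 (about 2.7 MB) and `ram_estimate:fleet_mb(8, 12, 204)` returns 19584. The slide rounds 12 × 204 MB to 2.4 GB before multiplying by 8, which is where the 19.2 GB figure comes from.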

Slide 18

Slide 18 text

Unicorns Eat RAM

Slide 19

Slide 19 text


Slide 20

Slide 20 text


Slide 21

Slide 21 text

Webmachine Tips
1. Don’t force application logic into resource module callbacks
2. Sharing resource functions is simple
3. finish_request for logging, metrics, and error cleanup.
4. Use dispatch args for common resource config

Slide 22

Slide 22 text

Webmachine tip #1
1. Don’t force application logic to map to resource module callbacks

Slide 23

Slide 23 text


Slide 24

Slide 24 text

1. Parse body in malformed_request
2. halt 404 in forbidden

Slide 25

Slide 25 text

Webmachine tip #1: Don’t force it

    forbidden(Req, State) ->
        try
            validate_headers(wrq:req_headers(Req)),
            {false, Req, State}
        catch
            throw:{org_not_found, Org} ->
                Msg = <<"organization not found">>,
                Req2 = wrq:set_resp_body(Msg, Req),
                {{halt, 404}, Req2, State};
            throw:{json_too_large, Msg} ->
                Req2 = wrq:set_resp_body(<<"ETOOBIG">>, Req),
                {{halt, 413}, Req2, State};
            throw:Why ->
                Msg = malformed_msg(Why, Req, State),
                NewReq = wrq:set_resp_body(Msg, Req),
                {true, NewReq, State}
        end.

Slide 26

Slide 26 text

Webmachine tip #2
2. Sharing resource functions is simple (if you share a common state record)

Slide 27

Slide 27 text

Webmachine tip #2: shared state record

    -record(base_state, {
              reqid,
              resource_state
             }).

Slide 28

Slide 28 text

Webmachine tip #2: helper macro

    -export([service_available/2,
             is_authorized/2,
             finish_request/2]).

    ?gen_wm_function(chef_rest_wm, service_available).
    ?gen_wm_function(chef_rest_wm, is_authorized).
    ?gen_wm_function(chef_rest_wm, finish_request).

Slide 29

Slide 29 text

Webmachine tip #2: helper macro

    -define(gen_wm_function(Module, Function),
            Function(Req, #base_state{}=State) ->
                Module:Function(Req, State)).

Slide 30

Slide 30 text

Webmachine tip #3
3. finish_request for logging, metrics, and error cleanup.

Slide 31

Slide 31 text

Webmachine tip #3: finish_request

    finish_request(Req, #base_state{reqid = ReqId}=State) ->
        try
            Code = wrq:response_code(Req),
            log_request(Req, State),
            stats_hero:report_metrics(ReqId, Code),
            stats_hero:stop_worker(ReqId),
            case Code of
                500 ->
                    %% sanitize response body
                    Msg = <<"internal service error">>,
                    Json = ejson:encode({[{<<"error">>, [Msg]}]}),
                    Req1 = wrq:set_resp_header("Content-Type",
                                               "application/json", Req),
                    {true, wrq:set_resp_body(Json, Req1), State};
                _ ->
                    {true, Req, State}
            end
        catch
            X:Y ->
                error_logger:error_report({X, Y, erlang:get_stacktrace()})
        end.

Slide 32

Slide 32 text

Webmachine tip #4
4. Use dispatch args for common resource config

Slide 33

Slide 33 text

Webmachine tip #4: config via dispatch

    init([]) ->
        {ok, Ip} = application:get_env(chef_rest, ip),
        {ok, Port} = application:get_env(chef_rest, port),
        {ok, Dispatch} = file:consult(filename:join(
                             [filename:dirname(code:which(?MODULE)),
                              "..", "priv", "dispatch.conf"])),
        WebConfig = [{ip, Ip},
                     {port, Port},
                     {log_dir, "priv/log"},
                     {dispatch, add_resource_init(Dispatch)}],
        Web = {webmachine_mochiweb,
               {webmachine_mochiweb, start, [WebConfig]},
               permanent, 5000, worker, dynamic},
        {ok, { {one_for_one, 10, 10}, [Web]} }.

Slide 34

Slide 34 text

Webmachine tip #4

    add_resource_init(Dispatch) ->
        Defaults = default_resource_init(),
        add_resource_init(Dispatch, Defaults, []).

    add_resource_init([Rule | Rest], Defaults, Acc) ->
        add_resource_init(Rest, Defaults, [add_init(Rule, Defaults) | Acc]);
    add_resource_init([], _Defaults, Acc) ->
        lists:reverse(Acc).

    add_init({Route, Guard, Module, Init}, Defaults) ->
        InitParams = Init ++ fetch_custom_init_params(Module, Defaults),
        {Route, Guard, Module, InitParams};
    add_init({Route, Module, Init}, Defaults) ->
        InitParams = Init ++ fetch_custom_init_params(Module, Defaults),
        {Route, Module, InitParams}.

    fetch_custom_init_params(Module, Defaults) ->
        Exports = proplists:get_value(exports, Module:module_info()),
        case lists:member({fetch_init_params, 1}, Exports) of
            true -> Module:fetch_init_params(Defaults);
            false -> Defaults
        end.
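To make the effect of this rewrite concrete, here is a minimal sketch. It assumes a simplified variant that appends the same defaults to every rule and skips the per-module fetch_init_params hook. The module name, resource module names, and default values are all hypothetical and exist only for illustration.

```erlang
-module(dispatch_defaults_sketch).
-export([add_defaults/2, demo/0]).

%% Append shared default init args to every dispatch rule.
%% Handles both the 3-tuple and the 4-tuple (guarded) rule shapes.
add_defaults(Dispatch, Defaults) ->
    [add_init(Rule, Defaults) || Rule <- Dispatch].

add_init({Route, Guard, Module, Init}, Defaults) ->
    {Route, Guard, Module, Init ++ Defaults};
add_init({Route, Module, Init}, Defaults) ->
    {Route, Module, Init ++ Defaults}.

%% Hypothetical rules, shaped like the terms a dispatch.conf might hold.
demo() ->
    Dispatch = [{["nodes"], nodes_resource_sketch, []},
                {["roles", role_name], roles_resource_sketch,
                 [{kind, role}]}],
    Defaults = [{request_id_header, "X-Request-Id"}],
    add_defaults(Dispatch, Defaults).
```

Calling `dispatch_defaults_sketch:demo()` returns the same two rules with `{request_id_header, "X-Request-Id"}` appended to each init list, so every resource's init/1 sees the shared config without repeating it in the dispatch file.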

Slide 35

Slide 35 text

How did we do?

Slide 36

Slide 36 text

          Erlang   Ruby
idle      19MB     100MB
loaded    75MB     204MB

Slide 37

Slide 37 text

Erlang    Ruby
600MB     19.2GB

Slide 38

Slide 38 text

But wait! There’s more.

Slide 39

Slide 39 text

Where is the Ruby API spending time?

Slide 40

Slide 40 text

DB calls?

Slide 41

Slide 41 text

JSON parsing/rendering?

Slide 42

Slide 42 text

Crypto?

Slide 43

Slide 43 text

Garbage Collection?

Slide 44

Slide 44 text

Garbage Collection!

Slide 45

Slide 45 text

>40% CPU in GC

Slide 46

Slide 46 text

CPU Usage on Chef Server

Slide 47

Slide 47 text


Slide 48

Slide 48 text

Frequent GET/PUT of node JSON

Slide 49

Slide 49 text

compaction

Slide 50

Slide 50 text

No concurrency accessing a single database (until recently)

Slide 51

Slide 51 text

Motivation: Why not CouchDB?
Database replication unreliable for 1000s of databases.

Slide 52

Slide 52 text

Motivation: Why not CouchDB?
File handle and memory resource leaks

Slide 53

Slide 53 text

Motivation: Why not CouchDB?
It became an operations “thing”

Slide 54

Slide 54 text

What we need in a data store
• Happy with write-heavy load
• Support for sophisticated queries
• Able to run HA

Slide 55

Slide 55 text

Did you consider NoSQL database X?

Slide 56

Slide 56 text

Yes, but we also asked: Why not SQL?

Slide 57

Slide 57 text

Measure! basho_bench

Slide 58

Slide 58 text

So we replaced CouchDB with MySQL

Slide 59

Slide 59 text

while the system was running

Slide 60

Slide 60 text

Live Migration: Starts out easy!

Slide 61

Slide 61 text

Live Migration in 3 Easy Steps
1. Put org into read-only mode
2. Copy from CouchDB to MySQL
3. Route org to Erchef

Slide 62

Slide 62 text

It Gets Harder

Slide 63

Slide 63 text

Migration Tool
1. Coordinate feature flippers and load balancer config
2. Move batches of orgs through migration
3. Track status of migration and individual orgs
4. Resume after crash

Slide 64

Slide 64 text

Real World Hard

Slide 65

Slide 65 text

Migration Tool
1. Track inflight write requests
2. Put org into read-only mode
3. Wait for inflight write requests to complete
4. Migrate org data
5. Reconfig/HUP load balancer
6. Handle errors

Slide 66

Slide 66 text

Scripting with gen_fsm
• Helper methods → states
• Server state and supervision tree make crash recovery easier
• Free REPL
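The "helper methods → states" idea, applied to the six migration tool steps from the previous slide, can be sketched as follows. This is not the talk's gen_fsm code: it is a plain-function approximation with one step/2 clause per state, and the module name, state names, and record fields are hypothetical. The real side effects (flipping feature flags, copying data, HUPing the load balancer) are left out.

```erlang
-module(org_migration_sketch).
-export([run/2, step/2]).

%% Hypothetical per-org migration data.
-record(org, {name, inflight_writes = 0, log = []}).

%% Drive one org through the states; returns the states visited in order.
run(OrgName, InflightWrites) ->
    loop(track_inflight, #org{name = OrgName,
                              inflight_writes = InflightWrites}).

loop(done, Org) ->
    {ok, lists:reverse(Org#org.log)};
loop(State, Org) ->
    try step(State, Org) of
        {Next, Org1} ->
            loop(Next, Org1#org{log = [State | Org1#org.log]})
    catch
        %% Step 6, handle errors: report the failing state and progress so far.
        Class:Reason ->
            {error, {State, Class, Reason}, lists:reverse(Org#org.log)}
    end.

%% One clause per state; each returns the next state and updated data.
step(track_inflight, Org) ->
    {set_read_only, Org};
step(set_read_only, Org) ->
    {wait_for_inflight, Org};
step(wait_for_inflight, #org{inflight_writes = 0} = Org) ->
    {migrate_data, Org};
step(wait_for_inflight, #org{inflight_writes = N} = Org) when N > 0 ->
    %% Pretend one in-flight write finished, then check again.
    {wait_for_inflight, Org#org{inflight_writes = N - 1}};
step(migrate_data, Org) ->
    {reconfig_load_balancer, Org};
step(reconfig_load_balancer, Org) ->
    {done, Org}.
```

From the shell, `org_migration_sketch:run(<<"acme">>, 0)` returns `{ok, [track_inflight, set_read_only, wait_for_inflight, migrate_data, reconfig_load_balancer]}`. Because each state is an ordinary exported function clause, a single step can also be exercised by hand, which is roughly what the "Free REPL" bullet refers to.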

Slide 67

Slide 67 text

OTP + gen_fsm =:= Happy Migration Tool
Organization Robustness
state functions ✔
state record ✔ ✔
manager/worker processes ✔ ✔
supervision tree ✔
DETS local store ✔

Slide 68

Slide 68 text

No migration plan survives contact with production
http://en.wikiquote.org/wiki/Helmuth_von_Moltke_the_Elder

Slide 69

Slide 69 text

Database CPU CouchDB MySQL

Slide 70

Slide 70 text

Database Load Average CouchDB MySQL

Slide 71

Slide 71 text

API Average Latency

Slide 72

Slide 72 text

Chef Server Roles Endpoint 90th Latency

Slide 73

Slide 73 text

Chef Server Roles Endpoint 90th Latency

Slide 74

Slide 74 text

Database Memory CouchDB MySQL

Slide 75

Slide 75 text

CouchDB Write Requests

Slide 76

Slide 76 text

CouchDB Network Traffic

Slide 77

Slide 77 text

Network traffic on Chef Server

Slide 78

Slide 78 text
