How to Migrate a Web App to Erlang, Change Databases, and Not Have Your Customers Notice

In which we present a case study of migrating a high volume web API from Ruby/CouchDB to Erlang/MySQL.

Seth Falcon

March 30, 2012

Transcript

  1. How to Migrate a Web App to Erlang, Change Databases, and Not Have Your Customers Notice. Seth Falcon, Development Lead, @sfalcon. Friday, March 30, 12
  2. Setup: Chef • Infrastructure as code • Describe server config using Ruby DSL • Client/Server. Your servers run chef-client, talk to Chef server • It’s awesome.
  3. Setup: Chef Server API • Merb, Ruby, Unicorn, Nginx • Stateless, horizontally scalable • Talks to CouchDB, authorization service (Erlang), Solr
  4. Typical Chef Server API Request: 1. User public key for authentication 2. Node data from CouchDB (median 22K, 3rd quartile 44K) 3. Authorization check 4. POST, GET, PUT, DELETE
  5. Average Chef Server API Response Times
  6. Slow, Irregular, and Out of Control
  7. CouchDB Uptime
  8. Heavy on system resources
  9. How much RAM should it use?
  10. 60 req/sec × 44K = 2.7MB
  11. 2.7MB data + code + copies... 27MB?
  12. 100MB at rest, after startup
  13. Concurrency? One request per worker.
  14. 204 MB per unicorn worker under load
  15. 12 workers per server
  16. 8 servers
  17. 12 × 204 MB = 2.4 GB; 8 × 2.4 GB = 19.2 GB for pulling JSON out of a database and returning it
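The arithmetic on slides 10 through 17 can be checked with a small sketch. The module and function names below are hypothetical and not from the talk; only the figures (60 req/sec, 44K, 12 workers, 204 MB, 8 servers) come from the slides.

```erlang
-module(ram_math).
-export([payload_kb_per_sec/2, per_server_mb/2, fleet_mb/3]).

%% Data moved per second: request rate times payload size (in K).
payload_kb_per_sec(ReqPerSec, PayloadKb) ->
    ReqPerSec * PayloadKb.

%% Memory on one server: worker count times MB per worker.
per_server_mb(Workers, MbPerWorker) ->
    Workers * MbPerWorker.

%% Memory across all servers.
fleet_mb(Servers, Workers, MbPerWorker) ->
    Servers * per_server_mb(Workers, MbPerWorker).
```

With the slide figures, `ram_math:payload_kb_per_sec(60, 44)` gives 2640 (the slide's roughly 2.7MB), `ram_math:per_server_mb(12, 204)` gives 2448 (about 2.4 GB), and `ram_math:fleet_mb(8, 12, 204)` gives 19584. The slide's 19.2 GB comes from multiplying the already rounded 2.4 GB by 8.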
  18. Unicorns Eat RAM
  19. (no text)
  20. (no text)
  21. Webmachine Tips: 1. Don’t force application logic into resource module callbacks 2. Sharing resource functions is simple 3. finish_request for logging, metrics, and error cleanup. 4. Use dispatch args for common resource config
  22. Webmachine tip #1: 1. Don’t force application logic to map to resource module callbacks
  23. (no text)
  24. 1. Parse body in malformed_request 2. halt 404 in forbidden
  25. Webmachine tip #1: Don’t force it

      forbidden(Req, State) ->
          try
              validate_headers(wrq:req_headers(Req)),
              {false, Req, State}
          catch
              %% unknown org: stop the decision flow with a 404
              throw:{org_not_found, Org} ->
                  Msg = <<"organization not found">>,
                  Req2 = wrq:set_resp_body(Msg, Req),
                  {{halt, 404}, Req2, State};
              %% oversized body: stop with a 413
              throw:{json_too_large, Msg} ->
                  Req2 = wrq:set_resp_body(<<"ETOOBIG">>, Req),
                  {{halt, 413}, Req2, State};
              throw:Why ->
                  Msg = malformed_msg(Why, Req, State),
                  NewReq = wrq:set_resp_body(Msg, Req),
                  {true, NewReq, State}
          end.
  26. Webmachine tip #2: 2. Sharing resource functions is simple (if you share a common state record)
  27. Webmachine tip #2: shared state record

      -record(base_state, {
                reqid,
                resource_state
               }).
  28. Webmachine tip #2: helper macro

      -export([service_available/2,
               is_authorized/2,
               finish_request/2]).

      ?gen_wm_function(chef_rest_wm, service_available).
      ?gen_wm_function(chef_rest_wm, is_authorized).
      ?gen_wm_function(chef_rest_wm, finish_request).
  29. Webmachine tip #2: helper macro

      -define(gen_wm_function(Module, Function),
              Function(Req, #base_state{}=State) ->
                  Module:Function(Req, State)).
  30. Webmachine tip #3: 3. finish_request for logging, metrics, and error cleanup.
  31. Webmachine tip #3: finish_request

      finish_request(Req, #base_state{reqid = ReqId}=State) ->
          try
              Code = wrq:response_code(Req),
              log_request(Req, State),
              stats_hero:report_metrics(ReqId, Code),
              stats_hero:stop_worker(ReqId),
              case Code of
                  500 ->
                      %% sanitize response body
                      Msg = <<"internal service error">>,
                      Json = ejson:encode({[{<<"error">>, [Msg]}]}),
                      Req1 = wrq:set_resp_header("Content-Type",
                                                 "application/json", Req),
                      {true, wrq:set_resp_body(Json, Req1), State};
                  _ ->
                      {true, Req, State}
              end
          catch
              X:Y ->
                  error_logger:error_report({X, Y, erlang:get_stacktrace()}),
                  %% still return a valid callback result after logging
                  {true, Req, State}
          end.
  32. Webmachine tip #4: 4. Use dispatch args for common resource config
  33. Webmachine tip #4: config via dispatch

      init([]) ->
          {ok, Ip} = application:get_env(chef_rest, ip),
          {ok, Port} = application:get_env(chef_rest, port),
          {ok, Dispatch} = file:consult(filename:join(
                               [filename:dirname(code:which(?MODULE)),
                                "..", "priv", "dispatch.conf"])),
          WebConfig = [{ip, Ip},
                       {port, Port},
                       {log_dir, "priv/log"},
                       {dispatch, add_resource_init(Dispatch)}],
          Web = {webmachine_mochiweb,
                 {webmachine_mochiweb, start, [WebConfig]},
                 permanent, 5000, worker, dynamic},
          {ok, {{one_for_one, 10, 10}, [Web]}}.
  34. Webmachine tip #4

      add_resource_init(Dispatch) ->
          Defaults = default_resource_init(),
          add_resource_init(Dispatch, Defaults, []).

      add_resource_init([Rule | Rest], Defaults, Acc) ->
          add_resource_init(Rest, Defaults, [add_init(Rule, Defaults) | Acc]);
      add_resource_init([], _Defaults, Acc) ->
          lists:reverse(Acc).

      add_init({Route, Guard, Module, Init}, Defaults) ->
          InitParams = Init ++ fetch_custom_init_params(Module, Defaults),
          {Route, Guard, Module, InitParams};
      add_init({Route, Module, Init}, Defaults) ->
          InitParams = Init ++ fetch_custom_init_params(Module, Defaults),
          {Route, Module, InitParams}.

      fetch_custom_init_params(Module, Defaults) ->
          Exports = proplists:get_value(exports, Module:module_info()),
          case lists:member({fetch_init_params, 1}, Exports) of
              true -> Module:fetch_init_params(Defaults);
              false -> Defaults
          end.
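To see what the slide's `add_init` logic does to a dispatch rule, here is a minimal, self-contained sketch. It reuses the slide's `fetch_custom_init_params/2` unchanged and only the three-element clause of `add_init/2`. Everything else is an assumption for illustration: the module name `dispatch_demo`, the default keys `batch_size` and `db_type`, the `trace` init arg, and the use of the stdlib module `lists` as a stand-in for a resource that does not export `fetch_init_params/1`.

```erlang
-module(dispatch_demo).
-export([demo/0, fetch_init_params/1]).

%% Hypothetical resource hook: keep only the defaults this resource needs.
fetch_init_params(Defaults) ->
    [P || {Key, _} = P <- Defaults, Key =:= batch_size].

%% Same logic as the slide's add_init/2 for a three-element rule.
add_init({Route, Module, Init}, Defaults) ->
    {Route, Module, Init ++ fetch_custom_init_params(Module, Defaults)}.

%% Copied from the slide: use the module's hook if it exports one,
%% otherwise hand it all of the defaults.
fetch_custom_init_params(Module, Defaults) ->
    Exports = proplists:get_value(exports, Module:module_info()),
    case lists:member({fetch_init_params, 1}, Exports) of
        true -> Module:fetch_init_params(Defaults);
        false -> Defaults
    end.

%% Two made-up rules: one resource with the hook, one without.
demo() ->
    Defaults = [{batch_size, 5}, {db_type, mysql}],
    [add_init({["nodes"], ?MODULE, [{trace, off}]}, Defaults),
     add_init({["roles"], lists, []}, Defaults)].
```

Calling `dispatch_demo:demo()` returns the `"nodes"` rule with `[{trace, off}, {batch_size, 5}]` as its init args and the `"roles"` rule with the full defaults list, which shows how a resource can opt in to a subset of the shared config.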
  35. How did we do?
  36. Erlang vs. Ruby

               Erlang   Ruby
      idle     19MB     100MB
      loaded   75MB     204MB
  37. Erlang: 600MB; Ruby: 19.2GB

  38. But wait! There’s more.
  39. Where is Ruby API spending time?
  40. DB calls?
  41. JSON parsing/rendering?
  42. Crypto?
  43. Garbage Collection?
  44. Garbage Collection!
  45. >40% CPU in GC
  46. CPU Usage on Chef Server
  47. (no text)
  48. Frequent GET/PUT of node JSON
  49. Compaction
  50. No concurrency accessing a single database (until recently)
  51. Motivation: Why not CouchDB? Database replication unreliable for 1000s of databases.
  52. Motivation: Why not CouchDB? File handle and memory resource leaks
  53. Motivation: Why not CouchDB? It became an operations “thing”
  54. What we need in a data store • Happy with write heavy load • Support for sophisticated queries • Able to run HA
  55. Did you consider NoSQL database X?
  56. Yes, but we also asked: Why not SQL?
  57. Measure! basho_bench
  58. So we replaced CouchDB with MySQL
  59. while the system was running
  60. Live Migration: Starts out easy!
  61. Live Migration in 3 Easy Steps: 1. Put org into read-only mode 2. Copy from CouchDB to MySQL 3. Route org to Erchef
  62. It Gets Harder
  63. Migration Tool: 1. Coordinate feature flippers and load balancer config 2. Move batches of orgs through migration 3. Track status of migration and individual orgs 4. Resume after crash
  64. Real World Hard
  65. Migration Tool: 1. Track inflight write requests 2. Put org into read-only mode 3. Wait for inflight write requests to complete 4. Migrate org data 5. Reconfig/HUP load balancer 6. Handle errors
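Steps 1 through 3 on slide 65 hinge on knowing when an org has no writes in flight. The talk does not show how this was tracked, so the following is only a sketch under assumptions: a hypothetical `inflight` module that keeps a per-org counter in a map, incremented when a write starts and decremented when it finishes.

```erlang
-module(inflight).
-export([new/0, start_write/2, finish_write/2, drained/2]).

%% Empty tracker: no org has writes in flight.
new() ->
    #{}.

%% Record that a write request for Org has started.
start_write(Org, Counts) ->
    maps:update_with(Org, fun(N) -> N + 1 end, 1, Counts).

%% Record that a write request for Org has finished.
%% The org's entry is dropped once its count reaches zero.
finish_write(Org, Counts) ->
    case maps:get(Org, Counts, 0) of
        N when N > 1 -> Counts#{Org := N - 1};
        _ -> maps:remove(Org, Counts)
    end.

%% True when Org has no writes in flight and is safe to copy.
drained(Org, Counts) ->
    not maps:is_key(Org, Counts).
```

In this sketch, a migration tool would begin counting first, then put the org into read-only mode so no new writes start, and poll `drained/2` until it returns `true` before copying data.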
  66. Scripting with gen_fsm • Helper methods → states • Server state and supervision tree make crash recovery easier • Free REPL
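The "helper methods → states" idea can be sketched as a state machine in which each migration step is one state function. The talk used gen_fsm; this sketch uses the newer gen_statem behaviour instead, since gen_fsm is deprecated in current OTP releases. The module name `org_migrator`, the state names, and the placeholder steps are assumptions for illustration; the real tool's states are not shown in the slides.

```erlang
-module(org_migrator).
-behaviour(gen_statem).

-export([start_link/1, status/1]).
-export([init/1, callback_mode/0]).
-export([read_only/3, copying/3, routing/3, done/3]).

start_link(Org) ->
    gen_statem:start_link(?MODULE, Org, []).

%% Ask the migrator which state it is in and which steps it has run.
status(Pid) ->
    gen_statem:call(Pid, status).

callback_mode() ->
    state_functions.

%% Start in read_only and immediately trigger the first step.
init(Org) ->
    Data = #{org => Org, steps => []},
    {ok, read_only, Data, [{next_event, internal, run}]}.

%% Step 1: put the org into read-only mode (placeholder).
read_only(internal, run, Data) ->
    advance(copying, read_only, Data).

%% Step 2: copy the org's data from the old store to the new (placeholder).
copying(internal, run, Data) ->
    advance(routing, copied, Data).

%% Step 3: route the org to the new service (placeholder).
routing(internal, run, Data) ->
    {next_state, done, note(routed, Data)}.

%% Terminal state: only answers status calls.
done({call, From}, status, #{steps := Steps}) ->
    {keep_state_and_data, [{reply, From, {done, lists:reverse(Steps)}}]}.

%% Note the finished step, move on, and trigger the next state's work.
advance(Next, Step, Data) ->
    {next_state, Next, note(Step, Data), [{next_event, internal, run}]}.

note(Step, #{steps := Steps} = Data) ->
    Data#{steps := [Step | Steps]}.
```

Because the state data lives in one place and the process can sit under a supervisor, a crashed migrator can be restarted and inspected from the shell, which is the "free REPL" benefit the slide mentions. A real tool would also persist progress (slide 67 lists a DETS local store) so it can resume after a crash.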
  67. OTP + gen_fsm =:= Happy Migration Tool (columns: Organization, Robustness): state functions ✔; state record ✔ ✔; manager/worker processes ✔ ✔; supervision tree ✔; DETS local store ✔
  68. No migration plan survives contact with production http://en.wikiquote.org/wiki/Helmuth_von_Moltke_the_Elder
  69. Database CPU: CouchDB vs. MySQL
  70. Database Load Average: CouchDB vs. MySQL
  71. API Average Latency
  72. Chef Server Roles Endpoint 90th Latency
  73. Chef Server Roles Endpoint 90th Latency
  74. Database Memory: CouchDB vs. MySQL
  75. CouchDB Write Requests
  76. CouchDB Network Traffic
  77. Network traffic on Chef Server
  78. (no text)