How to Migrate a Web App to Erlang, Change Databases, and Not Have Your Customers Notice

In which we present a case study of migrating a high volume web API from Ruby/CouchDB to Erlang/MySQL.

Seth Falcon

March 30, 2012

Transcript

  1. How to Migrate a Web App to Erlang,
    Change Databases, and Not Have Your
    Customers Notice
    Seth Falcon
    Development Lead
    @sfalcon
    Friday, March 30, 12

  2. Setup: Chef
    • Infrastructure as code
    • Describe server config using
    Ruby DSL
    • Client/Server. Your servers run
    chef-client, talk to Chef server
    • It’s awesome.

  3. Setup: Chef Server API
    • Merb, Ruby, Unicorn, Nginx
    • Stateless, horizontally scalable
    • Talks to
    • CouchDB,
    • authorization service (Erlang),
    • Solr

  4. Typical Chef Server API Request
    1. User public key for authentication
    2. Node data from CouchDB (median 22K, 3rd quartile 44K)
    3. Authorization check
    4. POST, GET, PUT, DELETE

  5. Average Chef Server API Response Times

  6. Slow, Irregular, and Out
    of Control

  7. CouchDB Uptime

  8. Heavy on system resources

  9. How much RAM should it
    use?

  10. 60 req/sec × 44K =
    2.7MB

  11. 2.7MB data + code +
    copies...
    27MB?

  12. 100MB
    at rest, after startup

  13. Concurrency?
    One request per worker.

  14. 204 MB
    per unicorn worker
    under load

  15. 12 workers per server

  16. 8 servers

  17. 12 × 204 MB = 2.4 GB
    8 × 2.4 GB =
    19.2 GB
    for pulling JSON out of a database and returning it

  18. Unicorns Eat RAM

  21. Webmachine Tips
    1. Don’t force application logic
    into resource module
    callbacks
    2. Sharing resource functions
    is simple
    3. finish_request for logging,
    metrics, and error cleanup.
    4. Use dispatch args for
    common resource config

  22. Webmachine tip #1
    1. Don’t force application logic
    to map to resource module
    callbacks

  24. 1. Parse body in
    malformed_request
    2. halt 404 in forbidden

  25. Webmachine tip #1: Don’t force it
    forbidden(Req, State) ->
        try
            %% validators signal problems by throwing; success falls through
            validate_headers(wrq:req_headers(Req)),
            {false, Req, State}
        catch
            throw:{org_not_found, _Org} ->
                Msg = <<"organization not found">>,
                Req2 = wrq:set_resp_body(Msg, Req),
                %% {halt, Code} stops processing and responds immediately
                {{halt, 404}, Req2, State};
            throw:{json_too_large, _Msg} ->
                Req2 = wrq:set_resp_body(<<"ETOOBIG">>, Req),
                {{halt, 413}, Req2, State};
            throw:Why ->
                Msg = malformed_msg(Why, Req, State),
                NewReq = wrq:set_resp_body(Msg, Req),
                {true, NewReq, State}
        end.
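
The pattern on this slide (validators that throw, and one callback that maps each thrown term to a result) can be tried without webmachine. The following is a minimal sketch under assumptions: the module name `halt_sketch`, the helper `validate_size/2`, and the size limit are hypothetical and do not come from the talk. Only the `{halt, Code}` convention is webmachine's, and the sketch returns just the first element of the callback tuple.

```erlang
-module(halt_sketch).
-export([check/1, validate_size/2]).

%% Hypothetical validator: throws when the body exceeds Max bytes.
validate_size(Body, Max) when is_binary(Body), is_integer(Max) ->
    case byte_size(Body) > Max of
        true -> throw({json_too_large, byte_size(Body)});
        false -> ok
    end.

%% Run a zero-arity validation fun and translate the outcome into the
%% first element of a webmachine-style callback result.
check(Validate) when is_function(Validate, 0) ->
    try
        Validate(),
        false                                      % nothing thrown: continue
    catch
        throw:{org_not_found, _Org} -> {halt, 404};
        throw:{json_too_large, _Size} -> {halt, 413};
        throw:_Other -> true                       % any other validation failure
    end.
```

For example, `halt_sketch:check(fun() -> halt_sketch:validate_size(Body, 10) end)` yields `false` for a small `Body` and `{halt, 413}` for an oversized one.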

  26. Webmachine tip #2
    2. Sharing resource functions
    is simple (if you share a
    common state record)

  27. Webmachine tip #2: shared state record
    -record(base_state, {
        reqid,
        resource_state
    }).

  28. Webmachine tip #2: helper macro
    -export([service_available/2,
             is_authorized/2,
             finish_request/2]).

    ?gen_wm_function(chef_rest_wm, service_available).
    ?gen_wm_function(chef_rest_wm, is_authorized).
    ?gen_wm_function(chef_rest_wm, finish_request).

  29. Webmachine tip #2: helper macro
    -define(gen_wm_function(Module, Function),
            Function(Req, #base_state{}=State) ->
                Module:Function(Req, State)).

  30. Webmachine tip #3
    3. finish_request for logging,
    metrics, and error cleanup.

  31. Webmachine tip #3: finish_request
    finish_request(Req, #base_state{reqid = ReqId}=State) ->
        try
            Code = wrq:response_code(Req),
            log_request(Req, State),
            stats_hero:report_metrics(ReqId, Code),
            stats_hero:stop_worker(ReqId),
            case Code of
                500 ->
                    %% sanitize response body
                    Msg = <<"internal service error">>,
                    Json = ejson:encode({[{<<"error">>, [Msg]}]}),
                    Req1 = wrq:set_resp_header("Content-Type",
                                               "application/json", Req),
                    {true, wrq:set_resp_body(Json, Req1), State};
                _ ->
                    {true, Req, State}
            end
        catch
            X:Y ->
                %% erlang:get_stacktrace/0 is deprecated since OTP 21;
                %% newer code binds the trace with catch X:Y:Stacktrace
                error_logger:error_report({X, Y,
                                           erlang:get_stacktrace()}),
                %% still return a valid callback result after logging
                {true, Req, State}
        end.

  32. Webmachine tip #4
    4. Use dispatch args for
    common resource config

  33. Webmachine tip #4: config via dispatch
    init([]) ->
        {ok, Ip} = application:get_env(chef_rest, ip),
        {ok, Port} = application:get_env(chef_rest, port),
        {ok, Dispatch} = file:consult(filename:join(
                             [filename:dirname(code:which(?MODULE)),
                              "..", "priv", "dispatch.conf"])),
        WebConfig = [{ip, Ip},
                     {port, Port},
                     {log_dir, "priv/log"},
                     {dispatch, add_resource_init(Dispatch)}],
        Web = {webmachine_mochiweb,
               {webmachine_mochiweb, start, [WebConfig]},
               permanent, 5000, worker, dynamic},
        {ok, { {one_for_one, 10, 10}, [Web]} }.

  34. Webmachine tip #4
    add_resource_init(Dispatch) ->
        Defaults = default_resource_init(),
        add_resource_init(Dispatch, Defaults, []).

    add_resource_init([Rule | Rest], Defaults, Acc) ->
        add_resource_init(Rest, Defaults, [add_init(Rule, Defaults) | Acc]);
    add_resource_init([], _Defaults, Acc) ->
        lists:reverse(Acc).

    add_init({Route, Guard, Module, Init}, Defaults) ->
        InitParams = Init ++ fetch_custom_init_params(Module, Defaults),
        {Route, Guard, Module, InitParams};
    add_init({Route, Module, Init}, Defaults) ->
        InitParams = Init ++ fetch_custom_init_params(Module, Defaults),
        {Route, Module, InitParams}.

    fetch_custom_init_params(Module, Defaults) ->
        Exports = proplists:get_value(exports, Module:module_info()),
        case lists:member({fetch_init_params, 1}, Exports) of
            true -> Module:fetch_init_params(Defaults);
            false -> Defaults
        end.
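
To see what the dispatch rewrite does to concrete rules, the sketch below applies the same idea to two three-tuple rules. It is an illustration under assumptions: the module name `dispatch_sketch`, the routes, and the default keys `db_type` and `batch_size` are hypothetical, and the stdlib module `lists` stands in for a resource module that does not export `fetch_init_params/1`.

```erlang
-module(dispatch_sketch).
-export([demo/0, fetch_init_params/1]).

%% Hypothetical per-resource hook: keep only the defaults this
%% resource cares about.
fetch_init_params(Defaults) ->
    [{K, V} || {K, V} <- Defaults, K =:= db_type].

%% Same idea as add_init/2 on the slide, limited to 3-tuple rules.
add_init({Route, Module, Init}, Defaults) ->
    Exports = Module:module_info(exports),
    Extra = case lists:member({fetch_init_params, 1}, Exports) of
                true -> Module:fetch_init_params(Defaults);
                false -> Defaults
            end,
    {Route, Module, Init ++ Extra}.

demo() ->
    Defaults = [{db_type, mysql}, {batch_size, 5}],
    Rules = [{["nodes"], ?MODULE, []},
             {["status"], lists, [{verbose, true}]}],
    [add_init(Rule, Defaults) || Rule <- Rules].
```

In `demo/0` the first rule receives only the filtered default chosen by its hook, while the second rule, whose module has no hook, receives every default appended to its own init list.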

  35. How did we do?

  36.         Erlang   Ruby
    idle      19MB     100MB
    loaded    75MB     204MB

  37. Erlang   Ruby
      600MB    19.2GB
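
The fleet-wide totals can be reproduced with a few lines of arithmetic. This is a sketch under one stated assumption: the talk does not say how the 600MB figure was derived, and the code assumes one Erlang VM per server (8 × 75MB). The module name `ram_sketch` is hypothetical.

```erlang
-module(ram_sketch).
-export([fleet_mb/3, demo/0]).

%% Total memory in MB for Servers machines, each running
%% ProcsPerServer OS processes of MbPerProc megabytes.
fleet_mb(Servers, ProcsPerServer, MbPerProc) ->
    Servers * ProcsPerServer * MbPerProc.

demo() ->
    Ruby = fleet_mb(8, 12, 204),   % 8 servers x 12 unicorn workers x 204 MB
    Erlang = fleet_mb(8, 1, 75),   % assumption: one Erlang VM per server
    #{ruby_mb => Ruby, erlang_mb => Erlang, ratio => Ruby / Erlang}.
```

The unrounded Ruby total is 19,584 MB; the deck reports 19.2 GB because it rounds the per-server figure to 2.4 GB before multiplying by 8. Under the one-VM-per-server assumption the Erlang total is 600 MB, roughly a 32-fold reduction.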

  38. But wait! There’s more.

  39. Where is Ruby API
    spending time?

  40. DB calls?

  41. JSON parsing/
    rendering?

  42. Crypto?

  43. Garbage Collection?

  44. Garbage Collection!

  45. >40% CPU in GC

  46. CPU Usage on Chef Server

  48. Frequent GET/PUT of
    node JSON

  49. compaction

  50. No concurrency
    accessing a single
    database (until recently)

  51. Motivation: Why not CouchDB?
    Database replication
    unreliable for 1000s of
    databases.

  52. Motivation: Why not CouchDB?
    File handle and memory
    resource leaks

  53. Motivation: Why not CouchDB?
    It became an operations
    “thing”

  54. What we need in a data store
    • Happy with write heavy load
    • Support for sophisticated
    queries
    • Able to run HA

  55. Did you consider NoSQL
    database X?

  56. Yes, but we also asked:
    Why not SQL?

  57. Measure!
    basho_bench
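
basho_bench runs are driven by a file of Erlang terms. The fragment below is a hypothetical configuration sketch, not the one used in the talk: the driver module name `my_sql_node_driver` is invented for illustration (a custom driver would have to be written), the operation names are whatever that driver defines, and all numbers are placeholders apart from the value size, which echoes the 22K median node size from slide 4.

```erlang
%% Hypothetical basho_bench config sketch; all values are placeholders.
{mode, max}.                              % issue requests as fast as possible
{duration, 10}.                           % test length in minutes
{concurrent, 16}.                         % number of worker processes
{driver, my_sql_node_driver}.             % hypothetical custom driver module
{key_generator, {uniform_int, 10000}}.    % keys drawn uniformly from 1..10000
{value_generator, {fixed_bin, 22000}}.    % ~22K values, like the median node
{operations, [{get, 1}, {put, 1}]}.       % equal mix of reads and writes
```

Shifting the weights in `operations` toward `put` would approximate the write-heavy load named on slide 54.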

  58. So we replaced
    CouchDB with MySQL

  59. while the system was running

  60. Live Migration:
    Starts out easy!

  61. Live Migration in 3 Easy Steps
    1. Put org into read-only mode
    2. Copy from CouchDB to MySQL
    3. Route org to Erchef

  62. It Gets Harder

  63. Migration Tool
    1. Coordinate feature flippers and load
    balancer config
    2. Move batches of orgs through migration
    3. Track status of migration and individual
    orgs
    4. Resume after crash

  64. Real World Hard

  65. Migration Tool
    1. Track inflight write requests
    2. Put org into read-only mode
    3. Wait for inflight write requests to complete
    4. Migrate org data
    5. Reconfig/HUP load balancer
    6. Handle errors
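
Steps 1 to 3 above amount to a per-org counter plus a read-only flag. The following is a minimal sketch of that bookkeeping as a pure data structure. It is not the talk's implementation: the module name `inflight_sketch`, the function names, and the map layout are all hypothetical.

```erlang
-module(inflight_sketch).
-export([new/0, begin_write/2, end_write/2,
         set_read_only/2, ready_to_migrate/2]).

%% Tracker :: #{Org => {InflightWrites, ReadOnly}}
new() -> #{}.

%% Admit a write unless the org is already read-only.
begin_write(Org, Tracker) ->
    case maps:get(Org, Tracker, {0, false}) of
        {_Count, true} -> {error, read_only};
        {Count, false} -> {ok, Tracker#{Org => {Count + 1, false}}}
    end.

%% Record that a previously admitted write has completed.
end_write(Org, Tracker) ->
    {Count, ReadOnly} = maps:get(Org, Tracker),
    Tracker#{Org => {Count - 1, ReadOnly}}.

%% Stop admitting new writes; writes already in flight may still finish.
set_read_only(Org, Tracker) ->
    {Count, _} = maps:get(Org, Tracker, {0, false}),
    Tracker#{Org => {Count, true}}.

%% Safe to copy data only when read-only and fully drained.
ready_to_migrate(Org, Tracker) ->
    maps:get(Org, Tracker, {0, false}) =:= {0, true}.
```

The ordering matters: flipping an org to read-only while a write is still in flight leaves `ready_to_migrate/2` false until the matching `end_write/2` arrives, which corresponds to the wait in step 3.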

  66. Scripting with gen_fsm
    • Helper methods → states
    • Server state and supervision tree make
    crash recovery easier
    • Free REPL
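
The "helper methods → states" idea can be illustrated with a plain transition function. This sketch deliberately does not use the gen_fsm behaviour, where each state would instead be a callback function that receives events, and the state and event names are hypothetical, chosen to mirror the migration steps on the previous slide.

```erlang
-module(org_migration_sketch).
-export([next/2, run/1]).

%% Pure transition function: (CurrentState, Event) -> NextState.
%% An event that is not expected in the current state raises function_clause.
next({failed, _Reason} = Failed, _Event) -> Failed;   % failures are sticky
next(_State, {error, Reason}) -> {failed, Reason};
next(pending, start) -> read_only;
next(read_only, writes_drained) -> copying;
next(copying, copy_ok) -> rerouting;
next(rerouting, lb_reloaded) -> done.

%% Fold a list of events over the machine, starting from pending.
run(Events) ->
    lists:foldl(fun(Event, State) -> next(State, Event) end,
                pending, Events).
```

Because each org's progress is a single state term, persisting that term after every transition is enough to resume after a crash, which is the property the slide attributes to server state and the supervision tree.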

  67. OTP + gen_fsm =:= Happy Migration Tool
    Organization Robustness
    state functions ✔
    state record ✔ ✔
    manager/worker processes ✔ ✔
    supervision tree ✔
    DETS local store ✔

  68. No migration plan survives contact
    with production
    http://en.wikiquote.org/wiki/Helmuth_von_Moltke_the_Elder

  69. Database CPU
    CouchDB MySQL

  70. Database Load Average
    CouchDB MySQL

  71. API Average Latency

  72. Chef Server Roles Endpoint 90th Latency

  73. Chef Server Roles Endpoint 90th Latency

  74. Database Memory
    CouchDB MySQL

  75. CouchDB Write Requests

  76. CouchDB Network Traffic

  77. Network traffic on Chef Server
