
How We Moved Hosted Chef to Erlang + MySQL and Why You Didn't Notice

Presented at ChefConf 2012.

Hosted Chef's server API is being ported from Ruby/CouchDB to Erlang/MySQL. Find out what motivated this work, what has been accomplished so far, and why Erlang and an RDBMS are good choices for Chef. We will share metrics comparing the performance and operational characteristics of Ruby/CouchDB and Erlang/MySQL, and discuss the automation used to change data stores in a live, high-volume web service.

Seth Falcon

May 24, 2012

Transcript

  1. How We Moved Hosted Chef to
    Erlang + MySQL
    and Why You Didn’t Notice
    Seth Falcon
    Development Lead
    @sfalcon
    Thursday, May 24, 12

  2. Server

  3. http://www.flickr.com/photos/emzee/
    Hosted Chef
    Private Chef
    Open Chef

  4. Setup: Chef Server API
    • Merb, Ruby, Unicorn, Nginx
    • Stateless, horizontally scalable
    • Talks to:
      • CouchDB
      • authorization service (Erlang)
      • Solr

  5. Typical Chef Server API Request
    1. Public key for authentication
    2. Node data from CouchDB (median 22K, 3rd quartile 44K)
    3. Authorization check
    4. POST, GET, PUT, DELETE

  6. (no text on slide)

  7. (no text on slide)

  8. Average Chef Server API Response Times
    500 ms

  9. CouchDB Uptime

  10. Slow, Irregular, and Out of Control

  11. Heavy on system resources

  12. Heavy on system resources
    Why?

  13. How much RAM should it use?

  14. 60 req/sec × 44K ≈ 2.7MB

  15. 2.7MB data + code + copies...
    27MB?

  16. 100MB
    at rest, after startup

  17. 204 MB per unicorn worker under load

  18. Concurrency?
    One request per worker.

  19. 12 workers per server

  20. 8 servers

  21. 12 × 204 MB = 2.4 GB
    8 × 2.4 GB = 19.2 GB
    for pulling JSON out of a database and returning it
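
The memory arithmetic on slides 14 to 21 can be checked with a few lines of Python. This is only a sketch: the constants are the figures quoted on the slides, and the variable names are made up for illustration. It also suggests that the 19.2 GB total comes from rounding per-server usage to 2.4 GB before multiplying by the server count.

```python
# Figures quoted on the slides; the names are illustrative only.
REQUESTS_PER_SEC = 60
NODE_JSON_KB = 44            # 3rd-quartile node size
WORKER_MB_UNDER_LOAD = 204   # per unicorn worker
WORKERS_PER_SERVER = 12
SERVERS = 8

# Data actually in flight per second: 2640 KB, which the slide quotes as about 2.7MB.
in_flight_kb = REQUESTS_PER_SEC * NODE_JSON_KB

# Unrounded totals.
per_server_mb = WORKER_MB_UNDER_LOAD * WORKERS_PER_SERVER   # 2448 MB
fleet_mb = per_server_mb * SERVERS                          # 19584 MB

# The slide rounds to 2.4 GB per server first, then multiplies by 8.
per_server_gb_rounded = round(per_server_mb / 1000, 1)
fleet_gb_slide = round(per_server_gb_rounded * SERVERS, 1)

print(in_flight_kb, per_server_mb, fleet_mb, per_server_gb_rounded, fleet_gb_slide)
```

Either way the fleet holds roughly 19 to 20 GB of RAM to serve a few megabytes of JSON per second, which is the gap the talk is pointing at.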

  22. Unicorns Eat RAM

  23. Why do unicorns eat RAM?

  24. Ruby MRI Garbage Collector
    Slab allocator

  25. Ruby MRI Garbage Collector
    Mark and sweep

  26. Ruby MRI Garbage Collector
    Mark and sweep
    All objects touched on GC, defeats COW
    Code == objects

  27. Ruby MRI Garbage Collector
    One large request bloats the worker
    more CPU needed for GC

  28. (no text on slide)

  29. light-weight share nothing processes

  30. 2.6 KB per process
    7 microseconds to spawn a process

  31. VM handles concurrency
    Efficient use of multi-core

  32. per-process GC

  33. (no text on slide)

  34. (no text on slide)

  35. Webmachine
    a state machine for HTTP
    you provide the callbacks, it does the REST
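
The idea on slide 35, a fixed HTTP decision flow that calls back into resource code, can be sketched in Python. This is a toy illustration and not Webmachine's API: the three callback names are borrowed from Webmachine, but the `handle` function, the `DECISIONS` list, and the role data are hypothetical, and the real decision graph is far larger.

```python
from typing import Callable, Dict

Request = Dict[str, str]
Callbacks = Dict[str, Callable[[Request], bool]]

# Each decision: (callback name, HTTP status returned when the callback says no).
DECISIONS = [
    ("service_available", 503),
    ("is_authorized", 401),
    ("resource_exists", 404),
]

def handle(request: Request, callbacks: Callbacks) -> int:
    """Walk the fixed decision list; a callback the resource omits defaults to True."""
    for name, failure_status in DECISIONS:
        check = callbacks.get(name, lambda _req: True)
        if not check(request):
            return failure_status
    return 200

# A hypothetical "roles" resource supplies only the callbacks it cares about.
roles = {"a_role"}
role_resource: Callbacks = {
    "is_authorized": lambda req: req.get("user") == "admin",
    "resource_exists": lambda req: req.get("name") in roles,
}

ok_status = handle({"user": "admin", "name": "a_role"}, role_resource)
denied_status = handle({"user": "guest", "name": "a_role"}, role_resource)
missing_status = handle({"user": "admin", "name": "no_such_role"}, role_resource)
print(ok_status, denied_status, missing_status)
```

The resource author writes small predicates and the framework owns the status-code logic, which is the division of labor the slide describes.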

  36. Some hacking happens

  37. {"name": "a_role", "json_class": "Chef::Role"}

  38. How did we do?

  39.         Erlang  Ruby
    idle      19MB    100MB
    loaded    75MB    204MB

  40. Erlang  Ruby
      600MB   19.2GB
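
As a rough cross-check of slides 39 and 40, the sketch below assumes that the 600MB figure is one loaded Erlang VM (75MB) on each of the same 8 servers. The slides do not state this explicitly, so treat the derivation as a guess; the variable names are illustrative.

```python
# Per-process figures from slide 39 and the server count from slide 20.
ERLANG_LOADED_MB = 75
SERVERS = 8                     # assumption: one Erlang VM per existing server

erlang_fleet_mb = ERLANG_LOADED_MB * SERVERS   # 600 MB, matching slide 40
ruby_fleet_mb = 19200                          # 19.2 GB from slide 21

# How many times more RAM the Ruby fleet needs under this assumption.
ratio = ruby_fleet_mb / erlang_fleet_mb

print(erlang_fleet_mb, ratio)
```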

  41. But wait! There’s more.

  42. Where is Ruby API spending time?

  43. DB calls?

  44. JSON parsing/rendering?

  45. Crypto?

  46. Garbage Collection?

  47. Garbage Collection!

  48. >40% CPU in GC

  49. CPU Usage on Chef Server

  50. (no text on slide)

  51. Frequent GET/PUT of node JSON

  52. compaction

  53. No concurrency accessing a single database (until recently)

  54. Motivation: Why not CouchDB?
    Database replication unreliable for 1000s of databases.

  55. Motivation: Why not CouchDB?
    File handle and memory resource leaks

  56. Motivation: Why not CouchDB?
    It became an operations “thing”

  57. What we need in a data store
    • Happy with write-heavy load
    • Support for sophisticated queries
    • Able to run HA

  58. Did you consider NoSQL database X?

  59. Yes, but we also asked:
    Why not SQL?

  60. Measure!
    basho_bench

  61. So we replaced CouchDB with MySQL

  62. while the system was running

  63. Live Migration:
    Starts out easy!

  64. Live Migration:
    Starts out easy!

  65. Live Migration in 3 Easy Steps
    1. Put org into read-only mode
    2. Copy from CouchDB to MySQL
    3. Route org to Erchef

  66. It Gets Harder

  67. Migration Tool
    1. Coordinate feature flippers and load balancer config
    2. Move batches of orgs through migration
    3. Track status of migration and individual orgs
    4. Resume after crash
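
The batching, status tracking, and resume-after-crash duties listed on slide 67 could look roughly like this in Python. It is a hypothetical sketch, not the actual tool (which slide 70 says was built on OTP with a DETS local store): a JSON file stands in for the durable store, and `run_migration`, `migrate_org`, and the org names are invented for illustration. Coordinating feature flags and load balancer config (item 1) is left out.

```python
import json
import os
import tempfile

def load_status(path):
    """Return the saved org -> status map, or an empty one on first run."""
    if not os.path.exists(path):
        return {}
    with open(path) as fh:
        return json.load(fh)

def save_status(path, status):
    """Write to a temp file, then rename, so a crash mid-write keeps the old file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(status, fh)
    os.replace(tmp, path)

def run_migration(orgs, path, migrate_org, batch_size=2):
    """Migrate orgs in batches, skipping any already recorded as done."""
    status = load_status(path)
    pending = [org for org in orgs if status.get(org) != "done"]
    for start in range(0, len(pending), batch_size):
        for org in pending[start:start + batch_size]:
            try:
                migrate_org(org)
                status[org] = "done"
            except Exception as exc:
                status[org] = "failed: %s" % exc
            save_status(path, status)  # persist after every org
    return status

# Demo: the first run fails on one org; the second run resumes with only that org.
orgs = ["org_a", "org_b", "org_c"]
path = os.path.join(tempfile.mkdtemp(), "migration_status.json")

def flaky(org):
    if org == "org_b":
        raise RuntimeError("copy failed")

first = run_migration(orgs, path, flaky)
retried = []
second = run_migration(orgs, path, retried.append)
print(first, retried, second)
```

Persisting after every org keeps the cost of a crash to at most one repeated org, at the price of more writes to the status store.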

  68. Real World Hard

  69. Migration Tool
    1. Track inflight write requests
    2. Put org into read-only mode
    3. Wait for inflight write requests to complete
    4. Migrate org data
    5. Reconfig/HUP load balancer
    6. Handle errors
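
The per-org sequence on slide 69 can be sketched as a small state machine. The real tool used gen_fsm according to slide 70; this Python class, its method names, the polling callback, and the choice to reopen the org for writes on failure are all assumptions made for illustration.

```python
from enum import Enum

class State(Enum):
    LIVE = "live"
    READ_ONLY = "read_only"
    MIGRATED = "migrated"

class OrgMigration:
    def __init__(self, name):
        self.name = name
        self.state = State.LIVE
        self.inflight_writes = 0
        self.errors = []

    def begin_write(self):
        """Step 1: admit and count a write only while the org is live."""
        if self.state is not State.LIVE:
            return False
        self.inflight_writes += 1
        return True

    def end_write(self):
        self.inflight_writes -= 1

    def migrate(self, copy_data, reroute, poll, max_polls=10):
        self.state = State.READ_ONLY                  # step 2: refuse new writes
        try:
            polls = 0
            while self.inflight_writes > 0:           # step 3: drain old writes
                if polls >= max_polls:
                    raise TimeoutError("in-flight writes did not drain")
                poll(self)
                polls += 1
            copy_data(self.name)                      # step 4: migrate org data
            reroute(self.name)                        # step 5: reconfig load balancer
            self.state = State.MIGRATED
        except Exception as exc:                      # step 6: handle errors
            self.errors.append(str(exc))
            self.state = State.LIVE  # assumption: fall back to the old stack
        return self.state

# Demo: one write is in flight when the migration starts.
events = []
org = OrgMigration("acme")
org.begin_write()
result = org.migrate(
    copy_data=lambda name: events.append(("copy", name)),
    reroute=lambda name: events.append(("reroute", name)),
    poll=lambda o: o.end_write(),  # pretend the write finishes while we wait
)
accepts_writes_after = org.begin_write()
print(result, events, accepts_writes_after)
```

Entering read-only mode before waiting matters: it guarantees the in-flight counter can only fall, so the copy in step 4 sees a stable snapshot.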

  70. OTP + gen_fsm =:= Happy Migration Tool
    Organization Robustness
    state functions ✔
    state record ✔ ✔
    manager/worker processes ✔ ✔
    supervision tree ✔
    DETS local store ✔
    FREE REPL

  71. No migration plan survives contact with production
    http://en.wikiquote.org/wiki/Helmuth_von_Moltke_the_Elder

  72. Database CPU
    CouchDB MySQL

  73. Database Load Average
    CouchDB MySQL

  74. API Average Latency

  75. Chef Server Roles Endpoint 90th Percentile Latency

  76. Chef Server Roles Endpoint 90th Percentile Latency

  77. Database Memory
    CouchDB MySQL

  78. CouchDB Write Requests

  79. CouchDB Network Traffic

  80. Network traffic on Chef Server

  81. Progress Report
    Endpoint      Status
    Nodes         Deployed
    Search        Deployed
    Roles         Coded
    Data Bags     Coded
    Environments  in-progress
    Cookbooks     in-progress
    Sandboxes     in-progress

  82. (no text on slide)