How We Moved Hosted Chef to Erlang + MySQL and Why You Didn't Notice

Presented at ChefConf 2012.

Hosted Chef's server API is being ported from Ruby/CouchDB to Erlang/MySQL. Find out what motivated this work, what's been accomplished so far, and why Erlang and an RDBMS are good choices for Chef. We will share metrics comparing the performance and operational characteristics of Ruby/CouchDB with those of Erlang/MySQL and discuss the automation used to change data stores in a live, high-volume web service.

Seth Falcon

May 24, 2012

Transcript

  1. How We Moved Hosted Chef to Erlang + MySQL and Why You Didn’t Notice. Seth Falcon, Development Lead, @sfalcon. 1 Thursday, May 24, 12
  2. Server
  3. Hosted Chef, Private Chef, Open Chef. http://www.flickr.com/photos/emzee/
  4. Setup: Chef Server API
     • Merb, Ruby, Unicorn, Nginx
     • Stateless, horizontally scalable
     • Talks to
       • CouchDB
       • authorization service (Erlang)
       • Solr
  5. Typical Chef Server API Request
     1. Public key for authentication
     2. Node data from CouchDB (median 22K, 3rd quartile 44K)
     3. Authorization check
     4. POST, GET, PUT, DELETE
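
The four steps on slide 5 can be sketched as a small request pipeline. This is only an illustration under stated assumptions: the in-memory dictionaries stand in for the key store, CouchDB, and the authorization service, and the names `Request`, `handle`, `PUBLIC_KEYS`, `NODES`, and `ACL` are hypothetical, not part of the Chef Server API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    method: str         # one of POST, GET, PUT, DELETE
    user: str
    node: str
    signature_ok: bool  # stand-in for verifying the signature with the user's public key


# Hypothetical in-memory stand-ins for the key store, CouchDB, and the authorization service.
PUBLIC_KEYS = {"alice": "alice-public-key"}
NODES = {"web1": {"name": "web1", "run_list": ["role[web]"]}}
ACL = {("alice", "web1"): {"GET", "PUT"}}


def handle(req):
    """Return (status, body) after walking the four steps in order."""
    # 1. Authentication: look up the public key and check the signature.
    if req.user not in PUBLIC_KEYS or not req.signature_ok:
        return 401, None
    # 2. Fetch the node data from the data store.
    node = NODES.get(req.node)
    if node is None:
        return 404, None
    # 3. Authorization check for this user, object, and method.
    if req.method not in ACL.get((req.user, req.node), set()):
        return 403, None
    # 4. Perform the operation. Only GET and DELETE are modeled here.
    if req.method == "GET":
        return 200, node
    if req.method == "DELETE":
        del NODES[req.node]
        return 200, node
    return 501, None  # POST and PUT bodies are omitted in this sketch
```

Every request pays for steps 1 to 3 before any real work happens, which is why the size of the node document fetched in step 2 matters for memory use.
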
  8. Average Chef Server API Response Times 500 ms
  9. CouchDB Uptime
  10. Slow, Irregular, and Out of Control
  11. Heavy on system resources
  12. Heavy on system resources. Why?
  13. How much RAM should it use?
  14. 60 req/sec × 44K = 2.7MB
  15. 2.7MB data + code + copies... 27MB?
  16. 100MB at rest, after startup
  17. 204 MB per unicorn worker under load
  18. Concurrency? One request per worker.
  19. 12 workers per server
  20. 8 servers
  21. 12 × 204 MB = 2.4 GB; 8 × 2.4 GB = 19.2 GB for pulling JSON out of a database and returning it
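
The arithmetic on slides 14 to 21 can be reproduced in a few lines. This is a minimal sketch that assumes decimal units (1 MB = 1000 KB, 1 GB = 1000 MB); the variable names are illustrative. Computed this way, the in-flight data comes to 2.64 MB, which the slide quotes as 2.7MB, and the fleet total is 19.2 GB only if the per-server figure is rounded to 2.4 GB first.

```python
# Figures taken from the slides.
REQS_PER_SEC = 60
NODE_KB = 44             # 3rd-quartile node size
WORKER_MB = 204          # one unicorn worker under load
WORKERS_PER_SERVER = 12
SERVERS = 8

# Data in flight if one second of requests is held in memory at once.
in_flight_mb = REQS_PER_SEC * NODE_KB / 1000

# Memory actually used by the Ruby workers.
per_server_gb = WORKERS_PER_SERVER * WORKER_MB / 1000
fleet_gb_exact = SERVERS * per_server_gb
fleet_gb_slide = SERVERS * round(per_server_gb, 1)  # the slide rounds per-server first

print(f"in flight:   {in_flight_mb:.2f} MB")
print(f"per server:  {per_server_gb:.3f} GB")
print(f"fleet:       {fleet_gb_exact:.3f} GB exact, {fleet_gb_slide:.1f} GB as on the slide")
```

Either way, the gap between a few megabytes of payload and roughly 19 GB of resident memory is the point the slides are making.
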
  22. Unicorns Eat RAM
  23. Why do unicorns eat RAM?
  24. Ruby MRI Garbage Collector: Slab allocator
  25. Ruby MRI Garbage Collector: Mark and sweep
  26. Ruby MRI Garbage Collector: Mark and sweep. All objects touched on GC, defeats COW. Code == objects
  27. Ruby MRI Garbage Collector: One large request bloats the worker; more CPU needed for GC
  29. light-weight share nothing processes
  30. 2.6 KB per process; 7 microseconds to spawn a process
  31. VM handles concurrency. Efficient use of multi-core
  32. per-process GC
  35. Webmachine: a state machine for HTTP. You provide the callbacks, it does the REST
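
The idea on slide 35 (a fixed HTTP decision flow that calls back into your resource) can be illustrated in Python. This is a sketch of the pattern only, not Webmachine's API: Webmachine is an Erlang library with a much larger decision graph, and although the callback names below are loosely modeled on its resource callbacks, `Resource`, `dispatch`, and `RoleResource` are hypothetical.

```python
import json


class Resource:
    """Default callbacks; a concrete resource overrides only what it needs."""

    def service_available(self, req):
        return True

    def allowed_methods(self, req):
        return {"GET", "HEAD"}

    def is_authorized(self, req):
        return True

    def forbidden(self, req):
        return False

    def resource_exists(self, req):
        return True

    def to_json(self, req):
        return "{}"


def dispatch(resource, req):
    """Walk a fixed decision sequence, mapping callback answers to status codes."""
    if not resource.service_available(req):
        return 503, ""
    if req["method"] not in resource.allowed_methods(req):
        return 405, ""
    if not resource.is_authorized(req):
        return 401, ""
    if resource.forbidden(req):
        return 403, ""
    if not resource.resource_exists(req):
        return 404, ""
    return 200, resource.to_json(req)


class RoleResource(Resource):
    """Hypothetical roles endpoint backed by an in-memory dict."""

    ROLES = {"a_role": {"name": "a_role", "json_class": "Chef::Role"}}

    def resource_exists(self, req):
        return req["name"] in self.ROLES

    def to_json(self, req):
        return json.dumps(self.ROLES[req["name"]])
```

The resource never picks a status code itself. It answers yes/no questions, and the framework decides what that means in HTTP terms.
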
  36. Some hacking happens
  37. {"name": "a_role", "json_class": "Chef::Role"}
  38. How did we do?
  39.          Erlang   Ruby
      idle     19MB     100MB
      loaded   75MB     204MB
  40. Erlang: 600MB; Ruby: 19.2GB
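
The memory figures on slides 39 and 40 can be turned into ratios. This is a sketch under one assumption the slides do not state: that the 600MB Erlang total corresponds to one Erlang VM per server across the same 8 servers (8 × 75MB). Decimal units are assumed, and the variable names are illustrative.

```python
# Per-instance memory from slide 39, in MB.
ERLANG_MB = {"idle": 19, "loaded": 75}
RUBY_MB = {"idle": 100, "loaded": 204}

# How many times more memory one Ruby worker uses than one Erlang VM.
per_instance_ratio = {
    state: RUBY_MB[state] / ERLANG_MB[state] for state in ("idle", "loaded")
}

# Fleet totals from slide 40, in MB.
ERLANG_TOTAL_MB = 600
RUBY_TOTAL_MB = 19200
total_ratio = RUBY_TOTAL_MB / ERLANG_TOTAL_MB

# Assumption (not stated on the slides): one Erlang VM on each of 8 servers.
SERVERS = 8
assumed_erlang_total_mb = SERVERS * ERLANG_MB["loaded"]

for state, ratio in per_instance_ratio.items():
    print(f"{state}: Ruby uses {ratio:.2f}x the memory of Erlang")
print(f"fleet: Ruby uses {total_ratio:.0f}x the memory of Erlang")
```

The fleet-wide ratio is much larger than the per-instance ratio because each Ruby server needs 12 workers to get concurrency, while a single Erlang VM handles concurrent requests itself.
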

  41. But wait! There’s more.
  42. Where is Ruby API spending time?
  43. DB calls?
  44. JSON parsing/rendering?
  45. Crypto?
  46. Garbage Collection?
  47. Garbage Collection!
  48. >40% CPU in GC
  49. CPU Usage on Chef Server
  51. Frequent GET/PUT of node JSON
  52. compaction
  53. No concurrency accessing a single database (until recently)
  54. Motivation: Why not CouchDB? Database replication unreliable for 1000s of databases.
  55. Motivation: Why not CouchDB? File handle and memory resource leaks
  56. Motivation: Why not CouchDB? It became an operations “thing”
  57. What we need in a data store
      • Happy with write-heavy load
      • Support for sophisticated queries
      • Able to run HA
  58. Did you consider NoSQL database X?
  59. Yes, but we also asked: Why not SQL?
  60. Measure! basho_bench
  61. So we replaced CouchDB with MySQL
  62. while the system was running
  63. Live Migration: Starts out easy!
  64. Live Migration: Starts out easy!
  65. Live Migration in 3 Easy Steps
      1. Put org into read-only mode
      2. Copy from CouchDB to MySQL
      3. Route org to Erchef
  66. It Gets Harder
  67. Migration Tool
      1. Coordinate feature flippers and load balancer config
      2. Move batches of orgs through migration
      3. Track status of migration and individual orgs
      4. Resume after crash
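
Points 2 to 4 on slide 67 (batches, per-org status, resume after crash) can be sketched as a loop that checkpoints after every org. This is an illustration, not the tool described in the talk: a JSON file stands in for whatever local store the real tool used, and `migrate_in_batches`, `migrate_org`, and `after_batch` are hypothetical names.

```python
import json
import os


def load_status(path):
    """Read the per-org status map, or start empty on the first run."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def save_status(path, status):
    """Write to a temp file and rename, so a crash never leaves a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(status, f)
    os.replace(tmp, path)


def migrate_in_batches(orgs, migrate_org, path, batch_size=2, after_batch=None):
    """Migrate orgs in batches, skipping any org already marked done."""
    status = load_status(path)
    pending = [org for org in orgs if status.get(org) != "done"]
    for i in range(0, len(pending), batch_size):
        batch = pending[i:i + batch_size]
        for org in batch:
            try:
                migrate_org(org)
                status[org] = "done"
            except Exception as exc:
                status[org] = "failed: %s" % exc
            save_status(path, status)  # checkpoint after every org
        if after_batch is not None:
            after_batch(batch)  # e.g. one load balancer reconfig per batch
    return status
```

Running the function again with the same status file retries only the orgs that are not marked done, which is the resume behaviour the slide asks for.
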
  68. Real World Hard
  69. Migration Tool
      1. Track inflight write requests
      2. Put org into read-only mode
      3. Wait for inflight write requests to complete
      4. Migrate org data
      5. Reconfig/HUP load balancer
      6. Handle errors
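
The six steps on slide 69 map naturally onto a small state machine per org. The sketch below is a Python illustration of that shape, not the Erlang tool from the talk. The state names, the `OrgMigration` class, and the `copy_data` and `reload_lb` callbacks are all hypothetical, and the choice to re-enable writes on failure is an assumption made for this example.

```python
from enum import Enum, auto


class State(Enum):
    SERVING = auto()    # normal operation on the old backend
    DRAINING = auto()   # read-only, waiting for inflight writes
    MIGRATING = auto()  # copying data and switching routing
    MIGRATED = auto()
    FAILED = auto()


class OrgMigration:
    def __init__(self, org, copy_data, reload_lb):
        self.org = org
        self.state = State.SERVING
        self.inflight_writes = 0   # step 1: track inflight write requests
        self.read_only = False
        self._copy_data = copy_data
        self._reload_lb = reload_lb

    def write_started(self):
        """Called before a write; returns False if the org is read-only."""
        if self.read_only:
            return False
        self.inflight_writes += 1
        return True

    def write_finished(self):
        self.inflight_writes -= 1
        # Step 3: the last inflight write completing unblocks the migration.
        if self.state is State.DRAINING and self.inflight_writes == 0:
            self._migrate()

    def start(self):
        # Step 2: put the org into read-only mode.
        self.read_only = True
        self.state = State.DRAINING
        if self.inflight_writes == 0:
            self._migrate()

    def _migrate(self):
        self.state = State.MIGRATING
        try:
            self._copy_data(self.org)   # step 4: migrate org data
            self._reload_lb(self.org)   # step 5: reconfig/HUP load balancer
            self.state = State.MIGRATED
        except Exception:
            # Step 6 (assumed policy): keep serving from the old backend;
            # a retry would have to copy the data again.
            self.read_only = False
            self.state = State.FAILED
```

Keeping the inflight counter and the read-only flag in one place is what makes step 3 safe: no write can start after the flag is set, so the counter can only go down.
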
  70. OTP + gen_fsm =:= Happy Migration Tool (columns: Organization, Robustness)
      • state functions ✔
      • state record ✔ ✔
      • manager/worker processes ✔ ✔
      • supervision tree ✔
      • DETS local store ✔
      • FREE REPL
  71. No migration plan survives contact with production. http://en.wikiquote.org/wiki/Helmuth_von_Moltke_the_Elder
  72. Database CPU: CouchDB, MySQL
  73. Database Load Average: CouchDB, MySQL
  74. API Average Latency
  75. Chef Server Roles Endpoint 90th Latency
  76. Chef Server Roles Endpoint 90th Latency
  77. Database Memory: CouchDB, MySQL
  78. CouchDB Write Requests
  79. CouchDB Network Traffic
  80. Network traffic on Chef Server
  81. Progress Report
      Endpoint       Status
      Nodes          Deployed
      Search         Deployed
      Roles          Coded
      Data Bags      Coded
      Environments   in-progress
      Cookbooks      in-progress
      Sandboxes      in-progress