Bloated Chefs: A Tale of Gluttony and the Path to Enlightenment

As your infrastructure grows and your recipes become more complex, you may suddenly find yourself with chef-client runs that take on the order of minutes to complete. With Chef being the primary mechanism for pushing critical fixes in many orgs, the time it takes for the fleet to converge is of the utmost importance. Slow chef-client runs hurt both agility and your ability to scale, so keeping them lean can have a large impact on your operational capacity. You will hear tales of chef-client run duration horror, and how we at PagerDuty brought our chef-client runs back to the land of ponies and rainbows.


Evan Gilman

April 02, 2015

Transcript

  1. 4/3/15 @evan2645 EVAN GILMAN Bloated Chefs A Tale of Gluttony, and the Path to Enlightenment

  2. 4/3/15 BLOATED CHEFS

  3. 4/3/15 Agenda BLOATED CHEFS 1. Chef resources in use at PD 2. Problems encountered as we grew 3. Measuring chef-client run 4. How we fixed it 5. How fast is it now?

  4. 4/3/15 BLOATED CHEFS CHEF @ PAGERDUTY

  5. 4/3/15 BLOATED CHEFS PD CHEF RESOURCES

  6. 4/3/15 pd_iptables BLOATED CHEFS

  7. 4/3/15 pd-ipsec::policies BLOATED CHEFS

  8. 4/3/15 sumo_source BLOATED CHEFS

  9. 4/3/15 pd_datadog_alert BLOATED CHEFS

  10. 4/3/15 BLOATED CHEFS

  11. 4/3/15 BLOATED CHEFS ALL WAS NOT WELL

  12. 4/3/15 As we grew… BLOATED CHEFS

  13. 4/3/15 As we grew… BLOATED CHEFS • CPU spikes during chef-client runs

  14. 4/3/15 As we grew… BLOATED CHEFS • CPU spikes during chef-client runs • Awkward pauses at the beginning of the run

  15. 4/3/15 As we grew… BLOATED CHEFS • CPU spikes during chef-client runs • Awkward pauses at the beginning of the run • chef-client run took several minutes

  16. 4/3/15 As we grew… BLOATED CHEFS • CPU spikes during chef-client runs • Awkward pauses at the beginning of the run • chef-client run took several minutes • chef-client OOM

  17. 4/3/15 As we grew… BLOATED CHEFS • CPU spikes during chef-client runs • Awkward pauses at the beginning of the run • chef-client run took several minutes • chef-client OOM

  18. 4/3/15 BLOATED CHEFS

  19. 4/3/15 BLOATED CHEFS

  20. 4/3/15 BLOATED CHEFS MEASURING

  21. 4/3/15 Measuring Run Time BLOATED CHEFS

  22. 4/3/15 Measuring Run Time BLOATED CHEFS https://github.com/joemiller/chef-handler-profiler

  23. 4/3/15 Measuring Resources BLOATED CHEFS • Total number of resources per run, by type • Number of updated resources per run, by type
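
Slides 21–23 point at joemiller's chef-handler-profiler for per-resource timing and describe the resource-count breakdown, but the handler code itself isn't in the transcript. As a rough sketch of the pattern, here is a minimal Chef report handler that records elapsed time and resource counts by type — the class name, metric names, and statsd transport are assumptions, not PagerDuty's implementation.

```ruby
# Hypothetical report handler, not code from the deck: records total run time
# and resource counts per run, by type. Register it from client.rb, e.g.:
#   require '/etc/chef/handlers/run_metrics'
#   report_handlers << RunMetricsHandler.new
require 'chef/handler'
require 'statsd'   # statsd-ruby gem; the transport here is an assumption

class RunMetricsHandler < Chef::Handler
  def report
    statsd = Statsd.new('localhost', 8125)

    # Total wall-clock time of the run, in milliseconds.
    statsd.timing('chef.run.elapsed_ms', (run_status.elapsed_time * 1000).round)

    # Total number of resources in the run, by type (package, template, ...).
    run_status.all_resources.group_by(&:resource_name).each do |type, resources|
      statsd.gauge("chef.resources.total.#{type}", resources.size)
    end

    # Number of resources that actually updated, by type.
    run_status.updated_resources.group_by(&:resource_name).each do |type, resources|
      statsd.gauge("chef.resources.updated.#{type}", resources.size)
    end
  end
end
```
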
  24. 4/3/15 Measuring Memory BLOATED CHEFS • Gather proc stats with sys-proctable • Gather GC stats • Can be emitted as statsd

  25. 4/3/15 Measuring Memory BLOATED CHEFS
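
Slide 24 lists sys-proctable for process stats plus GC stats, emitted as statsd; slide 25's graph isn't captured in the transcript. A sketch of what that collection might look like (the metric names and statsd transport are assumptions):

```ruby
# Hypothetical memory-measurement snippet (e.g. run from a report handler):
# gather proc stats with sys-proctable and Ruby GC stats, emit them as statsd.
require 'sys/proctable'
require 'statsd'   # statsd-ruby gem; transport is an assumption

statsd = Statsd.new('localhost', 8125)

# Resident set size of the running chef-client process. Newer sys-proctable
# takes a pid: keyword; older releases take a positional pid. On Linux the
# rss field is reported in pages, so convert to bytes.
proc_info = Sys::ProcTable.ps(pid: Process.pid)
page_size = 4096
statsd.gauge('chef.memory.rss_bytes', proc_info.rss * page_size)

# A few GC counters; the exact GC.stat keys vary by Ruby version.
gc = GC.stat
statsd.gauge('chef.gc.count', GC.count)
statsd.gauge('chef.gc.live_slots', gc[:heap_live_slots] || gc[:heap_live_slot] || 0)
statsd.gauge('chef.gc.total_allocated_objects', gc[:total_allocated_objects] || 0)
```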

  26. 4/3/15 BLOATED CHEFS WHAT WE FOUND AND WHAT WE DID

  27. 4/3/15 Step-through Searches BLOATED CHEFS

  28. 4/3/15 Step-through Searches BLOATED CHEFS From this

  29. 4/3/15 Step-through Searches BLOATED CHEFS From this To this

  30. 4/3/15 Step-through Searches BLOATED CHEFS 417MB -> 190MB

  31. 4/3/15 Step-through Searches BLOATED CHEFS 417MB -> 190MB ~54%
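
The before/after code on slides 28–29 is only visible in the slide images, so the exact change isn't captured here. A common way to "step through" search results in Chef, and a plausible shape for the fix (this is an assumption, with a made-up query), is to pass a block to search so results are yielded a page at a time instead of being accumulated into one large in-memory array:

```ruby
# Before (sketch): the whole result set is materialized as one array.
web_nodes = search(:node, 'role:web AND chef_environment:production')
web_nodes.each do |web_node|
  # ... use web_node ...
end

# After (sketch): the block form yields each result as the pages come back,
# so the full array never has to be held in memory at once.
search(:node, 'role:web AND chef_environment:production') do |web_node|
  # ... use web_node ...
end
```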

  32. 4/3/15 Partial Searches BLOATED CHEFS

  33. 4/3/15 Partial Searches BLOATED CHEFS • Provide hash map of desired results

  34. 4/3/15 Partial Searches BLOATED CHEFS • Provide hash map of desired results • Minimizes volume of node data returned/handled

  35. 4/3/15 Partial Searches BLOATED CHEFS • Provide hash map of desired results • Minimizes volume of node data returned/handled • hash2node

  36. 4/3/15 Partial Searches BLOATED CHEFS • Provide hash map of desired results • Minimizes volume of node data returned/handled • hash2node • Two searches touched

  37. 4/3/15 Partial Searches BLOATED CHEFS • Provide hash map of desired results • Minimizes volume of node data returned/handled • hash2node • Two searches touched 90s -> 60s

  38. 4/3/15 Partial Searches BLOATED CHEFS • Provide hash map of desired results • Minimizes volume of node data returned/handled • hash2node • Two searches touched 90s -> 60s 30%
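
Partial search asks the Chef server to return only the attributes named in a hash you provide, which shrinks the volume of node data the client has to pull down and deserialize. The deck also mentions an internal hash2node helper for turning those hashes back into node-like objects; it isn't shown, so the sketch below sticks to the filter itself, with a made-up query and attribute list:

```ruby
# Full search: every matching node comes back in its entirety.
app_nodes = search(:node, 'role:app')

# Partial search (filter_result in Chef 12, previously the partial_search
# cookbook): provide a hash map of the results you want, and only those
# attributes are returned for each match.
app_nodes = search(:node, 'role:app',
                   filter_result: {
                     'name' => ['name'],
                     'ip'   => ['ipaddress'],
                     'az'   => ['ec2', 'placement_availability_zone']
                   })

# Each result is now a small hash rather than a full node object, e.g.:
#   { 'name' => 'app-01', 'ip' => '10.0.1.12', 'az' => 'us-east-1a' }
```
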
  39. 4/3/15 Result Memoization BLOATED CHEFS

  40. 4/3/15 Result Memoization BLOATED CHEFS • Common search data

  41. 4/3/15 Result Memoization BLOATED CHEFS • Common search data • API-backed LWRPs

  42. 4/3/15 Result Memoization BLOATED CHEFS • Common search data • API-backed LWRPs • Can be generalized

  43. 4/3/15 Result Memoization BLOATED CHEFS • Common search data • API-backed LWRPs • Can be generalized
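
When several resources or API-backed LWRPs repeat the same search, the result can be fetched once and reused for the rest of the run. The slides don't show PagerDuty's implementation, so this is a generic sketch using node.run_state as a per-run cache, with hypothetical module and method names:

```ruby
# Hypothetical memoization helper (e.g. in a cookbook library file): the
# first caller pays for the search, later callers with the same arguments
# hit the cache. node.run_state is scoped to a single chef-client run.
module PD
  module SearchCache
    def cached_search(index, query, options = {})
      cache = node.run_state['search_cache'] ||= {}
      key   = [index, query, options]
      cache[key] ||= search(index, query, options)
    end
  end
end

# Make the helper available inside recipes.
Chef::Recipe.send(:include, PD::SearchCache)

# Usage in a recipe -- the second call is answered from the cache:
#   app_nodes = cached_search(:node, 'role:app')
#   app_nodes = cached_search(:node, 'role:app')
```
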
  44. 4/3/15 API Tarpitting BLOATED CHEFS

  45. 4/3/15 API Tarpitting BLOATED CHEFS Centralize calls

  46. 4/3/15 BLOATED CHEFS OTHER NASTIES

  47. 4/3/15 Other Nasties BLOATED CHEFS • Too many conditional guards

  48. 4/3/15 Other Nasties BLOATED CHEFS • Too many conditional guards • tmpfs storage

  49. 4/3/15 Other Nasties BLOATED CHEFS • Too many conditional guards • tmpfs storage • Multiple package resources (Chef 12)

  50. 4/3/15 Other Nasties BLOATED CHEFS • Too many conditional guards • tmpfs storage • Multiple package resources (Chef 12) Six seconds for twelve packages
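
The "multiple package resources (Chef 12)" point appears to refer to Chef 12's multipackage support: apt and yum can install a list of packages in a single transaction, instead of one package resource and one package-manager invocation per package (presumably where the six seconds for twelve packages went). A sketch with made-up package names:

```ruby
# Before: twelve separate package resources, each one a separate
# package-manager run. (Hypothetical package names.)
%w(libfoo libbar libbaz).each do |pkg|
  package pkg
end

# After (Chef 12+): one package resource, one transaction for the whole list.
package %w(libfoo libbar libbaz)
```

The conditional-guard item is likely in the same spirit: a not_if/only_if guard given a string shells out every time it is evaluated, while a Ruby block guard such as `only_if { ::File.exist?('/some/flag') }` stays in-process.
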
  51. 4/3/15 BLOATED CHEFS BEFORE/AFTER

  52. 4/3/15 Memory Saved BLOATED CHEFS Before: After:

  53. 4/3/15 Memory Saved BLOATED CHEFS Before: ~500MB After:

  54. 4/3/15 Memory Saved BLOATED CHEFS Before: ~500MB After: ~60MB

  55. 4/3/15 Memory Saved BLOATED CHEFS Before: ~500MB After: ~60MB 88% less memory!

  56. 4/3/15 Seconds Saved BLOATED CHEFS Before: After:

  57. 4/3/15 Seconds Saved BLOATED CHEFS Before: ~180s/run After:

  58. 4/3/15 Seconds Saved BLOATED CHEFS Before: ~180s/run After: ~30s/run

  59. 4/3/15 Seconds Saved BLOATED CHEFS Before: ~180s/run After: ~30s/run ~84% faster!

  60. 4/3/15 BLOATED CHEFS FREEDOM

  61. 4/3/15 Thank you. @evan2645 EVAN GILMAN