production: an owner's manual

from exec(ut) 2018

Igor Wiedler

April 23, 2018

Transcript

  1. production: an owner's manual

  2. hello!

  3. broken computers

  4. None
  5. getting sidetracked now, so sorry* (* not sorry)

  6. None
  7. None
  8. None
  9. back to serious business

  10. !

  11. None
  12. a production system is a system that serves real users

  13. the goal of operations is to ensure services are reliable

  14. in order to provide a good user experience

  15. None
  16. failure

  17. app

  18. app linux kernel cpu dram disk network power supply switches load balancer dns submarine cables routers fiber
  19. app linux kernel the cloud

  20. None
  21. • cosmic rays • disk failure • power outages • software bugs • ...
  22. entropy

  23. None
  24. capacity

  25. None
  26. None
  27. None
  28. cascading failure

  29. None
  30. system design

  31. redundancy

  32. "

  33. scale

  34. None
  35. "

  36. p1 m3 c1 m2 m1 p2 c2

  37. data storage

  38. "

  39. "

  40. protocols

  41. None
  42. monitoring

  43. many components many req/s

  44. None
  45. measure all the things?

  46. ✅ ⏱

  47. golden signals • latency • traffic • errors • saturation

  51. golden signals • latency • traffic • errors • saturation

      0 -  50 [1620]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (74.55%)
     50 - 100 [ 447]: ∎∎∎∎∎∎∎∎∎∎ (20.57%)
    100 - 150 [  49]: ∎ (2.25%)
    150 - 200 [  15]: (0.69%)
    200 - 250 [  15]: (0.69%)
    250 - 300 [  10]: (0.46%)
    300 - 350 [   6]: (0.28%)
    350 - 400 [   1]: (0.05%)
    400 - 450 [   0]: (0.00%)
    450 - 500 [   4]: (0.18%)
  53. saturation traffic latency errors

  54. None
  55. humans

  56. None
  57. oops, deleted the database

  58. bad human!

  59. why does this button even exist?

  60. app linux kernel cpu dram disk network power supply switches load balancer dns submarine cables routers fiber
  61. app linux kernel cpu dram disk network power supply switches load balancer dns submarine cables routers fiber humans
  62. app linux kernel cpu dram disk network power supply switches load balancer dns submarine cables routers fiber humans h u m a n s
  63. epic failure is almost always systemic

  64. failure

  65. recap

  66. • a production system serves real users • users like things that work and are fast • epic failure is almost always systemic
  67. thx @igorwhilefalse

  68. None
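
The latency histogram on slide 51 groups request latencies into 50 ms buckets and shows each bucket's count and share of traffic. A minimal sketch of how such a view might be produced, in Python. The function name `latency_histogram`, the bar scaling, and the sample latencies are assumptions for illustration and are not taken from the talk:

```python
from collections import Counter


def latency_histogram(latencies_ms, bucket_width=50, bar_scale=2.0):
    """Group latencies into fixed-width buckets; return one text line per bucket.

    bar_scale is a hypothetical knob: one bar glyph per `bar_scale` percent.
    """
    if not latencies_ms:
        return []
    # Map each sample to the index of its bucket, e.g. 60 ms -> bucket 1.
    counts = Counter(int(v // bucket_width) for v in latencies_ms)
    total = len(latencies_ms)
    lines = []
    # Walk every bucket up to the largest one so empty buckets still appear.
    for b in range(max(counts) + 1):
        n = counts.get(b, 0)
        pct = 100.0 * n / total
        bar = "∎" * int(pct / bar_scale)
        lo, hi = b * bucket_width, (b + 1) * bucket_width
        lines.append(f"{lo:>3} - {hi:<3} [{n:>4}]: {bar} ({pct:.2f}%)")
    return lines


if __name__ == "__main__":
    # Made-up sample latencies in milliseconds.
    for line in latency_histogram([12, 31, 48, 55, 72, 130, 460]):
        print(line)
```

A histogram like this makes the long tail visible: a handful of slow requests in the upper buckets would be hidden by an average.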