Development was the easy part (RubyNation 2015)

As developers, we spend our time in development environments, but our users only see our code in production. The differences between development and production can be extremely surprising, and surprises are bad when your users and revenue are at stake. I'll talk about the most common surprises of production environments, including metrics, networks, hardware, and data stores. By the end of the talk, you'll know how to use your existing tools to monitor and even improve your production environment, and which gotchas to watch out for.

André Arko

June 12, 2015

Transcript

  1. Development was the easy part

  2. Development was the easy part

  3. André Arko @indirect

  4. None
  5. The Ruby Way therubyway.io

  6. stickers!

  7. None
  8. stickers!

  9. None
  10. stickers!

  11. Development is very different

  12. from Production

  13. you rn →

  14. you later →

  15. Metrics

  16. Metrics are important

  17. Metrics tell you what is happening

  18. Metrics convince you you understand

  19. Averages convince you you understand

  20. Averages are lie-candy for your brain

  21. Averages [chart: x-axis -5 to 5, y-axis 0 to 0.4]

  22. Averages [chart: x-axis -5 to 5, y-axis 0 to 0.4]

  23. None
  24. None
  25. just heard “we have a great average” →

  26. Averages mask problems

  27. [chart: x-axis 0 to 10, y-axis 0 to 250]

  28. Graph the median

  29. [chart: x-axis 0 to 10, y-axis 0 to 250]

  30. Graph 95th percentile

  31. [chart: x-axis 0 to 10, y-axis 0 to 250]

  32. Graph 99th percentile

  33. [chart: x-axis 0 to 10, y-axis 0 to 1000]

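
The slides on averages, the median, and the 95th and 99th percentiles can be illustrated with a small Ruby sketch. The sample latencies are made up for illustration, and the nearest-rank percentile method is one common choice, not necessarily what the talk's graphs used.

```ruby
# Nearest-rank percentile: the smallest sample such that at least
# `pct` percent of all samples are less than or equal to it.
def percentile(values, pct)
  raise ArgumentError, "no samples" if values.empty?

  sorted = values.sort
  rank = (pct * sorted.length / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Arithmetic mean, the "average" the slides warn about.
def mean(values)
  values.sum.to_f / values.length
end

# Hypothetical response times in milliseconds:
# 90 fast requests and 10 very slow ones.
latencies_ms = Array.new(90, 20) + Array.new(10, 2000)

puts "mean:   #{mean(latencies_ms)}"            # 218.0
puts "median: #{percentile(latencies_ms, 50)}"  # 20
puts "p95:    #{percentile(latencies_ms, 95)}"  # 2000
puts "p99:    #{percentile(latencies_ms, 99)}"  # 2000
```

In this made-up data set no request took anywhere near 218 ms: the mean describes a request that never happened, while the median and the high percentiles show the two real populations.
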
  34. Aggregate graphs another kind of average

  35. None
  36. Breakout graphs see individual variations

  37. None
  38. None
  39. Aggregate alerts more dead servers than alive servers

  40. site’s up if any servers are up!

  41. Breakout alerts first dead server not all the servers
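
The difference between aggregate and breakout alerts can be sketched in a few lines of Ruby. The server names and the `statuses` hash are hypothetical; a real monitoring tool would supply its own data and alert rules.

```ruby
# Aggregate alert: fires only once dead servers outnumber live ones.
def aggregate_alert?(statuses)
  dead = statuses.count { |_name, up| !up }
  dead > statuses.length - dead
end

# Breakout alert: names every dead server, so the first failure is
# visible instead of being hidden inside a healthy-looking total.
def breakout_alerts(statuses)
  statuses.reject { |_name, up| up }.keys
end

# Hypothetical fleet: server name => is it up?
statuses = { "web1" => true, "web2" => false, "web3" => true }

puts aggregate_alert?(statuses)        # false
puts breakout_alerts(statuses).inspect # ["web2"]
```

With one of three servers down, the aggregate check stays silent while the breakout check already reports `web2`.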

  42. Servers

  43. you have no idea what is going on Servers

  44. really.

  45. Routing

  46. your app has this Routing

  47. how does it work? Routing

  48. Development App You

  49. People Router Server App App Router Server App App Router

    Production
  50. how slow is it? Routing

  51. does it back up? Routing

  52. Request

  53. not the time you measure Request

  54. wall-clock time from real clients Request

  55. make requests from around the world Request

  56. graph them Request

  57. graph them alert on them Request

  58. graph them alert on them thank me later Request
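
One way to get "wall-clock time from real clients" is to time the whole request from outside the app. The following is a minimal sketch using only Ruby's standard library; the `timed` and `probe` helpers and the URL in the usage comment are illustrative names, not something from the talk.

```ruby
require "net/http"
require "uri"

# Runs the block and returns [result, elapsed_ms]. The monotonic clock
# is used so that system clock adjustments cannot skew the measurement.
def timed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [result, elapsed * 1000.0]
end

# Times one full HTTP GET as the client experiences it, including
# DNS lookup, connection setup, routing, queueing, and the app itself.
def probe(url)
  response, elapsed_ms = timed { Net::HTTP.get_response(URI(url)) }
  { url: url, status: response.code.to_i, elapsed_ms: elapsed_ms }
end

# Usage (hypothetical URL), run from machines outside your own network:
#   p probe("https://example.com/")
```

Feeding the `elapsed_ms` values from several locations into whatever graphing and alerting tools you already use covers the "graph them" and "alert on them" slides.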

  59. VM lag

  60. do you have it? VM lag

  61. do you check for it? VM lag

  62. do you know how to check for it? VM lag
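
The slides do not say how to check for VM lag. One possible approach on Linux guests, offered here as an assumption, is to watch CPU "steal" time, which counts time the hypervisor gave to other tenants. The sketch below parses two samples of the aggregate `cpu` line from `/proc/stat`; the helper names are illustrative.

```ruby
# Field order on the aggregate "cpu" line of /proc/stat (Linux).
CPU_FIELDS = %i[user nice system idle iowait irq softirq steal guest guest_nice].freeze

# Turns "cpu  1 2 3 ..." into a hash of counters. Fields missing on
# older kernels are treated as zero.
def parse_cpu_line(line)
  values = line.split.drop(1).map(&:to_i)
  CPU_FIELDS.zip(values).to_h.transform_values { |v| v || 0 }
end

# Percentage of CPU time stolen by the hypervisor between two samples.
def steal_percent(before_line, after_line)
  before = parse_cpu_line(before_line)
  after = parse_cpu_line(after_line)
  # Guest time is already included in user/nice, so it is left out of the total.
  keys = CPU_FIELDS - %i[guest guest_nice]
  total = keys.sum { |key| after[key] - before[key] }
  return 0.0 if total.zero?

  100.0 * (after[:steal] - before[:steal]) / total
end

# Usage on a Linux host:
#   first  = File.foreach("/proc/stat").first
#   sleep 5
#   second = File.foreach("/proc/stat").first
#   puts steal_percent(first, second)
```

A steal percentage that stays near zero suggests the hypervisor is not holding the VM back; sustained non-zero values are worth graphing next to request latency.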

  63. Runtime

  64. how do you tell you lost consciousness? Runtime

  65. do you have it? Runtime

  66. do you have it? you have it. Runtime

  67. do you have it? you have it. how bad is it? Runtime

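
A process cannot observe its own pause directly, but it can notice that more time passed than it asked for. The sketch below is one way to do that, not a method taken from the talk: sleep for a known interval and record the overshoot. The helper names are hypothetical, and the overshoot also includes ordinary scheduler jitter, so only large or growing values are interesting.

```ruby
# Sleeps for `interval` seconds and returns how many extra seconds
# passed. Garbage collection pauses, a busy interpreter lock, or a
# stalled VM all show up as overshoot.
def pause_overshoot(interval)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  sleep(interval)
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [elapsed - interval, 0.0].max
end

# Collects several overshoot samples so the worst case can be graphed.
def sample_pauses(samples:, interval:)
  Array.new(samples) { pause_overshoot(interval) }
end

# Usage inside a running app: sample from a background thread.
#   Thread.new do
#     loop { puts sample_pauses(samples: 10, interval: 0.1).max }
#   end
```

Graphing the maximum overshoot rather than its average keeps the measurement consistent with the earlier advice about averages.
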
  68. Data Stores

  69. in production Data Stores

  70. in production are distributed Data Stores

  71. what does that mean?

  72. your experience (so far) is wrong

  73. Saving data

  74. tries to save your data Saving data

  75. might save your data Saving data

  76. Replication

  77. is not data-saving magic Replication

  78. tries to save your data… Replication

  79. tries to save your data… repeatedly Replication

  80. Postgres

  81. totally safe, right? Postgres

  82. async replication Postgres

  83. network failures lose “saved” data Postgres

  84. Redis

  85. is single-threaded Redis

  86. has no failover Redis

  87. Mongo replica sets default to one write

  88. Mongo demand N copies survive N-1 failures

  89. trust no one

  90. if you didn’t try it yourself you are guessing

  91. test it yourself

  92. So, in the end what did we learn?

  93. Production is fundamentally

  94. Production is fundamentally systemically

  95. Production is fundamentally systemically different

  96. Failures will happen

  97. Failures can be resisted

  98. Failures should not result in one-off patches

  99. Survival requires systematic trials & testing

  100. Good luck!