Development Was the Easy Part (Scottish Ruby 2014)

While writing code that actually works is (in all likelihood) impossible, there are other things that are even harder. This talk is about one of those: production environments. As a developer, you spend almost all of your time in development. That's a shame, because your code will be tested orders of magnitude more times in production. As tautological as it is to say, development and production are incredibly different places. This talk will try to make you surprised and uncomfortable about your own production environment. We'll focus on the ways that metrics, networks, and data stores can be surprising or easy to misunderstand… until it's 3am and your phone won't stop ringing.

André Arko

May 13, 2014

Transcript

  1. Development was the easy part

  2. André Arko @indirect

  5. Development is very different

  6. from Production

  7. you rn →

  8. you later →

  9. Metrics

  10. Metrics are important

  11. Metrics tell you what is happening

  12. Metrics convince you you understand

  13. Averages convince you you understand

  14. Averages are lie-candy for your brain

  15. Averages [chart: x axis -5 to 5, y axis 0 to 0.4]

  16. Averages [chart: x axis -5 to 5, y axis 0 to 0.4]

  19. just heard “we have a great average” →

  20. Averages mask problems

  21. [chart: x axis 0 to 10, y axis 0 to 250]

  22. Graph the median

  23. [chart: x axis 0 to 10, y axis 0 to 250]

  24. Graph 95th percentile

  25. [chart: x axis 0 to 10, y axis 0 to 250]

  26. Graph 99th percentile

  27. [chart: x axis 0 to 10, y axis 0 to 1000]

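The point of the median and percentile slides above can be sketched in a few lines of Ruby. This is only an illustration: the response times are made up, and `percentile` is a hypothetical helper using the nearest-rank definition (other definitions interpolate and give slightly different numbers).

```ruby
# Nearest-rank percentile: the smallest sample such that at least
# pct percent of all samples are less than or equal to it.
def percentile(samples, pct)
  raise ArgumentError, "samples must not be empty" if samples.empty?

  sorted = samples.sort
  rank = (pct * sorted.length / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

def mean(samples)
  samples.sum.to_f / samples.length
end

# Made-up response times in milliseconds: 98 fast requests, 2 very slow ones.
response_times_ms = Array.new(98, 20) + [2_000, 4_000]

summary = {
  mean: mean(response_times_ms),
  median: percentile(response_times_ms, 50),
  p95: percentile(response_times_ms, 95),
  p99: percentile(response_times_ms, 99),
}
puts summary
```

With this data the mean lands at a value that no request actually experienced, the median and 95th percentile look healthy, and only the 99th percentile exposes the slow requests.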
  28. Aggregate graphs: another kind of average

  30. Breakout graphs: see individual variations

  33. Aggregate alerts: more dead servers than alive servers

  34. site’s up if any servers are up!

  35. Breakout alerts: first dead server, not all the servers
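As a rough sketch of the difference between the two alert styles on the slides above: the server names and health results below are invented, and `aggregate_alert?` and `breakout_alerts` are hypothetical helpers, not part of any monitoring product.

```ruby
# Invented health-check results: server name => is it responding?
servers = { "web1" => true, "web2" => false, "web3" => true, "web4" => true }

# Aggregate alert: fires only once dead servers outnumber live ones.
def aggregate_alert?(health)
  dead = health.values.count(false)
  dead > health.size - dead
end

# Breakout alert: reports every dead server individually,
# so the first failure is already visible.
def breakout_alerts(health)
  health.reject { |_name, up| up }.keys
end

puts "aggregate alert firing? #{aggregate_alert?(servers)}"
puts "breakout alerts for: #{breakout_alerts(servers).inspect}"
```

With one of four servers down, the aggregate check stays quiet while the breakout check names the dead server.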

  36. Servers

  37. Servers: you have no idea what is going on

  38. really.

  39. Routing

  40. Routing: your app has this

  41. Routing: how does it work?

  42. Development [diagram: App, You]

  43. Production [diagram: People, Router, Server, App, App, Router, Server, App, App, Router]

  44. Routing: how slow is it?

  45. Routing: does it back up?

  46. Request time

  47. Request time: not the time you measure

  48. Request time: wall-clock time from real clients

  49. Request time: make requests from around the world

  50. Request time: graph them

  51. Request time: graph them, alert on them

  52. Request time: graph them, alert on them, thank me later
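One way to collect the kind of wall-clock request time described above is a small external probe. This is a minimal sketch, assuming you run it from machines outside your own network and feed the result into whatever graphing and alerting you already use; `wall_clock_ms` and `probe` are hypothetical names, and the URLs come from the command line.

```ruby
require "net/http"
require "uri"

# Runs the block and returns how long it took in milliseconds,
# using the monotonic clock so system clock adjustments do not skew the result.
def wall_clock_ms
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round(1)
end

# Times one full request as a client sees it: DNS, connect, routing,
# queueing, app time, and the response transfer all included.
def probe(url)
  status = nil
  elapsed = wall_clock_ms { status = Net::HTTP.get_response(URI(url)).code }
  { url: url, status: status, ms: elapsed }
end

# Usage: ruby probe.rb https://your-app.example/ (does nothing without arguments)
ARGV.each { |url| puts probe(url) }
```

The number this prints will usually be larger than the request time your app logs, because the app only sees the part of the request that happens inside it.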

  53. VM lag

  54. VM lag: do you have it?

  55. VM lag: do you check for it?

  56. VM lag: do you know how to check for it?

  57. Runtime lag

  58. Runtime lag: how do you tell you lost consciousness?

  59. Runtime lag: do you have it?

  60. Runtime lag: do you have it? you have it.

  61. Runtime lag: do you have it? you have it. how bad is it?
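One common way to notice that a process "lost consciousness" is to ask for a short sleep and check how late the wake-up was. The sketch below assumes that approach, which the slides do not spell out. `measure_lag` is a hypothetical helper, and the overshoot it reports lumps together every source of delay (hypervisor scheduling, garbage collection, a busy machine) without telling them apart.

```ruby
# Sleeps for `interval` seconds `samples` times and returns, for each sleep,
# how many seconds longer than requested it actually took.
def measure_lag(interval: 0.05, samples: 5)
  Array.new(samples) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    sleep(interval)
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    [elapsed - interval, 0.0].max
  end
end

lags = measure_lag
puts format("worst lag over %d samples: %.2f ms", lags.size, lags.max * 1000)
```

Run in a background thread and graphed over time, a measurement like this gives a rough answer to "how bad is it?" for both VM lag and runtime lag.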
  62. Data stores

  63. Data stores in production

  64. Data stores in production are distributed

  65. what does that mean?

  66. your experience (so far) is wrong

  67. Saving data

  68. Saving data tries to save your data

  69. Saving data might save your data

  70. Replication

  71. Replication is not data-saving magic

  72. Replication tries to save your data…

  73. Replication tries to save your data… repeatedly

  74. Postgres

  75. Postgres: totally safe, right?

  76. Postgres: async replication

  77. Postgres: network failures lose “saved” data

  78. Redis

  79. Redis is single-threaded

  80. Redis has no failover

  81. Redis-sentinel elects a new leader

  82. Redis-sentinel throws away non-winners’ writes

  83. Mongo (gem < 1.8) returns before the first write

  84. Mongo (gem < 1.8): your data is on zero disks so far

  85. Mongo replication sets default to one write

  86. Mongo: demand N copies, survive N-1 failures
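The "demand N copies, survive N-1 failures" idea can be played with in a toy model. To be clear about the assumptions: `ToyReplicaSet` below is not a database client and does not use any real driver API. Each replica is just a Ruby array, `w` stands for how many copies must exist before a write is acknowledged, and the failure always hits the worst possible replica.

```ruby
# Toy model only: a "replica" is an in-memory Array, nothing touches a disk.
class ToyReplicaSet
  def initialize(size)
    @replicas = Array.new(size) { [] }
  end

  # Copies the value to `w` live replicas, then acknowledges.
  # Returns false (no acknowledgement) if fewer than `w` replicas are alive.
  def write(value, w:)
    live = @replicas.compact
    return false if live.size < w

    live.first(w).each { |replica| replica << value }
    true
  end

  # Simulates losing replicas (by index) before any further replication happens.
  def fail!(*indexes)
    indexes.each { |index| @replicas[index] = nil }
  end

  def survives?(value)
    @replicas.compact.any? { |replica| replica.include?(value) }
  end
end

[0, 1, 2].each do |w|
  set = ToyReplicaSet.new(3)
  set.write("order-1", w: w)
  set.fail!(0) # lose the first replica right after the acknowledgement
  puts "w=#{w}: survives one failure? #{set.survives?('order-1')}"
end
```

In this model an acknowledged write with `w` of 0 or 1 is gone after a single failure, while `w` of 2 survives it. A real system needs the same experiment run against the real data store, which is where the next slides go.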

  87. trust no one

  88. if you didn’t try it you are guessing

  89. test it yourself

  90. So, in the end, what did we learn?

  91. Production is fundamentally

  92. Production is fundamentally systemically

  93. Production is fundamentally systemically different

  94. Failures will happen

  95. Failures can be resisted

  96. Failures should not result in one-off patches

  97. Survival requires systematic trials & testing

  98. Development is not like production