Development Was the Easy Part (MagmaConf 2014)

While writing code that actually works is (in all likelihood) impossible, there are other things that are even harder. This talk is about one of those: production environments. As a developer, you spend almost all of your time in development. That's a shame, because your code will be tested orders of magnitude more times in production. As tautological as it is to say, development and production are incredibly different places. This talk will try to make you surprised and uncomfortable about your own production environment. We'll focus on the ways that metrics, networks, and data stores can be surprising or easy to misunderstand… until it's 3am and your phone won't stop ringing.

André Arko

June 05, 2014

Transcript

  1. None
  2. None
  3. None
  4. Development was the easy part

  5. André Arko @indirect

  6. None
  7. None
  8. Development is very different

  9. from Production

  10. you rn →

  11. you later →

  12. Metrics

  13. Metrics are important

  14. Metrics tell you what is happening

  15. Metrics convince you you understand

  16. Averages convince you you understand

  17. Averages are lie-candy for your brain

  18. Averages [chart]

  19. Averages [chart]

  20. None
  21. None
  22. just heard “we have a great average” →

  23. Averages mask problems

  24. [chart: axis ticks 0 to 10 and 0 to 250]

  25. Graph the median

  26. [chart: axis ticks 0 to 10 and 0 to 250]

  27. Graph 95th percentile

  28. [chart: axis ticks 0 to 10 and 0 to 250]

  29. Graph 99th percentile

  30. [chart: axis ticks 0 to 10 and 0 to 1000]

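The point of the median and percentile slides can be sketched in a few lines. This is a minimal illustration, not something shown in the talk: the function name `summarize` and the sample latencies are made up, and the percentiles use the "inclusive" interpolation from Python's `statistics` module.

```python
import statistics


def summarize(latencies_ms):
    """Return mean, median, 95th and 99th percentile of a list of latencies."""
    # quantiles(n=100) returns 99 cut points: index 94 is p95, index 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": statistics.mean(latencies_ms),
        "median": statistics.median(latencies_ms),
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Hypothetical sample: 97 fast requests (20 ms) and 3 very slow ones (5000 ms).
sample = [20] * 97 + [5000] * 3
stats = summarize(sample)
print(stats)
```

For this made-up sample the mean is about 169 ms, a number that describes neither the typical request (20 ms) nor the slow ones (5000 ms). Only the 99th percentile exposes the slow requests at all.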
  31. Aggregate graphs: another kind of average

  32. None
  33. Breakout graphs: see individual variations

  34. None
  35. Aggregate alerts: more dead servers than alive servers

  36. site’s up if any servers are up!

  37. Breakout alerts: first dead server, not all the servers
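The difference between the two alert styles can be sketched as follows. The function names, the server names, and the exact aggregate rule (fire when more servers are down than up) are assumptions made for illustration, loosely following the slide wording.

```python
def aggregate_alert(server_up):
    """Fire only when more servers are down than up (an aggregate rule)."""
    down = sum(1 for up in server_up.values() if not up)
    return down > len(server_up) - down


def breakout_alerts(server_up):
    """Return the name of every server that is down, so the first failure is visible."""
    return sorted(name for name, up in server_up.items() if not up)


# Hypothetical fleet with a single dead server.
fleet = {"web1": True, "web2": False, "web3": True, "web4": True}
print(aggregate_alert(fleet))   # False: the aggregate rule stays silent
print(breakout_alerts(fleet))   # ['web2']: the breakout rule names the dead server
```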

  38. Servers

  39. Servers: you have no idea what is going on

  40. really.

  41. Routing

  42. Routing: your app has this

  43. Routing: how does it work?

  44. Development [diagram: You, App]

  45. Production [diagram: People, Router, Server, App]

  46. Routing: how slow is it?

  47. Routing: does it back up?

  48. Request time

  49. Request time: not the time you measure

  50. Request time: wall-clock time from real clients

  51. Request time: make requests from around the world
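Measuring request time from the client side, as these slides suggest, can be sketched like this. It is a minimal illustration under assumptions: `timed_fetch` and its `fetch` parameter are hypothetical names, and the default fetcher simply uses `urllib.request.urlopen` from the standard library.

```python
import time
import urllib.request


def timed_fetch(url, fetch=None, timeout=10.0):
    """Return (elapsed_seconds, body_length) as seen by the client."""
    if fetch is None:
        # Default fetcher: a plain HTTP GET through the standard library.
        def fetch(target):
            with urllib.request.urlopen(target, timeout=timeout) as response:
                return response.read()

    start = time.perf_counter()
    body = fetch(url)  # includes DNS, connection setup, routing, queueing, and app time
    elapsed = time.perf_counter() - start
    return elapsed, len(body)
```

Running something like this from several regions and comparing the results with the timings your application logs gives a rough idea of how much time is spent before a request ever reaches your code.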

  52. VM lag

  53. VM lag: do you have it?

  54. VM lag: do you check for it?

  55. VM lag: do you know how to check for it?

  56. Runtime lag

  57. Runtime lag: how do you tell you lost consciousness?

  58. Runtime lag: do you have it?

  59. Runtime lag: do you have it? you have it.

  60. Runtime lag: do you have it? you have it. how bad is it?

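One common way to answer "how bad is it?" is to ask for a short sleep and record how late the process actually wakes up. The sketch below is an assumption about how one might check, not the method from the talk; `measure_lag` is a hypothetical name.

```python
import time


def measure_lag(interval=0.01, samples=20):
    """Sleep `interval` seconds repeatedly; return how late each wake-up was, in seconds."""
    lags = []
    for _ in range(samples):
        start = time.monotonic()
        time.sleep(interval)
        # Anything beyond the requested interval is time the process was not running.
        lags.append(time.monotonic() - start - interval)
    return lags


if __name__ == "__main__":
    lags = measure_lag()
    print(f"worst wake-up lag: {max(lags) * 1000:.2f} ms")
```

Large or spiky values suggest the process was paused for a while, whether by the runtime, the operating system scheduler, or the hypervisor. The measurement alone does not say which.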
  61. Data stores

  62. Data stores in production

  63. Data stores in production are distributed

  64. what does that mean?

  65. your experience (so far) is wrong

  66. Saving data

  67. Saving data tries to save your data

  68. Saving data might save your data

  69. Replication

  70. Replication is not data-saving magic

  71. Replication tries to save your data…

  72. Replication tries to save your data… repeatedly

  73. Postgres

  74. Postgres: totally safe, right?

  75. Postgres: async replication

  76. Postgres: network failures lose “saved” data

  77. Redis

  78. Redis is single-threaded

  79. Redis has no failover

  80. Redis-sentinel elects a new leader

  81. Redis-sentinel throws away non-winners’ writes

  82. Mongo (gem < 1.8): returns before the first write

  83. Mongo (gem < 1.8): your data is on zero disks so far

  84. Mongo: replication sets default to one write

  85. Mongo: demand N copies, survive N-1 failures
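The "N copies survive N-1 failures" rule can be illustrated with a toy in-memory model. This is not a MongoDB, Redis, or Postgres client: `ToyCluster` and all of its methods are hypothetical, and the model assumes the worst case, where a node fails before any background replication catches up.

```python
class ToyCluster:
    """In-memory toy: node 0 acts as the primary, the rest as replicas."""

    def __init__(self, node_count):
        self.nodes = [set() for _ in range(node_count)]
        self.alive = [True] * node_count

    def write(self, value, required_copies):
        """Acknowledge only after `required_copies` live nodes hold the value."""
        live = [i for i, up in enumerate(self.alive) if up]
        if len(live) < required_copies:
            return False  # cannot meet the requested durability, so do not acknowledge
        # Worst case: copies beyond the required number never arrive.
        for i in live[:required_copies]:
            self.nodes[i].add(value)
        return True

    def fail(self, index):
        """Mark a node as dead; its data is no longer reachable."""
        self.alive[index] = False

    def survives(self, value):
        """True if any live node still holds the value."""
        return any(up and value in data for data, up in zip(self.nodes, self.alive))


cluster = ToyCluster(3)
cluster.write("one-copy", required_copies=1)     # acknowledged by the primary only
cluster.write("three-copies", required_copies=3)
cluster.fail(0)                                  # the primary dies
print(cluster.survives("one-copy"))      # False: the acknowledged write is gone
print(cluster.survives("three-copies"))  # True
```

In this model a write acknowledged after one copy is lost with the first failure, while a write held by three nodes is still readable after two of them fail.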

  86. trust no one

  87. if you didn’t try it you are guessing

  88. test it yourself

  89. So, in the end: what did we learn?

  90. Production is fundamentally

  91. Production is fundamentally systemically

  92. Production is fundamentally systemically different

  93. Failures will happen

  94. Failures can be resisted

  95. Failures should not result in one-off patches

  96. Survival requires systematic trials & testing

  97. Development is not like production