Development is just the tip of the iceberg

André Arko
October 10, 2013

Delivered at DevCon TLV in Tel Aviv, Israel, this talk focuses on some of the common ways that production environments are different from development environments. Even though developers interact with development 90% of the time, users are _always_ interacting with code in production. It's incredibly important to remember not just that they aren't the same, but how they are different. Armed with that knowledge, we can make tradeoffs in production that produce the best results possible for our particular software.

Transcript

  1. Development is just the tip of the iceberg

  2. André Arko @indirect

  3. None
  4. None
  5. DANGER PRODUCTION AHEAD

  6. Metrics

  7. Metrics are important

  8. Metrics tell you what is happening

  9. Metrics convince you you understand

  10. Averages convince you you understand

  11. but brains are pretty weird

  12. you probably don’t understand averages

  13. Average (right?)

  14. None
  15. None
  16. !

  17. Averages mask problems

  18. Averages !

  19. Instead graph the full distribution

  20. Instead graph median, mean, and 95th

  21. Aggregates another kind of average

  22. Srsly guise breakout graphs

  23. Srsly guise alert on broken-out metrics

  24. Srsly guise alerts on aggregates are probably too late
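
The point about averages masking problems can be made concrete with a small sketch. The numbers below are invented for illustration (they are not from the talk), and `percentile` and `summarize` are hypothetical helper names; the percentile uses the simple nearest-rank method, which is only one of several common definitions.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Report the three numbers the slides suggest graphing together."""
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "p95": percentile(samples, 95),
    }


# Invented response times in milliseconds: 90 fast requests and 10 very slow ones.
response_times_ms = [50] * 90 + [2000] * 10
summary = summarize(response_times_ms)
print(summary)
```

With this made-up data the mean is 245 ms, a latency that no request actually experienced, while the median (50 ms) and 95th percentile (2000 ms) show that most users are fine and one in ten is having a terrible time.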

  25. Servers

  26. Servers you have no idea what is going on

  27. really.

  28. it’s 3am. do you know where your application is?

  29. Routing your app has this

  30. Routing how slow is it?

  31. Routing does it back up?

  32. Request time

  33. Request time not your metrics, I mean for real

  34. Request time make requests from all over

  35. Request time graph them

  36. Request time graph them alert on them

  37. Request time graph them alert on them thank me later
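
One way to read the "request time" slides is as a call for an external probe: something outside the application that makes a real request, times it end to end, and flags slow results. The sketch below is an assumption about what such a probe might look like, not the speaker's tooling; `fetch_url`, `probe`, and the threshold values are hypothetical names and numbers chosen for illustration.

```python
import time
import urllib.request


def fetch_url(url, timeout_seconds=10):
    """Fetch a URL completely, the way an outside client would."""
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        response.read()


def probe(fetch, threshold_seconds):
    """Time one call to fetch() and report whether it was too slow.

    Returns (elapsed_seconds, breached). An exception from fetch() propagates;
    a real checker would probably treat that as an alert too.
    """
    start = time.monotonic()
    fetch()
    elapsed = time.monotonic() - start
    return elapsed, elapsed > threshold_seconds


# Stand-in for a real request so the sketch runs without a network.
# In practice this might be: probe(lambda: fetch_url(some_health_check_url), 0.5)
elapsed, breached = probe(lambda: time.sleep(0.05), threshold_seconds=0.01)
print(f"elapsed={elapsed:.3f}s breached={breached}")
```

Running the same probe from several locations and graphing `elapsed` over time would cover the "from all over" and "graph them" slides; `breached` is the hook for alerting.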

  38. Runtime lag

  39. Runtime lag (how do you tell you lost consciousness?)

  40. Runtime lag do you have it?

  41. Runtime lag do you have it? (yes)

  42. Runtime lag how bad is it?

  43. Runtime lag how do you track it?
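
The slides do not say how to track runtime lag, so the following is only one common heuristic, offered as an assumption: ask to sleep for a fixed interval and record how much longer the sleep actually took. A process that was paused (by garbage collection, scheduling, or a busy host) wakes up late, and the overshoot is a rough measure of the pause. `measure_lag` is a hypothetical name.

```python
import time


def measure_lag(interval_seconds=0.01, rounds=20):
    """Sleep repeatedly and record how much longer each sleep took than requested."""
    lags = []
    for _ in range(rounds):
        start = time.monotonic()
        time.sleep(interval_seconds)
        lags.append(time.monotonic() - start - interval_seconds)
    return lags


lags = measure_lag()
print(f"worst lag: {max(lags) * 1000:.2f} ms over {len(lags)} samples")
```

A long-running version of this loop, reporting its worst recent overshoot as a metric, would give a graph that answers "how bad is it?" from inside the process.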

  44. VM lag

  45. VM lag do you have it?

  46. VM lag do you even check for it?

  47. VM lag do you know how to check for it?

  48. Data stores

  49. Data stores in production

  50. Data stores in production are distributed

  51. what does that mean?

  52. your experience (so far) is wrong

  53. Saving data

  54. Saving data tries to save your data

  55. Saving data might save your data

  56. Replication

  57. Replication doesn’t save you

  58. Postgres async replication

  59. Postgres network failures can lose saved data

  60. Redis has no failover

  61. Redis-sentinel elects a new leader

  62. Redis-sentinel keeps one leader’s saves during failures

  63. Mongo returns before the first write

  64. Mongo your data is on zero disks (so far)

  65. Mongo demand N copies survive N-1 failures

  66. trust no one

  67. if you didn’t try it, you are guessing

  68. try it yourself

  69. So what did we learn?

  70. Production is fundamentally

  71. Production is fundamentally systemically

  72. Production is fundamentally systemically different

  73. Failures will happen

  74. Failures can be resisted

  75. Failures should not result in one-off patches

  76. Survival requires systematic deliberation & design

  77. Survival requires systematic trials & testing

  78. production is not development

  79. don’t you forget it!