Development Was the Easy Part (Scottish Ruby 2014)

While writing code that actually works is (in all likelihood) impossible, there are other things that are even harder. This talk is about one of those: production environments. As a developer, you spend almost all of your time in development. That's a shame, because your code will be tested orders of magnitude more times in production. As tautological as it is to say, development and production are incredibly different places. This talk will try to make you surprised and uncomfortable about your own production environment. We'll focus on the ways that metrics, networks, and data stores can be surprising or easy to misunderstand… until it's 3am and your phone won't stop ringing.

André Arko

May 13, 2014

Transcript

  1. Development was
    the easy part

  2. André Arko
    @indirect

  5. Development
    is very different

  6. from
    Production

  7. you rn →

  8. you later →

  9. Metrics

  10. Metrics
    are important

  11. Metrics
    tell you what
    is happening

  12. Metrics
    convince you
    you understand

  13. Averages
    convince you
    you understand

  14. Averages
    are lie-candy
    for your brain

  15. Averages
    [chart: x-axis -5 to 5, y-axis 0 to 0.4]

  16. Averages
    [chart: x-axis -5 to 5, y-axis 0 to 0.4]

  19. just heard “we have
    a great average” →

  20. Averages
    mask problems

  21. [chart: x-axis 0 to 10, y-axis 0 to 250]

  22. Graph
    the median

  23. [chart: x-axis 0 to 10, y-axis 0 to 250]

  24. Graph
    95th percentile

  25. [chart: x-axis 0 to 10, y-axis 0 to 250]

  26. Graph
    99th percentile

  27. [chart: x-axis 0 to 10, y-axis 0 to 1000]

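The median and percentile slides above can be sketched in a few lines of Python. The latency values are invented for illustration (97 fast requests and 3 very slow ones), and `percentile` is a hypothetical nearest-rank helper, not something taken from the talk.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Assumed sample data: response times in milliseconds.
latencies = [20] * 97 + [2000] * 3

mean = statistics.mean(latencies)      # pulled upward by three outliers
median = statistics.median(latencies)  # what a typical request sees
p95 = percentile(latencies, 95)        # still looks healthy
p99 = percentile(latencies, 99)        # exposes the slow requests

print(f"mean={mean} median={median} p95={p95} p99={p99}")
```

With this sample the mean is 79.4 ms, a number no request actually experienced. The median and 95th percentile both report 20 ms, and only the 99th percentile shows the 2000 ms requests.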

  28. Aggregate graphs
    another kind
    of average

  30. Breakout graphs
    see individual
    variations

  33. Aggregate alerts
    more dead servers
    than alive servers

  34. site’s up if any
    servers are up!

  35. Breakout alerts
    first dead server
    not all the servers

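The difference between the two alert styles above might look like this. It is a minimal sketch: the server names and the `server_up` mapping are hypothetical, and a real monitoring system would express these rules in its own configuration language.

```python
def aggregate_alert(server_up):
    """Aggregate rule: fire only when more servers are dead than alive."""
    dead = sum(1 for up in server_up.values() if not up)
    alive = len(server_up) - dead
    return dead > alive


def breakout_alerts(server_up):
    """Breakout rule: report every dead server by name, so the first
    failure is visible instead of waiting for most of the fleet to die."""
    return sorted(name for name, up in server_up.items() if not up)


# Assumed fleet state: one of four servers is down.
fleet = {"web1": True, "web2": False, "web3": True, "web4": True}

print(aggregate_alert(fleet))   # False: the aggregate rule stays silent
print(breakout_alerts(fleet))   # ['web2']: the breakout rule names the failure
```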

  36. Servers

  37. Servers
    you have no idea
    what is going on

  38. really.

  39. Routing

  40. Routing
    your app has this

  41. Routing
    how does it work?

  42. Development
    App
    You

  43. Production
    People Router
    Server
    App
    App
    Router
    Server
    App
    App
    Router

  44. Routing
    how slow is it?

  45. Routing
    does it back up?

  46. Request time

  47. Request time
    not the time
    you measure

  48. Request time
    wall-clock time
    from real clients

  49. Request time
    make requests from
    around the world

  50. Request time
    graph them

  51. Request time
    graph them
    alert on them

  52. Request time
    graph them
    alert on them
    thank me later

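One way to measure request time as a client sees it is to time the whole round trip from outside the app, rather than trusting the app's own timer. The sketch below uses only the Python standard library; `wall_clock_seconds` and `fetch` are hypothetical helper names, and the demonstration times a short sleep instead of a real request so that it runs without network access.

```python
import time
import urllib.request


def wall_clock_seconds(action):
    """Run action() and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    action()
    return time.perf_counter() - start


def fetch(url, timeout=10):
    """Download a URL completely, the way an external client would.
    Routing, queueing, and network time are all included in the wait."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


# Stand-in for a real request so the example runs offline.
elapsed = wall_clock_seconds(lambda: time.sleep(0.05))
print(f"took {elapsed * 1000:.1f} ms")
```

Against a real site the call would be something like `wall_clock_seconds(lambda: fetch("https://example.com/"))`, run on a schedule from machines in several regions, with the results graphed and alerted on. A probe from a single machine only covers one network path.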

  53. VM lag

  54. VM lag
    do you have it?

  55. VM lag
    do you check for it?

  56. VM lag
    do you know how
    to check for it?

  57. Runtime lag

  58. Runtime lag
    how do you tell you
    lost consciousness?

  59. Runtime lag
    do you have it?

  60. Runtime lag
    do you have it?
    you have it.

  61. Runtime lag
    do you have it?
    you have it.
    how bad is it?

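A process cannot observe its own pause directly, but it can ask to sleep for a known interval and check how late it woke up. The sketch below is one common way to do that, not necessarily the approach the talk had in mind. `measure_lag` is a hypothetical name, and the overshoot it reports lumps together VM scheduling, OS scheduling, and runtime pauses without telling them apart.

```python
import time


def measure_lag(interval=0.01, samples=5):
    """Sleep for `interval` seconds several times and record how much
    longer than requested each sleep actually took, in seconds."""
    lags = []
    for _ in range(samples):
        start = time.monotonic()
        time.sleep(interval)
        overshoot = time.monotonic() - start - interval
        lags.append(max(0.0, overshoot))  # clamp clock-resolution noise
    return lags


lags = measure_lag()
print(f"worst lag: {max(lags) * 1000:.2f} ms")
```

Running a loop like this continuously and graphing the worst value per minute gives a rough answer to "how bad is it?".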

  62. Data stores

  63. Data stores
    in production

  64. Data stores
    in production
    are distributed

  65. what does
    that mean?

  66. your experience
    (so far) is wrong

  67. Saving data

  68. Saving data
    tries to save
    your data

  69. Saving data
    might save
    your data

  70. Replication

  71. Replication
    is not data-
    saving magic

  72. Replication
    tries to save
    your data…

  73. Replication
    tries to save
    your data…
    repeatedly

  74. Postgres

  75. Postgres
    totally safe, right?

  76. Postgres
    async replication

  77. Postgres
    network failures
    lose “saved” data

  78. Redis

  79. Redis
    is single-threaded

  80. Redis
    has no failover

  81. Redis-sentinel
    elects a new leader

  82. Redis-sentinel
    throws away non-
    winners’ writes

  83. Mongo (gem < 1.8)
    returns before
    the first write

  84. Mongo (gem < 1.8)
    your data is on
    zero disks so far

  85. Mongo
    replica sets
    default to one write

  86. Mongo
    demand N copies
    survive N-1 failures

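The "demand N copies, survive N-1 failures" rule, and the way an asynchronously replicated write can be acknowledged and then lost, can be shown with a toy model. This is an illustrative simulation only. It does not reproduce the protocol of Postgres, Redis, or Mongo, and `acknowledged_write_survives` is a hypothetical function. It assumes the worst case: failures strike the nodes holding the write, right after the acknowledgement and before any further replication.

```python
def acknowledged_write_survives(total_nodes, copies_before_ack, failed_nodes):
    """Return True if a write that was acknowledged after reaching
    `copies_before_ack` nodes still exists once `failed_nodes` nodes die.

    Worst-case assumption: the nodes that fail are the first ones in the
    list, which are exactly the nodes that received the write."""
    has_write = [False] * total_nodes
    for i in range(copies_before_ack):
        has_write[i] = True
    survivors = has_write[failed_nodes:]  # the first `failed_nodes` are lost
    return any(survivors)


# One copy before the ack (async replication): a single failure loses it.
print(acknowledged_write_survives(3, copies_before_ack=1, failed_nodes=1))  # False

# Two copies before the ack: one failure is survivable, two are not.
print(acknowledged_write_survives(3, copies_before_ack=2, failed_nodes=1))  # True
print(acknowledged_write_survives(3, copies_before_ack=2, failed_nodes=2))  # False
```

In a real deployment the number of copies required before acknowledgement is a setting of the data store or its client driver, so it is worth checking what yours actually uses and then testing a failure on purpose.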

  87. trust no one

  88. if you didn’t try it
    you are guessing

  89. test it yourself

  90. So, in the end
    what did we learn?

  91. Production
    is fundamentally

  92. Production
    is fundamentally
    systemically

  93. Production
    is fundamentally
    systemically
    different

  94. Failures
    will happen

  95. Failures
    can be resisted

  96. Failures
    should not result
    in one-off patches

  97. Survival
    requires systematic
    trials & testing

  98. Development
    is not like
    production
