Development was the easy part (RubyNation 2015)

As developers, we spend our time in development environments, but our users only see our code in production. The differences between development and production can be extremely surprising, and surprises are bad when your users and revenue are at stake. I'll talk about the most common surprises of production environments, including metrics, networks, hardware, and data stores. By the end of the talk, you'll know how to use your existing tools to monitor and even improve your production environment, and which gotchas to watch out for.

André Arko

June 12, 2015

Transcript

  1. Development was
    the easy part

  3. André Arko
    @indirect

  5. The Ruby Way
    therubyway.io

  6. stickers!

  11. Development
    is very different

  12. from
    Production

  13. you rn →

  14. you later →

  15. Metrics

  16. Metrics
    are important

  17. Metrics
    tell you what
    is happening

  18. Metrics
    convince you
    you understand

  19. Averages
    convince you
    you understand

  20. Averages
    are lie-candy
    for your brain

  21. Averages
    (chart: x-axis from -5 to 5, y-axis from 0 to 0.4)

  25. just heard
    “we have
    a great
    average” →

  26. Averages
    mask problems

  27. (chart: x-axis from 0 to 10, y-axis from 0 to 250)

  28. Graph
    the median

  29. (chart: x-axis from 0 to 10, y-axis from 0 to 250)

  30. Graph
    95th percentile

  31. (chart: x-axis from 0 to 10, y-axis from 0 to 250)

  32. Graph
    99th percentile

  33. (chart: x-axis from 0 to 10, y-axis from 0 to 1000)

  34. Aggregate graphs
    another kind
    of average

  36. Breakout graphs
    see individual
    variations

  39. Aggregate alerts
    more dead servers
    than alive servers

  40. site’s up if any
    servers are up!

  41. Breakout alerts
    first dead server
    not all the servers

  42. Servers

  43. Servers
    you have no idea
    what is going on

  44. really.

  45. Routing

  46. Routing
    your app has this

  47. Routing
    how does it work?

  48. Development
    (diagram: You, App)

  49. Production
    (diagram: People, three Routers, two Servers, four Apps)

  50. Routing
    how slow is it?

  51. Routing
    does it back up?

  52. Request

  53. Request
    not the time
    you measure

  54. Request
    wall-clock time
    from real clients

  55. Request
    make requests from
    around the world

  56. Request
    graph them

  57. Request
    graph them
    alert on them

  58. Request
    graph them
    alert on them
    thank me later

  59. VM lag

  60. VM lag
    do you have it?

  61. VM lag
    do you check for it?

  62. VM lag
    do you know how
    to check for it?

  63. Runtime

  64. Runtime
    how do you tell you
    lost consciousness?

  65. Runtime
    do you have it?

  66. Runtime
    do you have it?
    you have it.

  67. Runtime
    do you have it?
    you have it.
    how bad is it?

  68. Data Stores

  69. Data Stores
    in production

  70. Data Stores
    in production
    are distributed

  71. what does
    that mean?

  72. your experience
    (so far) is wrong

  73. Saving data

  74. Saving data
    tries to save
    your data

  75. Saving data
    might save
    your data

  76. Replication

  77. Replication
    is not
    data-saving magic

  78. Replication
    tries to save
    your data…

  79. Replication
    tries to save
    your data…
    repeatedly

  80. Postgres

  81. Postgres
    totally safe, right?

  82. Postgres
    async replication

  83. Postgres
    network failures
    lose “saved” data

  84. Redis

  85. Redis
    is single-threaded

  86. Redis
    has no failover

  87. Mongo
    replica sets
    default to one write

  88. Mongo
    demand N copies
    survive N-1 failures

  89. trust no one

  90. if you didn’t
    try it yourself
    you are guessing

  91. test it yourself

  92. So, in the end
    what did we learn?

  93. Production
    is fundamentally

  94. Production
    is fundamentally
    systemically

  95. Production
    is fundamentally
    systemically
    different

  96. Failures
    will happen

  97. Failures
    can be resisted

  98. Failures
    should not result
    in one-off patches

  99. Survival
    requires systematic
    trials & testing

  100. Good luck!
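
Slides 19 through 33 argue that averages mask problems and recommend graphing the median and the 95th and 99th percentiles instead. A minimal Ruby sketch of that difference follows. The sample data and the helper names `mean` and `percentile` are assumptions made up for illustration, not taken from the talk, and the nearest-rank method used here is only one of several common percentile definitions.

```ruby
# Arithmetic mean of a list of numbers.
def mean(values)
  values.sum.to_f / values.length
end

# Nearest-rank percentile: the smallest sample such that at least
# pct percent of all samples are less than or equal to it.
def percentile(values, pct)
  raise ArgumentError, "values must not be empty" if values.empty?

  sorted = values.sort
  rank = (pct * sorted.length / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Hypothetical response times in milliseconds (not real measurements):
# 90 fast requests, 8 slow ones, and 2 very slow ones.
response_times_ms = Array.new(90, 50) + Array.new(8, 400) + Array.new(2, 5000)

puts "mean:   #{mean(response_times_ms)} ms"            # 177.0
puts "median: #{percentile(response_times_ms, 50)} ms"  # 50
puts "p95:    #{percentile(response_times_ms, 95)} ms"  # 400
puts "p99:    #{percentile(response_times_ms, 99)} ms"  # 5000
```

With this made-up data the mean of 177 ms describes no actual request: most requests take 50 ms, while the slowest 2% take 5 seconds, which only the 99th percentile reveals.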