Development was the easy part (RubyNation 2015)

As developers, we spend our time in development environments, but our users only see our code in production. The differences between development and production can be extremely surprising, and surprises are bad when your users and revenue are at stake. I'll talk about the most common surprises of production environments, including metrics, networks, hardware, and data stores. By the end of the talk, you'll know how to use your existing tools to monitor and even improve your production environment, and which gotchas to watch out for.

André Arko

June 12, 2015

Transcript

  1. Development was
    the easy part

  2. Development was
    the easy part

  3. André Arko
    @indirect

  4. The Ruby Way
    therubyway.io

  5. Development
    is very different

  6. from
    Production

  7. you later →

  8. Metrics
    are important

  9. Metrics
    tell you what
    is happening

  10. Metrics
    convince you
    you understand

  11. Averages
    convince you
    you understand

  12. Averages
    are lie-candy
    for your brain

  13. Averages
    [Chart: x-axis from -5 to 5, y-axis from 0 to 0.4]

  14. Averages
    [Chart: x-axis from -5 to 5, y-axis from 0 to 0.4]

  15. just heard
    “we have
    a great
    average” →

  16. Averages
    mask problems

  17. [Chart: x-axis from 0 to 10, y-axis from 0 to 250]

  18. Graph
    the median

  19. [Chart: x-axis from 0 to 10, y-axis from 0 to 250]

  20. Graph
    95th percentile

  21. [Chart: x-axis from 0 to 10, y-axis from 0 to 250]

  22. Graph
    99th percentile

  23. [Chart: x-axis from 0 to 10, y-axis from 0 to 1000]

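Slides 16 to 23 argue that an average hides the slow requests that the median and the 95th and 99th percentiles expose. A minimal sketch of that comparison, assuming made-up response times and a nearest-rank percentile (the `percentile` helper and the sample data are illustrative and do not come from the talk):

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds:
# 90 fast requests, 8 slow ones, and 2 very slow ones.
samples = [20] * 90 + [400] * 8 + [3000] * 2

mean = statistics.mean(samples)
median = percentile(samples, 50)
p95 = percentile(samples, 95)
p99 = percentile(samples, 99)

print(f"mean={mean} median={median} p95={p95} p99={p99}")
```

With this sample the mean lands between the typical request and the slow ones and describes neither, while only the 99th percentile shows the slowest requests.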

  24. Aggregate graphs
    another kind
    of average

  25. Breakout graphs
    see individual
    variations

  26. Aggregate alerts
    more dead servers
    than alive servers

  27. site’s up if any
    servers are up!

  28. Breakout alerts
    first dead server
    not all the servers

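The contrast between slides 26 and 28 can be sketched as two alert rules over the same per-server health data. This is an illustration only: the `status` mapping, the server names, and both function names are hypothetical, and a real monitoring system would supply its own health checks.

```python
def aggregate_alert(status):
    """Aggregate rule: fire only when more servers are dead than alive."""
    dead = sum(1 for is_up in status.values() if not is_up)
    alive = len(status) - dead
    return dead > alive


def breakout_alerts(status):
    """Breakout rule: name every dead server, starting with the first one."""
    return sorted(name for name, is_up in status.items() if not is_up)


# Hypothetical fleet: one of three servers is down.
status = {"web1": True, "web2": False, "web3": True}

print(aggregate_alert(status))   # the aggregate rule stays quiet
print(breakout_alerts(status))   # the breakout rule names web2
```

With one server down out of three, the aggregate rule says nothing while the breakout rule already points at the failed machine.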

  29. Servers
    you have no idea
    what is going on

  30. Routing
    your app has this

  31. Routing
    how does it work?

  32. Development
    [Diagram with labels: You, App]

  33. Production
    [Diagram with labels: People, Router (x3), Server (x2), App (x4)]

  34. Routing
    how slow is it?

  35. Routing
    does it back up?

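Slides 34 and 35 ask how slow routing is and whether it backs up. One common way to find out, which the slides themselves do not name, is to have the outermost router stamp each request with its arrival time and subtract that stamp inside the app. A sketch, assuming the router sets an `X-Request-Start` header holding milliseconds since the epoch, optionally prefixed with `t=` (the header name, its unit, and the `queue_time_ms` helper are assumptions here, and routers differ):

```python
import time


def queue_time_ms(headers, now_ms=None):
    """Return how long a request waited before reaching the app, in ms.

    Assumes a hypothetical router that sets X-Request-Start to
    milliseconds since the epoch, optionally prefixed with "t=".
    Returns None when the header is missing or unparseable.
    """
    raw = headers.get("X-Request-Start")
    if raw is None:
        return None
    if raw.startswith("t="):
        raw = raw[2:]
    try:
        started_ms = float(raw)
    except ValueError:
        return None
    if now_ms is None:
        now_ms = time.time() * 1000.0
    # Clock skew between machines can make the difference negative.
    return max(0.0, now_ms - started_ms)
```

Graphing this value per request shows both the routing overhead and any backlog: a queue that is backing up appears as a steadily rising wait time.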

  36. Request
    not the time
    you measure

  37. Request
    wall-clock time
    from real clients

  38. Request
    make requests from
    around the world

  39. Request
    graph them

  40. Request
    graph them
    alert on them

  41. Request
    graph them
    alert on them
    thank me later

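Slides 36 to 41 distinguish the time the app measures from the wall-clock time a real client experiences. A minimal sketch of a client-side probe, using only the Python standard library; the `wall_clock_ms` and `fetch_url` names are illustrative, and where the probes run and where the numbers get graphed is left open:

```python
import time
import urllib.request


def wall_clock_ms(fetch):
    """Run fetch() and return the elapsed wall-clock time in milliseconds."""
    start = time.perf_counter()
    fetch()
    return (time.perf_counter() - start) * 1000.0


def fetch_url(url, timeout=10):
    """Fetch a URL and read the whole body, as a real client would."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


# Example, to be run from several locations outside your own network
# (the URL is a placeholder):
#   elapsed = wall_clock_ms(lambda: fetch_url("https://example.com/"))
```

Because the timer wraps the whole fetch, the result includes DNS, connection setup, routing, and queueing, none of which the app's own request timer can see.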

  42. VM lag
    do you have it?

  43. VM lag
    do you check for it?

  44. VM lag
    do you know how
    to check for it?

  45. Runtime
    how do you tell you
    lost consciousness?

  46. Runtime
    do you have it?

  47. Runtime
    do you have it?
    you have it.

  48. Runtime
    do you have it?
    you have it.
    how bad is it?

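Slides 42 to 48 ask how a process can tell that it was paused, whether by the virtual machine or by its own runtime. One way to check, offered here as an assumption about what the slides have in mind, is to sleep for a short, known interval and measure how much longer than requested the sleep took. The `worst_sleep_overshoot` name and its parameters are illustrative:

```python
import time


def worst_sleep_overshoot(interval_s=0.01, rounds=20,
                          sleep=time.sleep, clock=time.monotonic):
    """Sleep repeatedly for interval_s seconds and return the largest
    overshoot in seconds, meaning time beyond the requested interval
    during which this process was not running."""
    worst = 0.0
    for _ in range(rounds):
        start = clock()
        sleep(interval_s)
        overshoot = clock() - start - interval_s
        worst = max(worst, overshoot)
    return worst


# Example: report the worst pause seen over roughly 0.2 seconds.
#   print(worst_sleep_overshoot())
```

A small overshoot is normal scheduler jitter. Large or frequent overshoots suggest the process is being paused, and graphing the value over time answers "how bad is it?".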

  49. Data Stores
    in production

  50. Data Stores
    in production
    are distributed

  51. what does
    that mean?

  52. your experience
    (so far) is wrong

  53. Saving data
    tries to save
    your data

  54. Saving data
    might save
    your data

  55. Replication
    is not data-saving magic

  56. Replication
    tries to save
    your data…

  57. Replication
    tries to save
    your data…
    repeatedly

  58. Postgres
    totally safe, right?

  59. Postgres
    async replication

  60. Postgres
    network failures
    lose “saved” data

  61. Redis
    is single-threaded

  62. Redis
    has no failover

  63. Mongo
    replica sets
    default to one write

  64. Mongo
    demand N copies
    survive N-1 failures

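Slides 59 to 64 make the point that a write acknowledged by one node can vanish when that node is lost, while a write held by N nodes survives N-1 failures. The toy model below illustrates only that arithmetic. `ToyCluster` is hypothetical and is not the API of Postgres, Redis, or MongoDB; in MongoDB the corresponding setting is the write concern `w`.

```python
class ToyCluster:
    """Toy model of one primary plus replicas. Not a real database client."""

    def __init__(self, replicas):
        self.primary = []
        self.replicas = [[] for _ in range(replicas)]

    def write(self, value, copies=1):
        """Acknowledge the write once `copies` nodes hold the value.
        The primary counts as the first copy."""
        if not 1 <= copies <= 1 + len(self.replicas):
            raise ValueError("copies must be between 1 and the node count")
        self.primary.append(value)
        for replica in self.replicas[:copies - 1]:
            replica.append(value)
        return True  # the client is told the data is saved

    def fail_over(self):
        """Lose the primary before it replicates anything further,
        then promote the first remaining replica."""
        self.primary = self.replicas.pop(0)


# One copy: the acknowledged write is gone after a single failure.
risky = ToyCluster(replicas=2)
risky.write("order-1", copies=1)
risky.fail_over()

# Two copies: the same failure leaves the write intact.
safer = ToyCluster(replicas=2)
safer.write("order-1", copies=2)
safer.fail_over()

print("order-1" in risky.primary, "order-1" in safer.primary)
```

In both cases the client was told the write succeeded. Only the number of copies demanded before acknowledgement decided whether the data outlived the failure, which is why slide 67 says to test it yourself.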

  65. trust no one

  66. if you didn’t
    try it yourself
    you are guessing

  67. test it yourself

  68. So, in the end
    what did we learn?

  69. Production
    is fundamentally

  70. Production
    is fundamentally
    systemically

  71. Production
    is fundamentally
    systemically
    different

  72. Failures
    will happen

  73. Failures
    can be resisted

  74. Failures
    should not result
    in one-off patches

  75. Survival
    requires systematic
    trials & testing

  76. Good luck!
