Development Was the Easy Part (MagmaConf 2014)

While writing code that actually works is (in all likelihood) impossible, there are other things that are even harder. This talk is about one of those: production environments. As a developer, you spend almost all of your time in development. That's a shame, because your code will be tested orders of magnitude more times in production. As tautological as it is to say, development and production are incredibly different places. This talk will try to make you surprised and uncomfortable about your own production environment. We'll focus on the ways that metrics, networks, and data stores can be surprising or easy to misunderstand… until it's 3am and your phone won't stop ringing.

André Arko

June 05, 2014

Transcript

  1. Development was
    the easy part

  2. André Arko
    @indirect

  3. Development
    is very different

  4. from
    Production

  5. you later →

  6. Metrics
    are important

  7. Metrics
    tell you what
    is happening

  8. Metrics
    convince you
    you understand

  9. Averages
    convince you
    you understand

  10. Averages
    are lie-candy
    for your brain

  11. Averages
    [Chart: x-axis from -5 to 5, y-axis from 0 to 0.4]

  12. Averages
    [Chart: x-axis from -5 to 5, y-axis from 0 to 0.4]

  13. just heard
    “we have a great average” →

  14. Averages
    mask problems

  15. [Chart: x-axis from 0 to 10, y-axis from 0 to 250]

  16. Graph
    the median

  17. [Chart: x-axis from 0 to 10, y-axis from 0 to 250]

  18. Graph
    95th percentile

  19. [Chart: x-axis from 0 to 10, y-axis from 0 to 250]

  20. Graph
    99th percentile
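
The progression from average to median to 95th and 99th percentile can be sketched in a few lines of Python. The latency numbers here are invented for illustration and do not come from the talk, and `percentile` is a hypothetical helper that uses the nearest-rank method; a real metrics system may interpolate differently.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented data: 97 fast requests and 3 very slow ones, in milliseconds.
latencies_ms = [50] * 97 + [4000] * 3

summary = {
    "mean": statistics.mean(latencies_ms),
    "median": statistics.median(latencies_ms),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
```

With this made-up data the mean is 168.5 ms, the median and the 95th percentile are both 50 ms, and only the 99th percentile (4000 ms) shows that some requests take four seconds.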

  21. [Chart: x-axis from 0 to 10, y-axis from 0 to 1000]

  22. Aggregate graphs
    another kind
    of average

  23. Breakout graphs
    see individual
    variations

  24. Aggregate alerts
    more dead servers
    than alive servers

  25. site’s up if any
    servers are up!

  26. Breakout alerts
    first dead server
    not all the servers
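
The difference between the two alert styles can be shown with a small sketch. The function names and the four-server fleet are assumptions made up for this example; they are not taken from the talk or from any monitoring tool.

```python
def aggregate_alert(server_up):
    """Fire only when dead servers outnumber live ones."""
    dead = sum(1 for up in server_up.values() if not up)
    alive = len(server_up) - dead
    return dead > alive


def breakout_alerts(server_up):
    """Return the name of every dead server, so the first failure is visible."""
    return sorted(name for name, up in server_up.items() if not up)


# Invented fleet: one of four servers is down.
fleet = {"web1": True, "web2": True, "web3": False, "web4": True}

aggregate_fired = aggregate_alert(fleet)   # False: most servers are still up
dead_servers = breakout_alerts(fleet)      # ["web3"]: the first failure is reported
```

The aggregate check stays quiet until most of the fleet is gone, while the breakout check reports the first dead server.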

  27. Servers
    you have no idea
    what is going on

  28. Routing
    your app has this

  29. Routing
    how does it work?

  30. Development
    App
    You

  31. Production
    [Diagram labels: People, Router (×3), Server (×2), App (×4)]

  32. Routing
    how slow is it?

  33. Routing
    does it back up?

  34. Request time

  35. Request time
    not the time
    you measure

  36. Request time
    wall-clock time
    from real clients

  37. Request time
    make requests from
    around the world

  38. VM lag
    do you have it?

  39. VM lag
    do you check for it?

  40. VM lag
    do you know how
    to check for it?

  41. Runtime lag
    how do you tell you
    lost consciousness?

  42. Runtime lag
    do you have it?

  43. Runtime lag
    do you have it?
    you have it.

  44. Runtime lag
    do you have it?
    you have it.
    how bad is it?
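
A process cannot observe its own pause directly, but it can compare what it asked for against a clock. One common approach, sketched here as an illustration rather than as the talk's method, is to request a short sleep and record how late the wakeup is. `measure_lag` is a hypothetical name, and the interval and round count are arbitrary.

```python
import time


def measure_lag(interval=0.01, rounds=20):
    """Sleep for `interval` seconds repeatedly and record, in seconds,
    how much longer than requested each sleep actually took."""
    overshoots = []
    for _ in range(rounds):
        start = time.monotonic()
        time.sleep(interval)
        elapsed = time.monotonic() - start
        overshoots.append(elapsed - interval)
    return overshoots


lags = measure_lag()
worst_lag_ms = max(lags) * 1000  # the longest unexplained pause seen, in ms
```

Overshoot measured this way includes scheduler delay, VM pauses, and runtime pauses such as garbage collection, so a large `worst_lag_ms` says that a pause happened but not which layer caused it.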

  45. Data stores
    in production

  46. Data stores
    in production
    are distributed

  47. what does
    that mean?

  48. your experience
    (so far) is wrong

  49. Saving data
    tries to save
    your data

  50. Saving data
    might save
    your data

  51. Replication
    is not data-
    saving magic

  52. Replication
    tries to save
    your data…

  53. Replication
    tries to save
    your data…
    repeatedly

  54. Postgres
    totally safe, right?

  55. Postgres
    async replication

  56. Postgres
    network failures
    lose “saved” data

  57. Redis
    is single-threaded

  58. Redis
    has no failover

  59. Redis-sentinel
    elects a new leader

  60. Redis-sentinel
    throws away non-
    winners’ writes

  61. Mongo (gem < 1.8)
    returns before
    the first write

  62. Mongo (gem < 1.8)
    your data is on
    zero disks so far

  63. Mongo
    replica sets
    default to one write

  64. Mongo
    demand N copies
    survive N-1 failures
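
The "N copies survive N-1 failures" rule can be checked with a toy model. Everything below is an assumption made for illustration: `ToyCluster` is an in-memory stand-in rather than a real database client, and it models the worst case in which nodes fail before any further replication happens.

```python
from itertools import combinations


class ToyCluster:
    """In-memory stand-in for a replicated store (not a real driver)."""

    def __init__(self, node_count):
        self.nodes = [set() for _ in range(node_count)]

    def write(self, key, copies):
        """Acknowledge once `copies` nodes hold the key. In this worst-case
        model the remaining nodes never receive it."""
        for node in self.nodes[:copies]:
            node.add(key)
        return True  # the client is told the write was "saved"

    def survives(self, key, failed):
        """True if any node outside the `failed` set still holds the key."""
        return any(
            key in node
            for index, node in enumerate(self.nodes)
            if index not in failed
        )


def worst_case_survives(node_count, copies, failures):
    """Check every possible choice of `failures` failed nodes."""
    cluster = ToyCluster(node_count)
    cluster.write("example-key", copies)
    return all(
        cluster.survives("example-key", set(failed))
        for failed in combinations(range(node_count), failures)
    )
```

In this model `worst_case_survives(5, 1, 1)` is false, because an acknowledged write held by one node is lost if that node fails, while `worst_case_survives(5, 3, 2)` is true and `worst_case_survives(5, 3, 3)` is false.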

  65. trust no one

  66. if you didn’t try it
    you are guessing

  67. test it yourself

  68. So, in the end
    what did we learn?

  69. Production
    is fundamentally

  70. Production
    is fundamentally
    systemically

  71. Production
    is fundamentally
    systemically
    different

  72. Failures
    will happen

  73. Failures
    can be resisted

  74. Failures
    should not result
    in one-off patches

  75. Survival
    requires systematic
    trials & testing

  76. Development
    is not like
    production
