Development is just the tip of the iceberg

André Arko
October 10, 2013

Delivered at DevCon TLV in Tel Aviv, Israel, this talk focuses on some of the common ways that production environments are different from development environments. Even though developers interact with development 90% of the time, users are _always_ interacting with code in production. It's incredibly important to remember not just that they aren't the same, but how they are different. Armed with that knowledge, we can make tradeoffs in production that produce the best results possible for our particular software.

Transcript

  1. Development
    is just the tip
    of the iceberg

  2. André Arko
    @indirect

  5. DANGER
    PRODUCTION AHEAD

  6. Metrics

  7. Metrics
    are important

  8. Metrics
    tell you what
    is happening

  9. Metrics
    convince you
    you understand

  10. Averages
    convince you
    you understand

  11. but brains are
    pretty weird

  12. you probably don’t
    understand averages

  13. Average
    (right?)

  16. !

  17. Averages
    mask problems

  18. Averages
    !

  19. Instead
    graph the full
    distribution

  20. Instead
    graph median,
    mean, and 95th
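
To make the point of slides 17 to 20 concrete, here is a minimal Python sketch that computes the median, mean, and 95th percentile for one set of response times. The sample data and the helper names `percentile` and `summarize` are assumptions made for illustration; they are not from the talk, and the nearest-rank percentile is only one of several common definitions.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of the data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Return the three numbers slide 20 suggests graphing."""
    return {
        "median": statistics.median(samples),
        "mean": statistics.fmean(samples),
        "p95": percentile(samples, 95),
    }


# Hypothetical response times in milliseconds: 90 fast requests and 10 very slow ones.
latencies_ms = [100] * 90 + [2000] * 10
summary = summarize(latencies_ms)
print(summary)
```

With this made-up data the mean is 290 ms, a figure that no request actually experienced. The median (100 ms) and the 95th percentile (2000 ms) together show that most requests are fast while one in ten takes two seconds.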

  21. Aggregates
    another kind
    of average

  22. Srsly guise
    breakout graphs

  23. Srsly guise
    alert on broken-out
    metrics

  24. Srsly guise
    alerts on aggregates
    are probably too late
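
A small Python sketch of why slides 21 to 24 prefer alerts on broken-out metrics. The host names, error rates, and threshold below are all hypothetical; the only claim is that a fleet-wide mean can stay under a threshold while one host is far over it.

```python
import statistics

# Hypothetical error rates (errors per 100 requests) per host; the names are made up.
error_rate_by_host = {"web-1": 0.5, "web-2": 0.4, "web-3": 0.6, "web-4": 12.0}

THRESHOLD = 5.0  # assumed alert threshold, in errors per 100 requests


def aggregate_alert(rates, threshold):
    """Alert only when the fleet-wide mean crosses the threshold."""
    return statistics.fmean(rates.values()) > threshold


def broken_out_alerts(rates, threshold):
    """Return every host whose own rate crosses the threshold."""
    return sorted(host for host, rate in rates.items() if rate > threshold)


print(aggregate_alert(error_rate_by_host, THRESHOLD))    # fleet mean is about 3.4
print(broken_out_alerts(error_rate_by_host, THRESHOLD))
```

Here the aggregate check stays quiet while the per-host check flags `web-4`. By the time the aggregate crosses the threshold, several hosts would have to be failing.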

  25. Servers

  26. Servers
    you have no idea what is
    going on

  27. really.

  28. it’s 3am.
    do you know where your
    application is?

  29. Routing
    your app has this

  30. Routing
    how slow is it?

  31. Routing
    does it back up?

  32. Request time

  33. Request time
    not your metrics,
    I mean for real

  34. Request time
    make requests
    from all over

  35. Request time
    graph them

  36. Request time
    graph them
    alert on them

  37. Request time
    graph them
    alert on them
    thank me later
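
One way to measure request time "for real" is to time whole requests from outside the application, the way a client would. The sketch below uses only the Python standard library; the function names and the injectable `fetch` parameter are assumptions added for illustration, and a real setup would run the probe from several locations and ship the timings to whatever graphing and alerting system is in use.

```python
import time
import urllib.request


def fetch_with_urllib(url, timeout=10.0):
    """Perform one real GET and read the whole body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def time_request(url, fetch=fetch_with_urllib):
    """Return wall-clock seconds for one full request, as a client would see it."""
    started = time.perf_counter()
    fetch(url)
    return time.perf_counter() - started


def probe(url, attempts=5, fetch=fetch_with_urllib):
    """Time several requests; failures are recorded as None rather than hidden."""
    timings = []
    for _ in range(attempts):
        try:
            timings.append(time_request(url, fetch))
        except OSError:  # URLError and socket timeouts are OSError subclasses
            timings.append(None)
    return timings


# Example (performs real network requests, so it is left commented out):
# print(probe("https://example.com/"))
```

Because the timer wraps the whole request, the result includes DNS, connection setup, routing, and queueing, none of which show up in timings recorded inside the application.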

  38. Runtime lag

  39. Runtime lag
    (how do you tell you lost
    consciousness?)

  40. Runtime lag
    do you have it?

  41. Runtime lag
    do you have it?
    (yes)

  42. Runtime lag
    how bad is it?

  43. Runtime lag
    how do you track it?
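
The slides do not say how to track runtime lag. One common approach, sketched here as an assumption rather than as the speaker's method, is to ask the runtime to wake up after a fixed interval and record how late it actually does. Pauses from garbage collection, a blocked event loop, or an overloaded scheduler show up as overshoot. The `sleep` and `clock` parameters are hypothetical hooks that make the function testable.

```python
import time


def measure_lag(interval=0.05, samples=5, sleep=time.sleep, clock=time.monotonic):
    """Sleep `interval` seconds repeatedly and record how late each wake-up is."""
    lags = []
    for _ in range(samples):
        started = clock()
        sleep(interval)
        elapsed = clock() - started
        # Anything beyond the requested interval is time the runtime was not running us.
        lags.append(max(0.0, elapsed - interval))
    return lags


if __name__ == "__main__":
    print(max(measure_lag()))
```

In a long-running service this loop would typically live in a background thread or timer, with the worst lag per minute sent to the same graphs as the other metrics.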

  44. VM lag

  45. VM lag
    do you have it?

  46. VM lag
    do you even
    check for it?

  47. VM lag
    do you know how
    to check for it?

  48. Data stores

  49. Data stores
    in production

  50. Data stores
    in production
    are distributed

  51. what does
    that mean?

  52. your experience
    (so far) is wrong

  53. Saving data

  54. Saving data
    tries to save
    your data

  55. Saving data
    might save
    your data

  56. Replication

  57. Replication
    doesn’t save you

  58. Postgres
    async replication

  59. Postgres
    network failures
    can lose saved data

  60. Redis
    has no failover

  61. Redis-sentinel
    elects a new leader

  62. Redis-sentinel
    keeps one leader’s saves
    during failures

  63. Mongo
    returns before
    the first write

  64. Mongo
    your data is on
    zero disks (so far)

  65. Mongo
    demand N copies
    survive N-1 failures
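
The "demand N copies, survive N-1 failures" rule can be illustrated with a toy model. The `ToyCluster` class below is hypothetical and purely in-memory; it is not a MongoDB, Postgres, or Redis client. It models the worst case, in which copies beyond the acknowledged ones never arrive. In MongoDB the corresponding setting is the write concern.

```python
import random


class ToyCluster:
    """In-memory stand-in for a replicated store; not a real database client."""

    def __init__(self, replica_count):
        self.replicas = [dict() for _ in range(replica_count)]

    def write(self, key, value, required_acks):
        """Copy the value to `required_acks` replicas before returning.

        required_acks=0 models "returns before the first write": the caller
        gets control back while the data is stored nowhere.
        """
        if required_acks > len(self.replicas):
            raise ValueError("cannot demand more copies than replicas")
        for replica in self.replicas[:required_acks]:
            replica[key] = value
        return required_acks

    def fail(self, count, rng=random):
        """Destroy `count` randomly chosen replicas."""
        doomed = set(rng.sample(range(len(self.replicas)), count))
        self.replicas = [r for i, r in enumerate(self.replicas) if i not in doomed]

    def read(self, key):
        """Return the value from any surviving replica that has it, else None."""
        for replica in self.replicas:
            if key in replica:
                return replica[key]
        return None


cluster = ToyCluster(replica_count=3)
cluster.write("order-1", "paid", required_acks=2)
cluster.fail(1)
print(cluster.read("order-1"))  # two copies were demanded, so one failure cannot lose it
```

With `required_acks=0` the same read returns `None` even when no replica has failed, which is the situation slide 64 describes.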

  66. trust no one

  67. if you didn’t try it,
    you are guessing

  68. try it yourself

  69. So
    what did we learn?

  70. Production
    is fundamentally

  71. Production
    is fundamentally
    systemically

  72. Production
    is fundamentally
    systemically
    different

  73. Failures
    will happen

  74. Failures
    can be resisted

  75. Failures
    should not result
    in one-off patches

  76. Survival
    requires systematic
    deliberation & design

  77. Survival
    requires systematic
    trials & testing

  78. production
    is not
    development

  79. don’t you
    forget it
    !