Designing For Disaster - Preparing Your Code For Emergencies

If there's a problem in production, would you know it had happened? How would you find out about it? How would you piece together what had happened? And how would you get the fix out the door?

In this talk, Stuart will look at how you can prepare your web app for when disaster strikes. He'll show you what information you need, and how to build that into your app as you go. He'll cover how to use that information to investigate and resolve problems, and give you some food for thought about releasing during an emergency.

Presented at the PHP North West user group in Manchester on 5th June, 2018.

Stuart Herbert

June 05, 2018
Transcript

  1. A presentation by @stuherbert for @GanbaroDigital

    Designing For Disaster: Prepare Your Code For Emergencies
  2. About Stuart

    • Industry veteran: architect, engineer, leader, manager, mentor
    • F/OSS contributor since 1994
    • Talking and writing about PHP since 2004
    • Chief Software Archaeologist
    • Building Quality @GanbaroDigital
  3. Follow me: @stuherbert

    I do tweet a lot about non-tech stuff though :)
  4. @GanbaroDigital This is a follow-on from my EdPUG talk from

    June 2017.
  5. @GanbaroDigital Ganbaro Digital on YouTube.com

  6. @GanbaroDigital I’m here to talk to you about disasters in

    Prod ...
  7. @GanbaroDigital ... and how I prepare my code for them.

  8. @GanbaroDigital “An emergency is the worst possible time to be fixing processes and practices.”
  9. @GanbaroDigital Three Desired Outcomes

    1. Know That There Is A Problem
    2. Know What The Problem Is
    3. Make Handling Disasters Normal
  10. @GanbaroDigital “Guesswork is for amateurs.”

  11. @GanbaroDigital “Assumptions are the mother of all screwups.”

  12. @GanbaroDigital “Under pressure, people revert to habit.”

  13. @GanbaroDigital This is just my experience. I’m here to learn

    from you too!
  14. @GanbaroDigital Know That There Is A Problem

  15. @GanbaroDigital Passive Monitoring

  16. @GanbaroDigital How do you find out if something’s wrong in Prod?
  17. @GanbaroDigital https://unsplash.com/photos/28v9cq7ytNU Everything On Fire

  18. @GanbaroDigital Do you use your website / app regularly enough to notice when things are down?
  19. @GanbaroDigital https://flic.kr/p/83Z8Kp Angry Customers!

  20. @GanbaroDigital Will customers tell you there’s a problem ... or will they simply go elsewhere?

  21. @GanbaroDigital Are you aware when customers do report problems?
  22. @GanbaroDigital Who Talks To Customers?

    • ... different part of the building
    • ... different building
    • ... different timezone
    • ... different agenda
  26. @GanbaroDigital “To reliably know when there’s a problem we need to cut out the middle man.”
  27. @GanbaroDigital https://unsplash.com/photos/LkD_IH8_K8k Monitoring

  28. @GanbaroDigital You might be thinking ... “isn’t monitoring a devops*

    responsibility?”
  29. @GanbaroDigital * devops is not a job title!

  30. @GanbaroDigital Log Monitoring

  31. @GanbaroDigital Do you monitor your logs?

  32. @GanbaroDigital I don’t (any more).

  33. @GanbaroDigital Modern-day log messages are fundamentally unsuited for monitoring.

  34. @GanbaroDigital “As an industry, we have stopped treating log messages as part of our interfaces.”
  35. @GanbaroDigital Log messages aren’t documented these days.

  36. @GanbaroDigital Log messages, they come and they go from version

    to version.
  37. @GanbaroDigital “Logs aren’t machine readable.”

  38. @GanbaroDigital Structured log messages are rare.

  39. @GanbaroDigital Even structured logs aren’t additive facts. They’re statements.

  40. @GanbaroDigital “Logs aren’t reliable.”

  41. @GanbaroDigital If you stopped receiving log messages, would your monitoring notice?
  42. @GanbaroDigital During emergencies, logs get lost.

  43. @GanbaroDigital Logs are transient. At scale, it’s not practical to

    keep them for very long.
  44. @GanbaroDigital ... and, at scale, we have to reduce logging

    to avoid IO starvation.
  45. @GanbaroDigital What can you monitor?

  46. @GanbaroDigital “Use metrics for monitoring.”

  47. @GanbaroDigital Metrics

  48. @GanbaroDigital Metrics have: a hierarchical name, and a numerical value.

  49. @GanbaroDigital Common Metrics

    • Counters - how many times has an event happened?
    • Timers - how long is something taking to happen?
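
To make the two metric types concrete, here is a minimal PHP sketch that builds StatsD-style datagrams of the form `name:value|type`, where `c` marks a counter and `ms` marks a timer in milliseconds. It is an illustration only, not code from the talk: the function names and the `shop.checkout.*` metric names are hypothetical, and it prints the lines instead of sending them anywhere.

```php
<?php

/**
 * Build a StatsD-style datagram: "<name>:<value>|<type>".
 * The hierarchical name uses dots to separate levels.
 */
function formatMetric(string $name, int $value, string $type): string
{
    return sprintf('%s:%d|%s', $name, $value, $type);
}

/** Counter: how many times has an event happened? */
function counter(string $name, int $by = 1): string
{
    return formatMetric($name, $by, 'c');
}

/** Timer: how long did something take? Inputs are timestamps in seconds. */
function timer(string $name, float $startedAt, float $finishedAt): string
{
    $elapsedMs = (int) round(($finishedAt - $startedAt) * 1000);

    return formatMetric($name, $elapsedMs, 'ms');
}

// Hypothetical business metrics for a shop checkout.
$lines = [
    counter('shop.checkout.orders.placed'),
    timer('shop.checkout.payment.duration', 10.000, 10.250),
];

// A real app would send each line as a UDP datagram to a StatsD daemon
// (or hand the name and value to a client library); here we only print them.
echo implode(PHP_EOL, $lines), PHP_EOL;
```

In real code the start and finish timestamps would typically come from `microtime(true)` taken either side of the operation being timed.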
  50. @GanbaroDigital Some metrics can come from monitoring log files.

  51. @GanbaroDigital Web Server Metrics

    • Incoming requests
    • Response codes (HTTP 500 etc)
    • Response timings
    • Response sizes
  52. @GanbaroDigital Your web server doesn’t understand your app. Or your

    business.
  53. @GanbaroDigital Build your apps to emit metrics that mean something

    to your business.
  54. @GanbaroDigital composer require league/statsd

  55. @GanbaroDigital https://www.slideshare.net/brianbrazil/an-introduction-to-prometheus-grafanacon-2016

  56. @GanbaroDigital https://unsplash.com/photos/hOf9BaYUN88 Dashboards

  57. @GanbaroDigital https://flic.kr/p/7iMNDQ Warning Lights

  58. @GanbaroDigital https://unsplash.com/photos/dLTpk6N31Fc Alarms!

  59. @GanbaroDigital If we’re not monitoring app logs, what’s the point

    of them?
  60. @GanbaroDigital Know What The Problem Is

  61. @GanbaroDigital Why We Log

  62. @GanbaroDigital Do you rely on intuitive leaps to solve problems in Prod?
  63. @GanbaroDigital Intuition works well ... ... until intuition finds a

    better paid job somewhere else!
  64. @GanbaroDigital “Intuition-driven production problem solving is not sustainable.”

  65. @GanbaroDigital Someone is going to be left holding the baby.

    That someone might be future you.
  66. @GanbaroDigital What can we replace intuition with?

  67. @GanbaroDigital https://flic.kr/p/cwK7UN Science!

  68. @GanbaroDigital https://flic.kr/p/9gh9EA Evidence!

  69. @GanbaroDigital https://unsplash.com/photos/nYIQYg8cQVc Follow The Evidence!

  70. @GanbaroDigital https://unsplash.com/photos/pe_R74hldW4 Clarity

  71. @GanbaroDigital https://flic.kr/p/9b24xc We Want To See The Internals

  72. @GanbaroDigital “Use logs for investigations.”

  73. @GanbaroDigital Consumable Logs

  74. @GanbaroDigital Log collection and management is based on UNIX syslog.

  75. @GanbaroDigital 1 line of text == 1 complete log message

  76. @GanbaroDigital PHP Fatal error: Uncaught Exception: Something went wrong in /tmp/test.php:3

    Stack trace:
    #0 {main}
    thrown in /tmp/test.php on line 3
  77. @GanbaroDigital Use a custom formatter in Monolog.

  78. @GanbaroDigital Convert line breaks and tabs into \n and \t

    strings.
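
As a rough illustration of that conversion, the PHP sketch below replaces real line breaks and tabs with the two-character strings `\n` and `\t`, so that a multi-line error still arrives as one line of text. The function name `toSingleLine()` is hypothetical; in a Monolog setup, logic like this would live inside the custom formatter, which is not shown here.

```php
<?php

/**
 * Collapse a multi-line message onto a single line by replacing
 * line breaks and tabs with printable escape strings.
 */
function toSingleLine(string $message): string
{
    // strtr() with an array tries the longest key first at each position,
    // so a Windows "\r\n" pair becomes one '\n', not two.
    return strtr($message, [
        "\r\n" => '\n',
        "\n"   => '\n',
        "\r"   => '\n',
        "\t"   => '\t',
    ]);
}

// The fatal error from the earlier slide, as PHP would emit it over several lines.
$raw = "PHP Fatal error: Uncaught Exception: Something went wrong in /tmp/test.php:3\n"
     . "Stack trace:\n"
     . "#0 {main}\n"
     . "\tthrown in /tmp/test.php on line 3";

echo toSingleLine($raw), PHP_EOL;
```

The single quotes around `'\n'` and `'\t'` matter: PHP leaves those as a literal backslash followed by a letter, which is what keeps the output on one line.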
  79. @GanbaroDigital

  80. @GanbaroDigital Structured Logs

  81. @GanbaroDigital

  82. @GanbaroDigital datetime || log-level || message || data || IDs

  83. @GanbaroDigital Datetime is complicated. Common practice is to log everything

    in UTC to create consistent timelines.
  84. @GanbaroDigital datetime || log-level || message || data || IDs

  85. @GanbaroDigital This has been standardised for decades. Don’t invent your

    own.
  86. @GanbaroDigital RFC 5424 Log Levels

    1. emerg
    2. alert
    3. crit
    4. err
    5. warning
    6. notice
    7. info
    8. debug
  87. @GanbaroDigital datetime || log-level || message || data || IDs

  88. @GanbaroDigital vnsprintf() format strings work really well for this.

  89. @GanbaroDigital

  90. @GanbaroDigital datetime || log-level || message || data || IDs

  91. @GanbaroDigital Make this the params array used to build the

    ‘message’ field ... + any additional items you need.
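
One way to picture the five fields together is the PHP sketch below. It is an assumption-laden illustration, not the speaker's implementation: the function name `buildLogLine()`, the JSON encoding, and the sample order data are all hypothetical, and it uses PHP's built-in `vsprintf()` with positional placeholders where the slides mention `vnsprintf()`. The severity codes 0 to 7 are the numeric values RFC 5424 assigns to the level names.

```php
<?php

// RFC 5424 severity names mapped to their numeric codes.
const SEVERITIES = [
    'emerg' => 0, 'alert' => 1, 'crit' => 2, 'err' => 3,
    'warning' => 4, 'notice' => 5, 'info' => 6, 'debug' => 7,
];

/**
 * Build one structured log line: datetime, level, message, data, ids.
 *
 * @param array $params values used to fill the format string, in order
 * @param array $extra  additional data items that are not in the message
 * @param array $ids    tracking tokens for filtering related messages
 */
function buildLogLine(
    string $level,
    string $format,
    array $params = [],
    array $extra = [],
    array $ids = [],
    ?DateTimeImmutable $now = null
): string {
    if (!array_key_exists($level, SEVERITIES)) {
        throw new InvalidArgumentException("Unknown log level: {$level}");
    }

    $utc = new DateTimeZone('UTC');
    $now = ($now ?? new DateTimeImmutable('now', $utc))->setTimezone($utc);

    $record = [
        'datetime' => $now->format('Y-m-d\TH:i:s.v\Z'), // always logged in UTC
        'level'    => $level,
        'message'  => vsprintf($format, array_values($params)),
        'data'     => $params + $extra, // the message params, plus extras
        'ids'      => $ids,
    ];

    // JSON keeps the whole record on a single line of text.
    return json_encode($record, JSON_UNESCAPED_SLASHES | JSON_THROW_ON_ERROR);
}

// Hypothetical usage: a local (London) timestamp is normalised to UTC.
$line = buildLogLine(
    'err',
    'Payment failed for order %s after %d attempts',
    ['orderId' => 'A123', 'attempts' => 3],
    ['gateway' => 'hypothetical-pay'],
    ['request_uid' => '9f2c4e1ab07d3356'],
    new DateTimeImmutable('2018-06-05 19:00:00', new DateTimeZone('Europe/London'))
);

echo $line, PHP_EOL;
```

Keeping the params array in the `data` field means the values are searchable on their own, even if the wording of the message changes between releases.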
  92. @GanbaroDigital datetime || log-level || message || data || IDs

  93. @GanbaroDigital Can you identify the relevant log messages?

  94. @GanbaroDigital Three Tracking Tokens

    1. All log messages from a single business process
    2. All log messages from a single request
    3. All log messages from a single user
  95. @GanbaroDigital X-Request-ID: a GUID header passed in by the API user.

  96. @GanbaroDigital (Request) UID: a random hash generated at the start of each PHP request.
  97. @GanbaroDigital

  98. @GanbaroDigital Auth token: User ID || OAuth Token. You need to know who did what.
  99. @GanbaroDigital This structure works well with the search / filter

    features in log management tools.
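
The three tracking tokens could be gathered once per request along these lines. This is a hedged PHP sketch: the function name `trackingIds()`, the array keys, and the sample values are hypothetical, and mapping the `X-Request-ID` header to the business-process token is my reading of the slides, not something they state outright.

```php
<?php

/**
 * Collect the tracking tokens to attach to every log message.
 *
 * @param array       $server usually $_SERVER
 * @param string|null $userId the authenticated user ID or OAuth token, if known
 */
function trackingIds(array $server, ?string $userId): array
{
    return [
        // Supplied by the API user; PHP exposes the X-Request-ID header
        // under this key. Treat it as untrusted input.
        'x_request_id' => $server['HTTP_X_REQUEST_ID'] ?? null,
        // Random hash generated at the start of each PHP request.
        'request_uid'  => bin2hex(random_bytes(8)),
        // Who did what.
        'user_id'      => $userId,
    ];
}

// Call once during bootstrap and reuse the result for the whole request,
// so every log message from this request carries the same request_uid.
$ids = trackingIds(
    ['HTTP_X_REQUEST_ID' => '0b8f6c1e-5a4d-4c3b-9e2f-7d1a2b3c4d5e'],
    'user-42'
);

echo json_encode($ids), PHP_EOL;
```

With these keys present on every record, a log management tool can filter on one value to pull out everything belonging to a single request, user, or business process.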
  100. @GanbaroDigital Make Handling Disasters Normal

  101. @GanbaroDigital Are we prepared?

  102. @GanbaroDigital “Under pressure, people revert to habit.”

  103. @GanbaroDigital Build those habits into your daily dev work.

  104. @GanbaroDigital Debugging In Dev

  105. @GanbaroDigital How many of you use Xdebug during development?

  106. @GanbaroDigital How many of you run Xdebug in prod?

  107. @GanbaroDigital You can’t run Xdebug in prod. It’s just not practical.

  108. @GanbaroDigital What would you use when you can’t use Xdebug?
  109. @GanbaroDigital Use metrics and logs to debug in development.

  110. @GanbaroDigital Effective logging is an iterative process. Start iterating in

    dev.
  111. @GanbaroDigital Logs complement your tests. They’re not a replacement.

  112. @GanbaroDigital Shipping Under Pressure

  113. @GanbaroDigital How quickly can you ship to prod?

  114. @GanbaroDigital What can you safely bypass when you have to ship in an emergency?

  115. @GanbaroDigital Which day-to-day safeguards will stop you shipping in an emergency?

  116. @GanbaroDigital Do you have the ability to bypass the things you need to?

  117. @GanbaroDigital Can you force your emergency build to top priority on your CI box?

  118. @GanbaroDigital Can you build and ship from your dev box if you had to?

  119. @GanbaroDigital How would you tailor your testing?

  120. @GanbaroDigital Just some food for thought.

  121. @GanbaroDigital

  122. Thank You. Any Questions? A presentation by @stuherbert for @GanbaroDigital