Disaster Averted

Fix problems before they happen with maximum visibility into your application, with practical steps on monitoring, metric collection, tracing & log analysis. To prevent and fix problems it's important to see how your application works from macro down to micro scale. This can be achieved using a variety of free and paid tools from multiple vendors.

The talk explores practical use of AWS CloudWatch, Datadog & New Relic, all of which can be applied across different cloud or on-premises providers, and gives an overview of other available products such as PagerDuty, Loggly, Sentry and Tideways.

Mike Lehan

July 02, 2018



    Availability ◉ Data consistency ◉ Security
    Assuming good practice is already followed in: ◉ Testing ◉ Backups ◉ Authentication
  2. WE WANT TOOLS WHICH PROVIDE US WITH
    Awareness: We need to know if there’s an incident going on, spot conditions leading up to an incident, and be assured that things are working as expected.
    Visibility: If something isn’t right we need ways to isolate affected parts of the application, review relevant history, and deep-dive into specific areas to find the problem.
  3. AWARENESS: THE EVERYTHING’S OK ALARM
    An alarm that doesn’t alarm isn’t a good alarm.
    How often do you test your alarms?
  4. VISIBILITY: THE TOILET DOOR PROBLEM
    Did you actually lock the door? Does the lock work as intended?
    What we can’t see should make us nervous!
  5. Logging: Usually aimed at providing Visibility; some tools use logs to add Awareness.
    Metrics: Primarily give us Awareness, but can add to Visibility.
  6. Alerting: All about Awareness; requires prior knowledge of what might go wrong.
    Dashboards: Generally about Awareness, can be used for Visibility.
    Tracing/APM: Next-level Visibility beyond logs - see exactly what happened.
  7. WHAT DOES CLOUDWATCH OFFER
    ◉ Metrics
    ◉ Dashboards
    ◉ Logging
    ◉ Alerts
    ◉ Events
    Not just for applications hosted on AWS.
    As with all AWS products, pay-as-you-go with a generous free tier.
  8. GRAPH YOUR METRICS
    ◉ Stores data down to minute or second resolution
    ◉ 15 months retention, but less granular over time
    ◉ Map up to 1440 points per graph; aggregate by Max, Min, Avg, Count, Sum and Percentiles
    Bonus: Lets you play games to find unrelated metrics which correlate
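
The 1440-point cap on the slide equals the number of minutes in a day, so a one-day graph can show one-minute data while longer ranges need a coarser period. A minimal sketch of that arithmetic follows. It assumes periods are picked as multiples of 60 seconds, and `min_period_seconds` is a hypothetical helper, not part of any AWS SDK.

```python
import math

MAX_POINTS = 1440  # points-per-graph cap quoted on the slide


def min_period_seconds(range_seconds: int, max_points: int = MAX_POINTS) -> int:
    """Smallest period (a multiple of 60 s) that keeps a graph within max_points."""
    if range_seconds <= 0:
        raise ValueError("range_seconds must be positive")
    raw = range_seconds / max_points
    return max(60, math.ceil(raw / 60) * 60)


one_day = 24 * 60 * 60
one_week = 7 * one_day
print(min_period_seconds(one_day))   # 60
print(min_period_seconds(one_week))  # 420
```

Under these assumptions a week-long graph needs at least a seven-minute period to stay within the cap.
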
  9. USEFUL METRICS FOR AWS USERS
    Load balancers: Number of requests; Target response times; Number of 5xx codes
    EC2: Average/Max CPU utilization; CPU credit balance; RAM Usage? Er, nope...
    Autoscaling: Number of instances (enable this as a special option)
    RDS: CPU utilization; Swap usage; Burst credits (!!!); CPU credit balance
    EFS: Disk I/O credits; Size stored (costly!)
    CloudFront: Number of requests; Bytes transferred; Cache misses; Number of 4xx codes
  10. SYSTEM & APPLICATION METRICS
    CloudWatch Agent: Installs as a service on Linux servers - EC2 or your own server. Configured to collect system level metrics - CPU, RAM, Disk, Network.
    PUT metrics API: Accessible via “awscli” or SDK in PHP. Submit custom metrics with your own namespace. Requests can be made async.
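
As an illustration of submitting a custom metric under your own namespace, here is a minimal sketch in Python rather than the PHP SDK the slide mentions. The metric name `BookingsCompleted`, the `Environment` dimension and the `MyApp` namespace are hypothetical. The send step assumes `boto3` is installed and AWS credentials are configured, so it is defined here but not called.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Count", dimensions=None):
    """Build one entry for the MetricData list of a PutMetricData request."""
    datum = {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": datetime.now(timezone.utc),
    }
    if dimensions:
        datum["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in sorted(dimensions.items())
        ]
    return datum


def put_custom_metric(namespace, datum):
    """Send the datum with boto3; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed

    client = boto3.client("cloudwatch")
    client.put_metric_data(Namespace=namespace, MetricData=[datum])


# Hypothetical metric: one completed booking in production
datum = build_metric_datum(
    "BookingsCompleted", 1, dimensions={"Environment": "production"}
)
print(datum["MetricName"], datum["Value"], datum["Dimensions"])
```

With credentials in place, `put_custom_metric("MyApp", datum)` would publish the value. Keeping the payload builder separate from the network call makes the metric shape easy to check in tests.
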
  11. APPLICATION METRICS ARE POWERFUL
    Constants: What happens often in your application that indicates things are basically working?
    Critical: Are there key transactions that have a large bearing on your business case?
    Errors: Generally the place of logs rather than metrics. Can we turn error logs into metrics?

    your company use BA tools? Metrics can be better & cheaper.
    Market research: You may have valuable data that you’re not even aware of.
    Fun: Every team benefits from a chance to see what their application is doing.
  13. LOG AGGREGATION
    ◉ Log groups and streams - split by use and entity
    ◉ Automatic expiration - save money and reduce search overhead
    ◉ Search by text and date - across all streams in a group
    ◉ Filters turn logs into metrics - by pattern matching
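
To show the idea behind turning logs into metrics by pattern matching, the sketch below counts matching lines per minute locally. It uses a Python regular expression, not CloudWatch's own filter pattern syntax, and the log format and sample lines are hypothetical.

```python
import re
from collections import Counter

# Hypothetical log lines in the format "<ISO timestamp> <LEVEL> <message>"
LOG_LINES = [
    "2018-07-02T10:00:05 INFO request served",
    "2018-07-02T10:00:41 ERROR database timeout",
    "2018-07-02T10:01:12 ERROR database timeout",
    "2018-07-02T10:01:30 INFO request served",
    "2018-07-02T10:01:59 ERROR payment failed",
]

# Capture the timestamp truncated to the minute for ERROR lines only
ERROR_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2} ERROR ")


def errors_per_minute(lines):
    """Count lines matching ERROR_PATTERN, grouped by minute."""
    counts = Counter()
    for line in lines:
        match = ERROR_PATTERN.match(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)


print(errors_per_minute(LOG_LINES))
# {'2018-07-02T10:00': 1, '2018-07-02T10:01': 2}
```

Each per-minute count is the kind of data point a log filter would emit as a metric, which can then be graphed or alarmed on like any other.
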
  14. WHERE DO LOGS COME FROM?
    Internal AWS: Most AWS services don’t provide logs in CloudWatch. RDS lets you export default logs. Lambda logs natively.
    CloudWatch Agent: Two confusingly named agents. Config file chooses files or directories to stream as logs. Auto-rewrite config file to find new log files.
    API/SDK: You can push log entries directly. Not the best experience; the API is intended for pushing logs in batches.
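
Because the logs API is batch-oriented, a client usually groups entries before sending them. The following is a minimal sketch of such a batcher. The limits are those I understand the PutLogEvents API to document (1,048,576 bytes and 10,000 events per batch, with 26 bytes of overhead per event), so check the current AWS documentation before relying on them. The sample events are hypothetical, and other constraints such as the time span of a batch are not handled.

```python
MAX_BATCH_BYTES = 1_048_576   # assumed PutLogEvents batch size limit
MAX_BATCH_EVENTS = 10_000     # assumed PutLogEvents event count limit
EVENT_OVERHEAD_BYTES = 26     # assumed per-event overhead counted towards the size


def batch_log_events(events, max_bytes=MAX_BATCH_BYTES, max_events=MAX_BATCH_EVENTS):
    """Yield lists of {"timestamp": ms, "message": str} events within the limits."""
    batch, batch_bytes = [], 0
    # Events are sorted because each batch is expected in chronological order
    for event in sorted(events, key=lambda e: e["timestamp"]):
        size = len(event["message"].encode("utf-8")) + EVENT_OVERHEAD_BYTES
        if batch and (batch_bytes + size > max_bytes or len(batch) >= max_events):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(event)
        batch_bytes += size
    if batch:
        yield batch


events = [{"timestamp": 1530525600000 + i, "message": f"line {i}"} for i in range(5)]
batches = list(batch_log_events(events, max_events=2))
print([len(b) for b in batches])  # [2, 2, 1]
```

Each yielded batch has the shape expected by the `logEvents` parameter of boto3's `put_log_events`, alongside a log group and stream name of your choosing.
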
  15. WHICH LOGS ARE IMPORTANT?
    Awareness: ◉ Auth log ◉ Apache error log ◉ MySQL error log ◉ MySQL audit log
    Visibility: ◉ Syslog ◉ Dmesg / messages ◉ MySQL general log
    Both: ◉ Apache access log ◉ MySQL slow query log ◉ Your application logs
  16. DON’T MISS A THING
    We have metrics & logs to provide us with awareness.
    If someone isn’t watching our dashboards 24/7 we can still miss something.

    Notification Service
    ◉ Notifications by email, SMS, webhook
    ◉ Emails & SMS are basic
    ◉ Notifications are one time
    ◉ Alerts are cheap - basically free for low volume
    ◉ Webhooks can send to external tools
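
A webhook receiver for these notifications mostly has to unwrap two layers of JSON: the notification envelope and the alarm details carried inside its `Message` string. The sketch below shows that step only. The sample payload is hypothetical and trimmed to the fields used, and a real endpoint would also need to confirm the subscription and verify message signatures, which are omitted here.

```python
import json


def summarise_alarm_notification(body: str) -> str:
    """Turn a notification carrying a CloudWatch alarm into a one-line summary."""
    envelope = json.loads(body)
    if envelope.get("Type") != "Notification":
        raise ValueError(f"unsupported message type: {envelope.get('Type')}")
    alarm = json.loads(envelope["Message"])  # alarm details arrive as a JSON string
    return (
        f"{alarm['AlarmName']}: {alarm['OldStateValue']} -> "
        f"{alarm['NewStateValue']} ({alarm['NewStateReason']})"
    )


# Hypothetical payload, trimmed to the fields used above
sample = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({
        "AlarmName": "high-5xx-rate",
        "OldStateValue": "OK",
        "NewStateValue": "ALARM",
        "NewStateReason": "Threshold crossed",
    }),
})
print(summarise_alarm_notification(sample))
# high-5xx-rate: OK -> ALARM (Threshold crossed)
```

The summary string could then be forwarded to a chat tool or an alert manager, which gives richer handling than the basic email and SMS messages.
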
  18. GENERAL RULE DATADOG vs CLOUDWATCH
    ◉ If CloudWatch does it: Datadog does it & more
    ◉ Exception: metrics from AWS’ own services
    ◉ Different approach to log aggregation
    CloudWatch Agent <-> DogStatsD (based on StatsD)
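
StatsD-style agents accept small plain-text datagrams over UDP, which is what makes submitting a metric from application code cheap. A minimal sketch of the wire format follows, including the tag suffix that DogStatsD adds to plain StatsD. The metric name and tag are hypothetical, and the sender assumes an agent listening on the conventional local port 8125.

```python
import socket


def format_datagram(name, value, metric_type="c", tags=None):
    """Format a StatsD-style datagram, with DogStatsD-style tags if given."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_datagram(datagram, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; assumes an agent is listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# Hypothetical counter incremented once per login
print(format_datagram("app.logins", 1, tags=["env:production"]))
# app.logins:1|c|#env:production
```

Because UDP is connectionless, `send_datagram` does not wait for the agent, so instrumentation adds very little latency to the request being measured.
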
  19. HOW NEW RELIC COMPARES
    ◉ Most established tool but might be struggling to keep up
    ◉ All about the APM; no logging
    ◉ Similar agent approach to Datadog
    ◉ Clunky interface
    ◉ Also includes error monitoring as part of APM
  20. NEVER HEARD OF THIS ONE?
    ◉ Most single-purpose tool so far - just does error monitoring
    ◉ Logs via HTTP calls, so adds minor overhead
    ◉ Integrates into PHP with Composer
    ◉ Also integrates with JS via front-end script
  21. WHERE DOES THIS FIT IN?
    ◉ Can be used to observe ongoing errors
    ◉ Assist handling of errors by the application
    ◉ Allows alerting
    ◉ Can feed event data into Datadog for unified dashboards
  22. ◉ Pure logging tool, can stream logs and understand AWS logs
    ◉ Cool online tail feature allows log observation in real time
  23. ◉ Alert management tool, can feed from other alert platforms
    ◉ Escalate and analyse responses to alerts
  24. ◉ PHP APM based on Xhprof extension
    ◉ Safe to run in production
    ◉ Runs basic call logging and can manually log more detailed traces
  25. AVERTING DISASTERS
    ◉ Gain awareness into what’s happening with our application
    ◉ Observe passively (dashboards) or actively (alerts)
    ◉ Ensure we have visibility to fix problems
    ◉ Be sensible about what you collect
    ◉ Remember security risks of 3rd party services
  26. THANKS!
    Any questions? You can find me at @M1ke / Slack: #phpnw & #og-aws
    StuRents is hiring!
    Useful links
    https://speakerdeck.com/m1ke/disaster-averted
    https://github.com/M1ke/aws-log-checker
    https://github.com/jorgebastida/awslogs
    Bonus: would you be interested in a live demo/workshop on monitoring tools at a conference or user group?
    Presentation template by SlidesCarnival