Disaster Averted - Speaker Deck

Slide 1

Slide 1 text

Disaster Averted: Tools to increase visibility & solve problems

Slide 2

Slide 2 text

HELLO! I’m Mike Lehan. Software engineer, CTO of StuRents.com, skydiver, northerner. Follow me on Twitter @M1ke

Slide 3

Slide 3 text

0. Let’s Recap: What are we trying to achieve?

Slide 4

Slide 4 text

WHAT DISASTERS ARE WE DESIGNING FOR?
◉ Spotting issues
◉ Availability
◉ Data consistency
◉ Security
Assuming good practice is already followed in:
◉ Testing
◉ Backups
◉ Authentication

Slide 5

Slide 5 text

WE WANT TOOLS WHICH PROVIDE US WITH
Awareness: We need to know if there’s an incident going on, spot conditions leading up to an incident, and be assured that things are working as expected.
Visibility: If something isn’t right, we need ways to isolate affected parts of the application, review relevant history, and deep-dive into specific areas to find the problem.

Slide 6

Slide 6 text

AWARENESS: THE EVERYTHING’S OK ALARM
An alarm that doesn’t alarm isn’t a good alarm.
How often do you test your alarms?
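One way to make silence itself detectable is a heartbeat check: the application reports "still OK" regularly, and the alarm fires when the reports stop. A minimal sketch of that idea, assuming timestamps in seconds and a hypothetical `max_silence` threshold; none of these names come from the talk or from any monitoring tool.

```python
def heartbeat_state(last_heartbeat: float, now: float, max_silence: float = 180.0) -> str:
    """Return "OK" while heartbeats keep arriving and "ALARM" once they stop."""
    if now < last_heartbeat:
        raise ValueError("now must not be earlier than last_heartbeat")
    # Silence longer than the threshold is treated as a failure, not as "no news".
    return "OK" if now - last_heartbeat <= max_silence else "ALARM"


print(heartbeat_state(last_heartbeat=1_000.0, now=1_060.0))  # OK
print(heartbeat_state(last_heartbeat=1_000.0, now=1_600.0))  # ALARM
```

Stopping the heartbeat on purpose is also a cheap way to test that the alarm really fires.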

Slide 7

Slide 7 text

VISIBILITY: THE TOILET DOOR PROBLEM
Did you actually lock the door?
Does the lock work as intended?
What we can’t see should make us nervous!

Slide 8

Slide 8 text

Logging: Usually aimed at providing Visibility; some tools use logs to add Awareness
Metrics: Primarily give us Awareness, but can add to Visibility

Slide 9

Slide 9 text

Alerting: All about Awareness; requires prior knowledge of what might go wrong
Dashboards: Generally about Awareness; can be used for Visibility
Tracing/APM: Next level Visibility beyond logs - see exactly what happened

Slide 10

Slide 10 text

1. AWS CloudWatch: Complex, comprehensive, available and low cost

Slide 11

Slide 11 text

WHAT DOES CLOUDWATCH OFFER?
◉ Metrics
◉ Dashboards
◉ Logging
◉ Alerts
◉ Events
Not just for applications hosted on AWS
As with all AWS products, pay-as-you-go with a generous free tier

Slide 12

Slide 12 text

“SIMPLE” PRICING? Only for those with a degree in WTF

Slide 13

Slide 13 text

GRAPH YOUR METRICS
◉ Stores data down to minute or second resolution
◉ 15 months retention, but less granular over time
◉ Map up to 1440 points per graph; aggregate by Max, Min, Avg, Count, Sum and Percentiles
Bonus: Lets you play games to find unrelated metrics which correlate
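The 1440-point limit implies a trade-off between time range and resolution. A minimal sketch of the arithmetic, assuming periods are whole multiples of 60 seconds; the function name is illustrative and not part of any CloudWatch API.

```python
MAX_POINTS_PER_GRAPH = 1440  # limit quoted on the slide


def smallest_period_seconds(range_seconds: int, base_period: int = 60) -> int:
    """Return the smallest multiple of base_period that keeps the number
    of datapoints in the time range at or below the graph limit."""
    if range_seconds <= 0:
        raise ValueError("range_seconds must be positive")
    # Ceiling division: shortest period with range / period <= limit.
    raw = -(-range_seconds // MAX_POINTS_PER_GRAPH)
    # Round up to a whole multiple of the base period.
    return max(base_period, -(-raw // base_period) * base_period)


one_day = 24 * 60 * 60
one_week = 7 * one_day
print(smallest_period_seconds(one_day))   # 60: one day of 1-minute points is exactly 1440
print(smallest_period_seconds(one_week))  # 420: a week needs 7-minute buckets
```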

Slide 14

Slide 14 text

USEFUL METRICS FOR AWS USERS
Load balancers
◉ Number of requests
◉ Target response times
◉ Number of 5xx codes
EC2
◉ Average/Max CPU utilization
◉ CPU credit balance
◉ RAM usage? Er, nope...
Autoscaling
◉ Number of instances (enable this as a special option)
RDS
◉ CPU utilization
◉ Swap usage
◉ Burst credits (!!!)
◉ CPU credit balance
EFS
◉ Disk I/O credits
◉ Size stored (costly!)
CloudFront
◉ Number of requests
◉ Bytes transferred
◉ Cache misses
◉ Number of 4xx codes

Slide 15

Slide 15 text

SYSTEM & APPLICATION METRICS
CloudWatch Agent
◉ Installs as a service on Linux servers - EC2 or your own server
◉ Configured to collect system level metrics - CPU, RAM, Disk, Network
PUT metrics API
◉ Accessible via “awscli” or SDK in PHP
◉ Submit custom metrics with your own namespace
◉ Requests can be made async
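A custom metric submission could look roughly like the sketch below. It uses Python and boto3 rather than the PHP SDK mentioned above, purely for illustration. The namespace, metric name and dimension are hypothetical, and the actual send is wrapped in a function that is not called here because it needs AWS credentials.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Count", dimensions=None):
    """Build one entry for the MetricData list of a PutMetricData request."""
    return {
        "MetricName": name,
        "Dimensions": [
            {"Name": key, "Value": val}
            for key, val in sorted((dimensions or {}).items())
        ],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


def put_metric(namespace, datum):
    """Send the datum with boto3 (requires credentials; not run in this sketch)."""
    import boto3  # imported lazily so the sketch loads without the SDK installed

    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace, MetricData=[datum]
    )


# Hypothetical metric and dimension names, for illustration only.
datum = build_metric_datum(
    "BookingsCompleted", 1, dimensions={"Environment": "production"}
)
print(datum["MetricName"], datum["Value"], datum["Unit"])
```

Calling `put_metric("MyApp", datum)` would then publish the value under a namespace of your choosing.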

Slide 16

Slide 16 text

APPLICATION METRICS ARE POWERFUL
Constants: What happens a lot in your application that indicates things are basically working?
Critical: Are there key transactions that have a large bearing on your business case?
Errors: Generally the place of logs rather than metrics. Can we turn error logs into metrics?

Slide 17

Slide 17 text

APPLICATION METRICS ARE NOT JUST FOR ENGINEERS
Business analysis: Does your company use BA tools? Metrics can be better & cheaper.
Market research: You may have valuable data that you’re not even aware of.
Fun: Every team benefits from a chance to see what their application is doing.

Slide 18

Slide 18 text

LOG AGGREGATION
◉ Log groups and streams - split by use and entity
◉ Automatic expiration - save money and reduce search overhead
◉ Search by text and date - across all streams in a group
◉ Filters turn logs into metrics - by pattern matching
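The idea behind turning logs into metrics can be sketched locally: match each line against a pattern and count the matches per time bucket. This is a simplification using a Python regular expression, not CloudWatch's own filter pattern syntax. The log format and sample lines are assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical application log lines; the format is an assumption.
LOG_LINES = [
    "2019-01-01T10:00:01Z INFO  request served path=/search",
    "2019-01-01T10:00:02Z ERROR payment gateway timeout",
    "2019-01-01T10:00:40Z ERROR payment gateway timeout",
    "2019-01-01T10:01:05Z INFO  request served path=/listing",
    "2019-01-01T10:01:30Z ERROR database connection lost",
]

# Capture the timestamp truncated to the minute, then the log level.
LINE_RE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}Z\s+(?P<level>\w+)\s"
)


def count_matches_per_minute(lines, level="ERROR"):
    """Count lines at the given level, bucketed by minute."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("level") == level:
            counts[match.group("minute")] += 1
    return dict(counts)


errors_per_minute = count_matches_per_minute(LOG_LINES)
print(errors_per_minute)
```

Each bucket count is the value a filter-derived metric would report for that minute, which can then be graphed or alarmed on like any other metric.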

Slide 19

Slide 19 text

WHERE DO LOGS COME FROM?
Internal AWS
◉ Most AWS services don’t provide logs in CloudWatch
◉ RDS lets you export default logs
◉ Lambda logs natively
CloudWatch Agent
◉ Two confusingly named agents
◉ Config file chooses files or directories to stream as logs
◉ Auto-rewrite config file to find new log files
API/SDK
◉ You can push log entries directly
◉ Not the best experience; the API is intended to be used to push logs in batches
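Because the API expects batches, an application pushing logs directly needs to group entries before sending. A minimal sketch of that grouping, assuming events shaped like PutLogEvents entries (`timestamp` in milliseconds, `message` as text). The limits are the documented values as I understand them and should be checked against current AWS documentation; the function name is illustrative.

```python
MAX_BATCH_BYTES = 1_048_576   # assumed PutLogEvents batch size limit
MAX_BATCH_EVENTS = 10_000     # assumed maximum events per batch
EVENT_OVERHEAD_BYTES = 26     # assumed per-event overhead counted towards the limit


def batch_log_events(events, max_bytes=MAX_BATCH_BYTES, max_events=MAX_BATCH_EVENTS):
    """Yield lists of events, in timestamp order, that each fit the limits."""
    batch, batch_bytes = [], 0
    for event in sorted(events, key=lambda e: e["timestamp"]):
        size = len(event["message"].encode("utf-8")) + EVENT_OVERHEAD_BYTES
        # Start a new batch when adding this event would exceed either limit.
        if batch and (batch_bytes + size > max_bytes or len(batch) >= max_events):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(event)
        batch_bytes += size
    if batch:
        yield batch


events = [{"timestamp": 1_000 + i, "message": f"line {i}"} for i in range(5)]
batches = list(batch_log_events(events, max_events=2))
print([len(b) for b in batches])  # [2, 2, 1]
```

Each yielded list could then be passed as the `logEvents` argument of a single request. An oversized single event still ends up in a batch of its own here, so a real sender would need to truncate or reject it first.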

Slide 20

Slide 20 text

WHICH LOGS ARE IMPORTANT?
Awareness
◉ Auth log
◉ Apache error log
◉ MySQL error log
◉ MySQL audit log
Visibility
◉ Syslog
◉ Dmesg / messages
◉ MySQL general log
Both
◉ Apache access log
◉ MySQL slow query log
◉ Your application logs

Slide 21

Slide 21 text

WATCH YOUR METRICS WITH DASHBOARDS

Slide 22

Slide 22 text

DON’T MISS A THING
We have metrics & logs to provide us with awareness.
If someone isn’t watching our dashboards 24/7 we can still miss something.

Slide 23

Slide 23 text

ALERTS NEED YOU TO PREDICT WHAT MIGHT INDICATE WRONGNESS

Slide 24

Slide 24 text

LIKE MUCH OF AWS, ALERT HANDLING IS SELF-SERVICE
◉ Simple Notification Service
◉ Notifications by email, SMS, webhook
◉ Emails & SMS are basic
◉ Notifications are one time
◉ Alerts are cheap - basically free for low volume
◉ Webhooks can send to external tools
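A webhook receiver has to unwrap two layers: the SNS envelope, and the alarm details that SNS carries as a JSON string in its `Message` field. A minimal sketch of that unwrapping, assuming the usual field names for SNS HTTP notifications of CloudWatch alarms. The sample payload and the function name are made up for illustration, and signature verification is left out.

```python
import json


def summarise_sns_body(body: str) -> str:
    """Turn the body of an SNS HTTP POST into a one-line summary."""
    envelope = json.loads(body)
    if envelope.get("Type") == "SubscriptionConfirmation":
        # SNS expects the endpoint to visit SubscribeURL before delivering alerts.
        return f"confirm subscription at {envelope['SubscribeURL']}"
    alarm = json.loads(envelope["Message"])  # alarm details are JSON inside a string
    return (
        f"{alarm['AlarmName']}: {alarm['OldStateValue']} -> "
        f"{alarm['NewStateValue']} ({alarm['NewStateReason']})"
    )


# Hypothetical alarm notification, trimmed to the fields used above.
sample = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({
        "AlarmName": "high-5xx-rate",
        "OldStateValue": "OK",
        "NewStateValue": "ALARM",
        "NewStateReason": "Threshold Crossed",
    }),
})
print(summarise_sns_body(sample))  # high-5xx-rate: OK -> ALARM (Threshold Crossed)
```

Since notifications are sent once per state change, any repeat or escalation behaviour would have to live in whatever receives the webhook.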

Slide 25

Slide 25 text

2. Datadog: A leader in monitoring tools, simple pricing, loads of integrations

Slide 27

Slide 27 text

GENERAL RULE: DATADOG vs CLOUDWATCH
◉ If CloudWatch does it: Datadog does it & more
◉ Exception: metrics from AWS’ own services
◉ Different approach to log aggregation
CloudWatch Agent <-> DogStatsD (based on StatsD)
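StatsD-style agents receive metrics as short text datagrams, usually over UDP, which is why submitting one adds very little overhead to a request. A minimal sketch of the DogStatsD line format, assuming a local agent on the conventional port 8125. The metric name and tag are hypothetical, and the send function is defined but not called here.

```python
import socket


def format_dogstatsd(name, value, metric_type="c", tags=None, sample_rate=None):
    """Format one metric as a DogStatsD datagram: name:value|type|@rate|#tags."""
    parts = [f"{name}:{value}", metric_type]
    if sample_rate is not None:
        parts.append(f"@{sample_rate}")
    if tags:
        parts.append("#" + ",".join(tags))
    return "|".join(parts)


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a local agent (not called in this sketch)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# Hypothetical counter with one tag.
line = format_dogstatsd("bookings.completed", 1, tags=["env:production"])
print(line)  # bookings.completed:1|c|#env:production
```

In practice an official client library would build these lines for you; the sketch only shows what travels over the wire.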

Slide 28

Slide 28 text

SIMPLER PRICING Generally higher cost

Slide 29

Slide 29 text

SIMPLER PRICING Still kind of confusing...

Slide 30

Slide 30 text

TWO TYPES OF DASHBOARDS

Slide 31

Slide 31 text

TIME SERIES TO SPOT ANOMALIES

Slide 32

Slide 32 text

SUPER POWERFUL LOG ANALYSIS

Slide 33

Slide 33 text

ANALYSE AWS LOGS BETTER THAN AWS

Slide 34

Slide 34 text

ALERTS THAT LEARN

Slide 35

Slide 35 text

APPLICATION PERFORMANCE MONITORING (APM)? WELL NOT QUITE

Slide 36

Slide 36 text

KEEN TO SEE OFF THE COMPETITION

Slide 37

Slide 37 text

3. New Relic: The one with the free t-shirts

Slide 38

Slide 38 text

SUPER PRICEY APM Oddly cheap infrastructure

Slide 39

Slide 39 text

PHP APM WITH BONUS JARGON

Slide 40

Slide 40 text

EASY APM TO INTEGRATE Name your route and go

Slide 41

Slide 41 text

HOW NEW RELIC COMPARES
◉ Most established tool but might be struggling to keep up
◉ All about the APM; no logging
◉ Similar agent approach to Datadog
◉ Clunky interface
◉ Also includes error monitoring as part of APM

Slide 42

Slide 42 text

4. Sentry: Super simple error monitor

Slide 43

Slide 43 text

NEVER HEARD OF THIS ONE?
◉ Most single-purpose tool so far - just does error monitoring
◉ Logs via HTTP calls, so it adds minor overhead
◉ Integrates into PHP with Composer
◉ Also integrates with JS via a front end script

Slide 44

Slide 44 text

RECORDS AND GROUPS ERRORS

Slide 45

Slide 45 text

DETAILED ERROR INFO STACK TRACES & BREADCRUMBS

Slide 46

Slide 46 text

GOOD PRICING TO START PAYG in between tiers

Slide 47

Slide 47 text

WHERE DOES THIS FIT IN?
◉ Can be used to observe ongoing errors
◉ Assist handling of errors by the application
◉ Allows alerting
◉ Can feed event data into Datadog for unified dashboards

Slide 48

Slide 48 text

5. And the rest: No end of tools or ways to combine them

Slide 49

Slide 49 text

◉ Pure logging tool, can stream logs and understand AWS logs
◉ Cool online tail feature allows log observation in real time

Slide 50

Slide 50 text

◉ Alert managing tool, can feed from other alert platforms
◉ Escalate and analyse responses to alerts

Slide 51

Slide 51 text

◉ PHP APM based on Xhprof extension
◉ Safe to run in production
◉ Runs basic call logging and can manually log more detailed traces

Slide 52

Slide 52 text

6. In summary: Will these tools actually help us avert disaster?

Slide 53

Slide 53 text

AVERTING DISASTERS
◉ Gain awareness into what’s happening with our application
◉ Observe passively (dashboards) or actively (alerts)
◉ Ensure we have visibility to fix problems
◉ Be sensible about what you collect
◉ Remember security risks of 3rd party services

Slide 54

Slide 54 text

THANKS! Any questions?
You can find me at @M1ke / Slack: #phpnw & #og-aws
StuRents is hiring!
Useful links
https://speakerdeck.com/m1ke/disaster-averted
https://github.com/M1ke/aws-log-checker
https://github.com/jorgebastida/awslogs
Bonus: would you be interested in a live demo/workshop on monitoring tools at a conference or user group?
Presentation template by SlidesCarnival