4
Observability and resiliency patterns
in the cloud
Presented by Sara Gerion
@Sarutule
5
What is observability?
6
What is resiliency?
7
Let’s take a simple backend service
8
✅ HTTP endpoint
9
✅ Running on Node.js - Express.js
10
✅ Database
11
My simple backend service
12
Translation in AWS-ese
13
Scenario: failures in your runtime application
14
Scenario: failures in your runtime application
Database connection issue? Missing environment variable?
15
Logging gives us information related to the
errors and operations of an application
16
Logs in JSON format
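A minimal sketch of JSON-formatted logging, assuming a Node.js service that writes one JSON object per line to stdout. The function names, the level names, and the example fields (`host`, `retry`) are illustrative assumptions, not taken from the talk.

```typescript
// Severity levels used by this sketch (an assumption, not from the slides).
type LogLevel = "debug" | "info" | "warn" | "error";

// Build one log entry as a single-line JSON string.
// `now` is injectable so the output can be checked deterministically.
function formatLog(
  level: LogLevel,
  message: string,
  context: Record<string, unknown> = {},
  now: Date = new Date()
): string {
  return JSON.stringify({
    timestamp: now.toISOString(),
    level,
    message,
    ...context,
  });
}

// Write the entry to stdout, where a log collector can pick it up.
function log(level: LogLevel, message: string, context: Record<string, unknown> = {}): void {
  console.log(formatLog(level, message, context));
}

// Hypothetical failure from the scenario above.
log("error", "Database connection failed", { host: "db.internal.example", retry: 2 });
```

Because every entry is a JSON object, fields such as `level` or `host` can be filtered and aggregated by the log backend instead of being parsed out of free text.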
19
Sample logging enabled on stage & production
environments
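A minimal sketch of one way to read this slide, assuming "sample logging" means emitting detailed logs for only a fraction of requests. The per-environment rates and the function names are hypothetical.

```typescript
// Decide whether a detailed log line should be emitted.
// `sampleRate` is the fraction of calls to keep (0 = none, 1 = all).
// `random` is injectable so the decision can be tested deterministically.
function shouldSample(sampleRate: number, random: () => number = Math.random): boolean {
  return random() < sampleRate;
}

// Hypothetical per-environment rates; tune them to your own traffic.
const sampleRates: Record<string, number> = {
  stage: 0.5,
  production: 0.1,
};

// Emit a debug entry only for the sampled fraction of calls.
function debugLog(environment: string, message: string): void {
  const rate = sampleRates[environment] ?? 0;
  if (shouldSample(rate)) {
    console.log(JSON.stringify({ level: "debug", environment, message }));
  }
}

debugLog("production", "Fetched 3 rows from the orders table");
```

Sampling keeps log volume and cost bounded while still leaving enough detailed entries to investigate a failure.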
20
Log correlation IDs for traceability
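A minimal sketch of correlation-ID propagation, written without any web framework so it stays self-contained. The header name `x-correlation-id`, the ID format, and all function names are assumptions for illustration. A real service would likely use a UUID generator instead of the simple one shown here.

```typescript
type HeaderMap = Record<string, string | undefined>;

// Header name is an assumption; any agreed-upon name works.
const CORRELATION_HEADER = "x-correlation-id";

// Non-cryptographic ID generator, good enough for a sketch.
function newCorrelationId(): string {
  return Date.now().toString(36) + "-" + Math.random().toString(36).slice(2, 10);
}

// Reuse the caller's ID when present, otherwise start a new trace.
function resolveCorrelationId(incoming: HeaderMap): string {
  return incoming[CORRELATION_HEADER] ?? newCorrelationId();
}

// Every log line and every outbound request carries the same ID.
function handleRequest(incoming: HeaderMap): { logLine: string; outbound: HeaderMap } {
  const correlationId = resolveCorrelationId(incoming);
  const logLine = JSON.stringify({ level: "info", message: "Request received", correlationId });
  const outbound: HeaderMap = { [CORRELATION_HEADER]: correlationId };
  return { logLine, outbound };
}

const result = handleRequest({ "x-correlation-id": "abc-123" });
console.log(result.logLine);
```

Searching the logs of every service for one ID then shows the full path of a single request.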
22
Logging: check ✓
23
Scenario: failures in your underlying infrastructure
24
Metrics give us information about the state of
the underlying infrastructure via discrete values
collected over a specific period of time
25
CloudWatch Metrics
26
Metrics
USEFUL STRATEGIES
Capture missing data
▪ Missing logs
▪ Incorrect number of running resources
HTTP requests
▪ Latency of inbound & outbound requests
▪ HTTP status
▪ Missing HTTP status (error at the TCP / connection layer)
Don’t forget to study your data before getting started!
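The HTTP strategies on this slide can be sketched as follows. This is a minimal in-memory example, assuming latency is measured in milliseconds and that a call with no HTTP status failed at the connection layer. The `HttpMetrics` class is a hypothetical stand-in for a real metrics backend, not a CloudWatch API.

```typescript
// One observation of an HTTP call. `status` is undefined when the call
// failed before any HTTP response arrived (for example a TCP error).
interface HttpObservation {
  direction: "inbound" | "outbound";
  latencyMs: number;
  status?: number;
}

// In-memory stand-in for a metrics backend (hypothetical).
class HttpMetrics {
  private readonly observations: HttpObservation[] = [];

  record(observation: HttpObservation): void {
    this.observations.push(observation);
  }

  // Count observations per status code, with a separate bucket for
  // calls that never produced a status.
  countByStatus(): Record<string, number> {
    const counts: Record<string, number> = {};
    for (const o of this.observations) {
      const key = o.status === undefined ? "no_status" : String(o.status);
      counts[key] = (counts[key] ?? 0) + 1;
    }
    return counts;
  }

  // Average latency in milliseconds, or undefined when nothing was recorded.
  averageLatencyMs(): number | undefined {
    if (this.observations.length === 0) return undefined;
    const total = this.observations.reduce((sum, o) => sum + o.latencyMs, 0);
    return total / this.observations.length;
  }
}

const metrics = new HttpMetrics();
metrics.record({ direction: "inbound", latencyMs: 40, status: 200 });
metrics.record({ direction: "outbound", latencyMs: 120, status: 503 });
metrics.record({ direction: "outbound", latencyMs: 3000 }); // connection error, no status
console.log(metrics.countByStatus(), metrics.averageLatencyMs());
```

Keeping the `no_status` bucket separate makes connection-level failures visible instead of letting them disappear from status-code charts.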
27
Metrics: check ✓
28
Alerts tell us about any anomaly happening
in our system
29
Monitors & alerts
Missing data
▪ Notify if service unavailable
Definition of different alert types
▪ Custom recipients and custom medium based on severity
▪ Better escalation & visibility for stakeholders
Study your data (should this be a metric or the context of a metric?)
▪ Better metrics filtering
▪ Less data pollution
Also monitor the stage environment
▪ Get notified of errors before they appear on prod
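Two of the points on this slide, alerting on missing data and routing by severity, can be sketched as follows. The severity names, recipients, media, and the five-minute silence window are all hypothetical values chosen for illustration.

```typescript
type Severity = "low" | "high" | "critical";

// Hypothetical routing table: recipients and medium per severity.
interface Route {
  medium: "email" | "chat" | "pager";
  recipients: string[];
}

const routes: Record<Severity, Route> = {
  low: { medium: "email", recipients: ["team@example.com"] },
  high: { medium: "chat", recipients: ["#backend-alerts"] },
  critical: { medium: "pager", recipients: ["on-call engineer", "engineering manager"] },
};

// Treat missing data as a signal: if no data point arrived within the
// allowed window, the service may be unavailable.
function isDataMissing(lastDataPointAt: number, now: number, maxSilenceMs: number): boolean {
  return now - lastDataPointAt > maxSilenceMs;
}

// Describe where an alert of the given severity would be delivered.
function routeAlert(severity: Severity, message: string): string {
  const route = routes[severity];
  return `[${severity}] ${message} -> ${route.medium}: ${route.recipients.join(", ")}`;
}

// Fixed clock values (milliseconds) keep the example deterministic.
const lastSeen = 0;
const currentTime = 600000;
if (isDataMissing(lastSeen, currentTime, 300000)) {
  console.log(routeAlert("critical", "No metrics received for 5 minutes"));
}
```

Keeping the routing table as data makes it easy to review with stakeholders which severity reaches whom, and through which medium.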
30
Alerts: check ✓
31
While tackling resiliency,
we need to talk about AWS infrastructure
33
Where is my service, physically?
34
Scenario: outage of AZ eu-central-1a
35
Multi AZ replication
36
Load balancing & data replication
37
Can we simplify this?
38
Simplifying with serverless
39
Pros
Highly scalable
40
Pros
Resilient
41
Pros
Ad-hoc logs and metrics
42
Simplifying with serverless - even when multi-region!
44
Resiliency patterns
◆ Redundancy of computing units
◆ Replication of data
◆ Load balancing
◆ DNS routing
◆ Delivery at the edge
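To illustrate how redundancy and load balancing work together, here is a minimal round-robin sketch that skips unhealthy targets. The `RoundRobinBalancer` class and the target IDs are hypothetical, the zone names only echo the eu-central-1a outage scenario, and a managed load balancer would perform the health checks itself.

```typescript
// A redundant computing unit running in one availability zone.
interface Target {
  id: string;
  zone: string;
  healthy: boolean;
}

class RoundRobinBalancer {
  private next = 0;
  private readonly targets: Target[];

  constructor(targets: Target[]) {
    this.targets = targets;
  }

  // Return the next healthy target in rotation, or undefined when all are down.
  pick(): Target | undefined {
    for (let i = 0; i < this.targets.length; i++) {
      const candidate = this.targets[(this.next + i) % this.targets.length];
      if (candidate && candidate.healthy) {
        this.next = (this.next + i + 1) % this.targets.length;
        return candidate;
      }
    }
    return undefined;
  }
}

// Simulated outage of eu-central-1a: traffic rotates over the other zones.
const balancer = new RoundRobinBalancer([
  { id: "service-a", zone: "eu-central-1a", healthy: false },
  { id: "service-b", zone: "eu-central-1b", healthy: true },
  { id: "service-c", zone: "eu-central-1c", healthy: true },
]);
console.log(balancer.pick()?.zone);
```

With at least one healthy unit in another zone, the outage of a single zone reduces capacity but does not take the service down.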
45
Useful questions during design phase
◆ What can go wrong?
◆ How should my system react in case of that specific failure?
◆ Who should be notified, and how?
◆ How can we limit the impact for customers?