Monitoring Wonderland

MONITORING WONDERLAND
HELP, WHAT IS HAPPENING?

PAUL SEIFFERT
Team Lead at Jimdo, Cloud Infrastructure Engineer
@seiffertp

WONDERLAND
• Jimdo’s internal PaaS that runs 400 services
• 5000 Docker containers at a time
• ~600 deployments a day

WONDERLAND
[Diagram labels: AWS, Other Service Providers, Infrastructure Automation, APIs, Monitoring, Logging, CLI Tools, Wonderland, Other Tooling]

WONDERLAND
[Diagram labels: Wonderland API, AWS ECS, ECS Agent, Logging Daemon, Metric Daemon, EC2 Instance]

IMAGINE …
• Your team is responsible for the software component that delivers 20 million customer websites
• You are on call tonight

4:00 AM

4:01 AM
Partial outage of the web delivery component

PAGERDUTY CALLS
• either because a health check failed
• or because a metric exceeded a configured threshold

[Diagram labels: Health Checks, Alertmanager, Prometheus]

API HEALTH CHECKS
• All services on Wonderland: Route53 health checks
• Infrastructure components: Pingdom checks

GET /health
HTTP/1.1 200 OK
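
A service could answer such a check with an endpoint like the one in this minimal sketch. It assumes Python's standard-library HTTP server; the names `status_for`, `HealthHandler`, and `make_server` and the response body are illustrative and not taken from Wonderland.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def status_for(path: str) -> int:
    """Return the HTTP status for a request path (hypothetical routing)."""
    return 200 if path == "/health" else 404


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status = status_for(self.path)
        body = b"OK\n" if status == 200 else b"not found\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep the sketch quiet; a real service would log requests


def make_server(port: int = 0) -> HTTPServer:
    """Bind to localhost; port 0 lets the operating system pick a free port."""
    return HTTPServer(("127.0.0.1", port), HealthHandler)
```

Calling `make_server(8080).serve_forever()` would serve the endpoint until interrupted. A real health check would usually also verify the dependencies the service cannot work without.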

WORKER HEALTH CHECKS
• Workers write a metric after each processed message to the Prometheus pushgateway
• For cron jobs, Wonderland automatically notifies cronitor.io about executions
• Dead man’s switch: if not notified for a certain time, an alert is created
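
The dead man's switch idea can be sketched in a few lines. This is an illustration only, not how cronitor.io or Wonderland implement it; the class name, the job names, and the injectable clock are assumptions made so the logic is easy to test.

```python
import time
from typing import Callable, Dict, List


class DeadMansSwitch:
    """Tracks the last check-in per job and reports jobs that went silent."""

    def __init__(self, max_silence_seconds: float,
                 clock: Callable[[], float] = time.time) -> None:
        self.max_silence_seconds = max_silence_seconds
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def notify(self, job: str) -> None:
        """Record that the job just reported in."""
        self._last_seen[job] = self._clock()

    def overdue(self) -> List[str]:
        """Return the jobs whose last check-in is older than the allowed silence."""
        now = self._clock()
        return sorted(job for job, seen in self._last_seen.items()
                      if now - seen > self.max_silence_seconds)
```

A monitoring loop would call `overdue()` periodically and create an alert for every job it returns. Note that the alert fires on the absence of a signal, which is what makes it suitable for workers and cron jobs that have no endpoint to probe.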

SEMANTIC MONITORING / SYNTHETIC MONITORING
Run tests against production periodically, monitor results, and alert on issues
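
As a rough sketch of that loop body: each test is a function that raises when the production system misbehaves, and the runner collects the failures to alert on. The names `Check` and `run_checks` are hypothetical; the deck does not say which tooling Jimdo uses for this.

```python
from typing import Callable, Dict, List

Check = Callable[[], None]  # a check raises an exception when it fails


def run_checks(checks: Dict[str, Check]) -> List[str]:
    """Run every check once and return a message for each one that failed."""
    failures = []
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:  # any failure counts as an issue to alert on
            failures.append(f"{name}: {exc}")
    return failures
```

A scheduler would call `run_checks` at a fixed interval and page someone when the returned list is not empty.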

4:10 AM
Service still running

SERVICE DASHBOARD

GRAFANA
• Each service running on Wonderland automatically has a dashboard showing key metrics for debugging
• Developers can create custom dashboards for more detailed analysis
• Grafana pulls data from Prometheus instances

PROMETHEUS
• Semi-centralized metric system
• Pull-based metric retrieval
• On-the-fly calculation of derived metrics

METRICS
• Infrastructure metrics
• System metrics
• Application metrics

INFRASTRUCTURE METRICS
[Diagram labels: Prometheus, CloudWatch Exporter, AWS, Custom Exporters, Wonderland APIs]

EXAMPLES
aws_autoscaling_group_desired_capacity_average{
  auto_scaling_group_name="crims",
  job="cloudwatch_exporter"
}
aws_elb_request_count_sum{
  cluster="crims",
  job="wonderland_elb_exporter",
  service_name="web-prod"
}

SYSTEM METRICS
[Diagram labels: Prometheus, collectd, cAdvisor]

EXAMPLES
container_memory_rss{
  container_label_cluster="crims",
  container_label_container_name="web-prod--web",
  image="web-prod:abc123",
  instance="10.8.4.91:9104",
  job="crims_cadvisor_metrics"
}
collectd_memory{
  instance="10.8.4.42:9103",
  job="crims_collectd_metrics",
  memory="free"
}

APPLICATION METRICS
[Diagram: Prometheus sends GET /metrics to Container A, Container B, …]
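
What a container returns for GET /metrics is plain text in the Prometheus exposition format. The following sketch renders a counter in that format by hand to show its shape; the function name and the `method` label are illustrative, label-value escaping is left out for brevity, and a real service would more likely use an official client library.

```python
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], float]  # (labels, value)


def render_counter(name: str, help_text: str, samples: List[Sample]) -> str:
    """Render one counter metric in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        # Sort labels so the output is stable between scrapes.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value:g}")
    return "\n".join(lines) + "\n"
```

Prometheus adds the `instance` and `job` labels itself when it scrapes the target, which is why the application does not need to emit them.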

SERVICE DISCOVERY
[Diagram: Wonderland Service Discovery locates containers via the Wonderland API, then updates the Prometheus config and reloads; Prometheus scrapes metrics from Container A, Container B, …]
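
One way such a discovery component could hand targets to Prometheus is sketched below. The deck only says that the config is updated and reloaded, so the use of Prometheus' file-based discovery format (a JSON list of target groups) is an assumption, as are the function names and the shape of the container records.

```python
import json
from typing import Dict, List


def build_target_groups(containers: List[Dict[str, str]]) -> List[dict]:
    """Group container addresses by service into file_sd-style target groups."""
    by_service: Dict[str, List[str]] = {}
    for c in containers:
        by_service.setdefault(c["service"], []).append(f'{c["ip"]}:{c["port"]}')
    return [
        {"targets": sorted(targets), "labels": {"service_name": service}}
        for service, targets in sorted(by_service.items())
    ]


def render_file_sd(containers: List[Dict[str, str]]) -> str:
    """Serialize the target groups as JSON, ready to be written to a file."""
    return json.dumps(build_target_groups(containers), indent=2)
```

In this sketch the container list would come from the Wonderland API; how that API is queried is not shown in the deck and is left out here.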

METRIC RETENTION

http_requests_total{instance="10.8.3.101:80"} = 53
http_requests_total{instance="10.8.3.102:80"} = 81
http_requests_total{instance="10.8.3.103:80"} = 2
...

Automatically generated recording rules:
job:http_requests_total:sum = sum(http_requests_total) without (instance) = 136
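
The aggregation behind that recording rule can be reproduced outside Prometheus to see what "without (instance)" does: the instance label is dropped and series that then share the same remaining labels are summed. The function below is an illustration of that semantics, not Prometheus code.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Series = Tuple[Dict[str, str], float]  # (labels, value)
LabelSet = FrozenSet[Tuple[str, str]]


def sum_without(series: List[Series], drop: str) -> Dict[LabelSet, float]:
    """Sum series values after removing one label, grouped by the remaining labels."""
    totals: Dict[LabelSet, float] = defaultdict(float)
    for labels, value in series:
        remaining = frozenset((k, v) for k, v in labels.items() if k != drop)
        totals[remaining] += value
    return dict(totals)


samples = [
    ({"instance": "10.8.3.101:80"}, 53),
    ({"instance": "10.8.3.102:80"}, 81),
    ({"instance": "10.8.3.103:80"}, 2),
]
result = sum_without(samples, "instance")
```

With the three samples from the slide, `result` holds a single series without labels whose value is 53 + 81 + 2 = 136.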

FEDERATION
[Diagram: the long-term Prometheus (32 days) scrapes filtered metrics from the short-term Prometheus (30 min)]

'match[]':
  - '{job="application_metrics", instance=""}'
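
Federation works over HTTP: the scraping Prometheus requests the `/federate` endpoint of the other instance and passes each selector as a `match[]` query parameter. The sketch below builds such a URL; the host name is hypothetical. A matcher with an empty value, such as `instance=""`, selects series that do not carry that label, which here means the aggregated series.

```python
from typing import List
from urllib.parse import urlencode


def federate_url(base_url: str, selectors: List[str]) -> str:
    """Build the /federate URL with one match[] parameter per selector."""
    query = urlencode({"match[]": selectors}, doseq=True)
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical address of the short-term Prometheus instance.
url = federate_url(
    "http://short-term-prometheus:9090",
    ['{job="application_metrics", instance=""}'],
)
```

In practice this request is not written by hand; it results from a scrape job whose `metrics_path` is `/federate` and whose `params` contain the `match[]` list shown on the slide.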

FEDERATION
[Diagram: the long-term Prometheus scrapes filtered metrics from the short-term Prometheus]

Short-term Prometheus:
http_requests_total{instance="10.8.3.101:80"}
http_requests_total{instance="10.8.3.102:80"}
http_requests_total{instance="10.8.3.103:80"}
...
job:http_requests_total:sum{}

Long-term Prometheus:
job:http_requests_total:sum{}

SERVICE DASHBOARD

4:12 AM
Auto-Scaling broken

LET’S TAKE A LOOK AT THE LOGS

CENTRALISED LOGGING
• Centralised logging is a must-have in a distributed system
• It should be very easy to gather all information that concerns a service

CENTRALISED LOGGING
• Output of all services running on Wonderland is stored centrally
• Optionally, logs are parsed with configurable formats

$ cat wonderland.yaml
---
components:
- name: …
  image: my-nginx-image
  logging:
    types:
    - access_log
    - error_log_nginx
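
Parsing by configured type could look roughly like the sketch below: each logging type maps to a pattern, and lines of that type are turned into structured fields. The type name `access_log` comes from the slide; the regular expression (the common nginx access log layout) and the function name are assumptions, and no parser is defined here for `error_log_nginx`.

```python
import re
from typing import Dict, Optional

# Hypothetical mapping from a configured logging type to a line pattern.
PARSERS = {
    "access_log": re.compile(
        r'(?P<remote_addr>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
        r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    ),
}


def parse_line(log_type: str, line: str) -> Optional[Dict[str, str]]:
    """Return the parsed fields, or None if the type is unknown or the line does not match."""
    pattern = PARSERS.get(log_type)
    match = pattern.match(line) if pattern else None
    return match.groupdict() if match else None
```

Lines that cannot be parsed return None, so a pipeline can still forward them unparsed instead of dropping them.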

CENTRALISED LOGGING
[Diagram: Docker → (fluentd protocol) → Logbeat → (lumberjack protocol) → logz.io]

Wonderland Logbeat
• receives logs via fluent protocol,
• parses them,
• adds metadata,
• and streams them to our logging provider logz.io
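
The "adds metadata" step can be pictured as merging service information into each parsed record before it is shipped. This is only a sketch of the idea; the function name, the `wonderland_` key prefix, and the JSON-lines output are assumptions and not Logbeat's actual behaviour.

```python
import json
from typing import Dict


def enrich(record: Dict[str, str], metadata: Dict[str, str]) -> str:
    """Merge metadata into a parsed record and serialize it as one JSON line."""
    event = dict(record)
    # Prefix metadata keys so they cannot overwrite fields from the log line.
    event.update({f"wonderland_{key}": value for key, value in metadata.items()})
    return json.dumps(event, sort_keys=True)
```

With metadata such as the service and component name attached to every event, all logs that concern one service can be found with a single query at the logging provider.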

4:17 AM
You find this log message of the service autoscaler:
Unable to scale-out service “web-delivery”. Configured maximum number of instances reached.

4:17 AM
You increase the maximum number of instances:

$ cat wonderland.yaml
[…]
auto-scaling:
  min-instances: 60
  max-instances: 150

4:20 AM
Back to bed

2:00 PM
In the PMA (post-mortem analysis) for last night’s incident, you create this action item:
Monitor the number of instances of web-delivery to detect potential breaches of auto-scaling limits before they affect the system’s health
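
The check behind such an action item can be as small as comparing the current instance count with the configured maximum. The sketch below is one possible formulation; the function name and the 90% warning ratio are assumptions, not values from the talk.

```python
def scaling_headroom_alert(current: int, max_instances: int,
                           warn_ratio: float = 0.9) -> bool:
    """Return True when the service runs close to its auto-scaling limit."""
    if max_instances <= 0:
        raise ValueError("max_instances must be positive")
    return current / max_instances >= warn_ratio
```

With the limits from the slide, 60 running instances out of 150 would not alert, while 150 out of 150 would. Alerting before the limit is reached leaves time to raise it during working hours instead of at 4:00 AM.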

QUESTIONS?

THANK YOU

FURTHER READING / SOURCES
• Beyer, Jones, Petoff & Murphy: Site Reliability Engineering
• Susan Fowler: Production-Ready Microservices
• Sam Newman: Building Microservices
• Stripe / Increment: On-Call (https://increment.com/on-call/)
• Mathias Lafeldt & Paul Seiffert: A Journey Through Wonderland (https://speakerdeck.com/mlafeldt/a-journey-through-wonderland)

PHOTOS
• Marcel Stockmann: https://www.flickr.com/photos/marcelstockmann/33068471286
• Michael Theis: https://www.flickr.com/photos/huskyte/6931056896
