Monitoring Wonderland - Speaker Deck

Slide 1

Slide 1 text

MONITORING WONDERLAND
HELP, WHAT IS HAPPENING?

Slide 2

Slide 2 text

PAUL SEIFFERT
Team Lead at Jimdo
Cloud Infrastructure Engineer
@seiffertp

Slide 3

Slide 3 text

WONDERLAND

Slide 4

Slide 4 text

WONDERLAND
• Jimdo's internal PaaS that runs 400 services
• 5000 Docker containers at a time
• ~600 deployments a day

Slide 5

Slide 5 text

WONDERLAND
[Diagram labels: AWS, other service providers, infrastructure automation, APIs, monitoring and logging, CLI tools, Wonderland, other tooling]

Slide 6

Slide 6 text

WONDERLAND
[Diagram labels: Wonderland API, AWS ECS, EC2 instance running the ECS agent, logging daemon, and metric daemon]

Slide 7

Slide 7 text

IMAGINE…
• Your team is responsible for the software component that delivers 20 million customer websites
• You are on call tonight

Slide 8

Slide 8 text

4:00 AM

Slide 9

Slide 9 text

4:01 AM

Slide 10

Slide 10 text

4:01 AM
Partial outage of the web delivery component

Slide 11

Slide 11 text

PAGERDUTY CALLS
• either because a health check failed
• or because a metric exceeded a configured threshold

Slide 12

Slide 12 text

[Diagram labels: health checks, Alert Manager, Prometheus]

Slide 13

Slide 13 text

API HEALTH CHECKS
• All services on Wonderland: Route53 health checks
• Infrastructure components: Pingdom checks
GET /health
HTTP/1.1 200 OK
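
A health endpoint of this kind can be very small. The following is a minimal Python sketch using only the standard library; the talk does not show how Wonderland services implement `/health`, so the helper name `health_response`, the response body, and the port are assumptions made for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_response(path):
    """Return (status code, body) for a request path (hypothetical helper)."""
    if path == "/health":
        return 200, b"OK"
    return 404, b"Not Found"


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET requests using health_response()."""

    def do_GET(self):
        status, body = health_response(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block forever, serving health checks on the given (assumed) port."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

An external checker such as Route53 or Pingdom would then only need to request `/health` and treat anything other than a 200 response as a failure.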

Slide 14

Slide 14 text

WORKER HEALTH CHECKS
• Workers write a metric after each processed message to the Prometheus pushgateway
• For cron jobs, Wonderland automatically notifies cronitor.io about executions
• Dead man's switch: if not notified for a certain time, an alert is created
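
The dead man's switch idea can be sketched in a few lines of Python. This is an illustration of the principle only, not how the pushgateway or cronitor.io implement it; the class name `DeadMansSwitch` and its parameters are hypothetical.

```python
import time


class DeadMansSwitch:
    """Fires when no notification has arrived within timeout_seconds."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock          # injectable so the switch can be tested
        self._last_seen = clock()    # creation counts as the first notification

    def notify(self):
        """Called by the worker or cron job after each successful run."""
        self._last_seen = self._clock()

    def is_firing(self):
        """True when the last notification is older than the timeout."""
        return self._clock() - self._last_seen > self.timeout_seconds
```

The important property is that silence triggers the alert: a worker that crashes, hangs, or is never scheduled stops calling `notify()`, and the switch fires without the worker having to report its own failure.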

Slide 15

Slide 15 text

SEMANTIC MONITORING / SYNTHETIC MONITORING
Run tests against production periodically, monitor results, and alert on issues

Slide 16

Slide 16 text

4:10 AM
Service still running

Slide 17

Slide 17 text

SERVICE DASHBOARD

Slide 18

Slide 18 text

GRAFANA
• Each service running on Wonderland automatically has a dashboard showing key metrics for debugging
• Developers can create custom dashboards for more detailed analysis
• Grafana pulls data from Prometheus instances

Slide 19

Slide 19 text

PROMETHEUS
• Semi-centralized metric system
• Pull-based metric retrieval
• On-the-fly calculation of derived metrics

Slide 20

Slide 20 text

METRICS
• Infrastructure metrics
• System metrics
• Application metrics

Slide 21

Slide 21 text

INFRASTRUCTURE METRICS
[Diagram labels: Prometheus, CloudWatch exporter, AWS, custom exporters, Wonderland APIs]

Slide 22

Slide 22 text

EXAMPLES
aws_autoscaling_group_desired_capacity_average{
  auto_scaling_group_name="crims",
  job="cloudwatch_exporter"
}
aws_elb_request_count_sum{
  cluster="crims",
  job="wonderland_elb_exporter",
  service_name="web-prod"
}

Slide 23

Slide 23 text

SYSTEM METRICS
[Diagram labels: Prometheus, collectd, cAdvisor]

Slide 24

Slide 24 text

EXAMPLES
container_memory_rss{
  container_label_cluster="crims",
  container_label_container_name="web-prod--web",
  image="web-prod:abc123",
  instance="10.8.4.91:9104",
  job="crims_cadvisor_metrics"
}
collectd_memory{
  instance="10.8.4.42:9103",
  job="crims_collectd_metrics",
  memory="free"
}

Slide 25

Slide 25 text

APPLICATION METRICS
[Diagram: Prometheus requests GET /metrics from Container A, Container B, …]
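
What a container returns on `GET /metrics` is plain text with one sample per line. The sketch below renders a sample in that style; it is a simplified illustration (no escaping of label values, no HELP or TYPE lines), and `render_metric` is a hypothetical helper rather than part of any Prometheus client library.

```python
def render_metric(name, labels, value):
    """Render one sample as: name{label="value",...} value"""
    # Sort labels so the output is stable between scrapes.
    label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    if label_text:
        return f"{name}{{{label_text}}} {value}"
    return f"{name} {value}"


def render_metrics(samples):
    """Join several (name, labels, value) samples into a response body."""
    return "\n".join(render_metric(*sample) for sample in samples) + "\n"
```

In a real service one would normally use an official Prometheus client library, which also handles escaping, metric types, and concurrency.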

Slide 26

Slide 26 text

SERVICE DISCOVERY
[Diagram: Wonderland Service Discovery locates containers via the Wonderland API, then updates the Prometheus config and reloads it; Prometheus scrapes metrics from Container A, Container B, …]

Slide 27

Slide 27 text

METRIC RETENTION

Slide 28

Slide 28 text

Automatically generated recording rules:
http_requests_total{instance="10.8.3.101:80"} = 53
http_requests_total{instance="10.8.3.102:80"} = 81
http_requests_total{instance="10.8.3.103:80"} = 2
...
job:http_requests_total:sum = sum(http_requests_total) without (instance) = 136
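
The aggregation performed by such a recording rule can be reproduced in plain Python. This is only an illustration of what summing "without (instance)" means, using the three sample values from the slide; the function name `sum_without` is hypothetical.

```python
from collections import defaultdict


def sum_without(samples, dropped_label):
    """Sum sample values after removing one label from every series."""
    totals = defaultdict(int)
    for labels, value in samples:
        # Series that differ only in the dropped label share the same key.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != dropped_label))
        totals[key] += value
    return dict(totals)


samples = [
    ({"instance": "10.8.3.101:80"}, 53),
    ({"instance": "10.8.3.102:80"}, 81),
    ({"instance": "10.8.3.103:80"}, 2),
]

aggregated = sum_without(samples, "instance")
```

With the instance label removed, the three series collapse into a single series whose value is 53 + 81 + 2 = 136, which is the value of `job:http_requests_total:sum` on the slide.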

Slide 29

Slide 29 text

FEDERATION
[Diagram: the long-term Prometheus (32 days) scrapes filtered metrics from the short-term Prometheus (30 min)]
'match[]':
  - '{job="application_metrics", instance=""}'
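
The effect of the `instance=""` matcher is easy to miss: in Prometheus, an equality matcher with an empty value also selects series that do not carry that label at all, so only the aggregated series pass the filter. A minimal Python sketch of that selection logic, with a hypothetical `matches` helper and illustrative series:

```python
def matches(labels, selector):
    """True if every selector entry equals the series' label value.

    A missing label is treated as the empty string, mirroring how an
    empty-valued equality matcher behaves in Prometheus.
    """
    return all(labels.get(name, "") == expected for name, expected in selector.items())


selector = {"job": "application_metrics", "instance": ""}

series = [
    {"__name__": "http_requests_total", "job": "application_metrics", "instance": "10.8.3.101:80"},
    {"__name__": "job:http_requests_total:sum", "job": "application_metrics"},
]

federated = [s["__name__"] for s in series if matches(s, selector)]
```

Here only the aggregated series without an instance label is selected, so the long-term Prometheus stores far fewer series than the short-term one.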

Slide 30

Slide 30 text

[Diagram: the long-term Prometheus scrapes filtered metrics from the short-term Prometheus]
Short-term Prometheus:
http_requests_total{instance="10.8.3.101:80"}
http_requests_total{instance="10.8.3.102:80"}
http_requests_total{instance="10.8.3.103:80"}
...
job:http_requests_total:sum{}
Long-term Prometheus:
job:http_requests_total:sum{}

Slide 31

Slide 31 text

SERVICE DASHBOARD

Slide 32

Slide 32 text

4:12 AM
Auto-scaling broken

Slide 33

Slide 33 text

LET'S TAKE A LOOK AT THE LOGS

Slide 34

Slide 34 text

CENTRALISED LOGGING
• Centralised logging is a must-have in a distributed system
• It should be very easy to gather all information that concerns a service

Slide 35

Slide 35 text

CENTRALISED LOGGING
• Output of all services running on Wonderland is stored centrally
• Optionally, logs are parsed with configurable formats
$ cat wonderland.yaml
---
components:
- name
  image: my-nginx-image
  logging:
    types:
    - access_log
    - error_log_nginx

Slide 36

Slide 36 text

CENTRALISED LOGGING
[Diagram: Docker sends logs via the fluentd protocol to Logbeat, which sends them via the lumberjack protocol to logz.io]
Wonderland Logbeat
• receives logs via the fluent protocol,
• parses them,
• adds metadata,
• and streams them to our logging provider logz.io

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

4:17 AM
You find this log message of the service autoscaler:
Unable to scale-out service "web-delivery". Configured maximum number of instances reached.

Slide 39

Slide 39 text

4:17 AM
You increase the maximum number of instances:
$ cat wonderland.yaml
[…]
auto-scaling:
  min-instances: 60
  max-instances: 150

Slide 40

Slide 40 text

4:20 AM
Back to bed

Slide 41

Slide 41 text

2:00 PM
In the PMA for last night's incident, you create this action item:
Monitor the number of instances of web-delivery to detect potential breaches of auto-scaling limits before they affect the system's health
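
One way to phrase such a check is to alert when the running instance count gets close to the configured maximum. The Python sketch below illustrates that logic; the function name `scaling_headroom_alert` and the 90% threshold are assumptions for illustration, and in the setup described in this talk the check would more likely be expressed as a Prometheus alerting rule.

```python
def scaling_headroom_alert(current_instances, max_instances, threshold=0.9):
    """True when the service uses at least `threshold` of its allowed instances.

    The 0.9 default is an arbitrary example value, not taken from the talk.
    """
    if max_instances <= 0:
        raise ValueError("max_instances must be positive")
    return current_instances / max_instances >= threshold
```

With the values from the incident (`max-instances: 150`), a service running 140 instances would trigger the alert, while one running at the minimum of 60 would not.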

Slide 42

Slide 42 text

QUESTIONS?

Slide 43

Slide 43 text

THANK YOU

Slide 44

Slide 44 text

FURTHER READING / SOURCES
• Beyer, Jones, Petoff & Murphy: Site Reliability Engineering
• Susan Fowler: Production-Ready Microservices
• Sam Newman: Building Microservices
• Stripe / Increment: On-Call (https://increment.com/on-call/)
• Mathias Lafeldt & Paul Seiffert: A Journey Through Wonderland (https://speakerdeck.com/mlafeldt/a-journey-through-wonderland)

Slide 45

Slide 45 text

PHOTOS
• Marcel Stockmann: https://www.flickr.com/photos/marcelstockmann/33068471286
• Michael Theis: https://www.flickr.com/photos/huskyte/6931056896