Monitoring Wonderland

Paul Seiffert

March 13, 2019

Transcript

  1. MONITORING WONDERLAND

    HELP, WHAT IS HAPPENING?
  2. PAUL SEIFFERT

    Team Lead at Jimdo
    Cloud Infrastructure Engineer
    @seiffertp
    [email protected]
  3. WONDERLAND

    • Jimdo’s internal PaaS that runs 400 services
    • 5000 Docker containers at a time
    • ~600 deployments a day
  4. WONDERLAND

    [Diagram labels: AWS; OTHER SERVICE PROVIDERS; INFRASTRUCTURE AUTOMATION; APIS; MONITORING, LOGGING; CLI TOOLS; WONDERLAND; OTHER TOOLING]
  5. WONDERLAND

    [Diagram labels: WONDERLAND API; AWS ECS; ECS AGENT; LOGGING DAEMON; METRIC DAEMON; EC2 Instance]
  6. IMAGINE …

    • Your team is responsible for the software component that delivers 20 million customer websites
    • You are on call tonight
  7. PAGERDUTY CALLS

    • either because a health check failed
    • or because a metric exceeded a configured threshold
  8. [Diagram labels: HEALTH CHECKS; ALERTMANAGER; PROMETHEUS]
  9. API HEALTH CHECKS

    • All services on Wonderland: Route53 health checks
    • Infrastructure components: Pingdom checks

    GET /health
    HTTP/1.1 200 OK
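
The GET /health exchange on this slide can be sketched with the Python standard library. This is a minimal illustration, not Wonderland's implementation: the handler class, the `probe` helper, and the plain-text "OK" body are assumptions. Only the path and the 200 status come from the slide.

```python
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical service handler that answers GET /health with 200 OK."""

    def do_GET(self):
        if self.path == "/health":
            body = b"OK"  # assumed body; the slide only shows the status line
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def probe(url, timeout=2.0):
    """Return True if the endpoint answers with HTTP 200, as an external checker would."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection errors, timeouts, and non-2xx responses
        return False
```

To try it, start `HTTPServer(("127.0.0.1", 0), HealthHandler)` in a background thread and call `probe` against its `/health` path. An external checker such as Route53 or Pingdom plays the role of `probe` in production.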
  10. WORKER HEALTH CHECKS

    • Workers write a metric to the Prometheus pushgateway after each processed message
    • For cron jobs, Wonderland automatically notifies cronitor.io about executions
    • Dead man’s switch: if no notification arrives within a certain time, an alert is created
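
The dead man's switch idea can be sketched in a few lines. This is an illustration of the principle only. The class name, the `max_silence_seconds` parameter, and the injectable clock are assumptions, and the sketch does not talk to the pushgateway or cronitor.io.

```python
import time


class DeadMansSwitch:
    """Raises an alert condition when no notification arrived for too long."""

    def __init__(self, max_silence_seconds, clock=time.monotonic):
        self.max_silence_seconds = max_silence_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._last_seen = clock()

    def notify(self):
        """Called by the worker or cron job after each successful run."""
        self._last_seen = self._clock()

    def should_alert(self):
        """True once the silence has lasted longer than the allowed window."""
        return self._clock() - self._last_seen > self.max_silence_seconds
```

The key property is that the alert fires on the absence of a signal, so a worker that crashes silently is still detected.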
  11. SEMANTIC MONITORING / SYNTHETIC MONITORING

    Run tests against production periodically, monitor results, and alert on issues
  12. SERVICE DASHBOARD
  13. GRAFANA

    • Each service running on Wonderland automatically has a dashboard showing key metrics for debugging
    • Developers can create custom dashboards for more detailed analysis
    • Grafana pulls data from Prometheus instances
  14. PROMETHEUS

    • Semi-centralized metric system
    • Pull-based metric retrieval
    • On-the-fly calculation of derived metrics
  15. METRICS

    INFRASTRUCTURE METRICS
    SYSTEM METRICS
    APPLICATION METRICS
  16. INFRASTRUCTURE METRICS

    [Diagram labels: PROMETHEUS; CLOUDWATCH EXPORTER; AWS; CUSTOM EXPORTERS; WONDERLAND APIS]
  17. EXAMPLES

    aws_autoscaling_group_desired_capacity_average{
      auto_scaling_group_name="crims",
      job="cloudwatch_exporter"
    }

    aws_elb_request_count_sum{
      cluster="crims",
      job="wonderland_elb_exporter",
      service_name="web-prod"
    }
  18. SYSTEM METRICS

    [Diagram labels: PROMETHEUS; COLLECTD; CADVISOR]
  19. EXAMPLES

    container_memory_rss{
      container_label_cluster="crims",
      container_label_container_name="web-prod--web",
      image="web-prod:abc123",
      instance="10.8.4.91:9104",
      job="crims_cadvisor_metrics"
    }

    collectd_memory{
      instance="10.8.4.42:9103",
      job="crims_collectd_metrics",
      memory="free"
    }
  20. APPLICATION METRICS

    [Diagram labels: PROMETHEUS; CONTAINER A; CONTAINER B; …]

    GET /metrics
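
What a container might return for GET /metrics can be sketched by rendering a counter in the Prometheus text exposition format. This is a hand-rolled illustration under assumptions: the function name and the `method` label are hypothetical, label values are not escaped, and a real service would normally use an official Prometheus client library instead.

```python
def render_counter(name, help_text, samples):
    """Render one counter metric in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Note: label values are not escaped in this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        selector = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample; the label name "method" is an assumption.
body = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET"}, 53)],
)
```

Prometheus adds the `instance` and `job` labels itself when it scrapes the target, so the application does not need to emit them.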
  21. SERVICE DISCOVERY

    [Diagram labels: PROMETHEUS; CONTAINER A; CONTAINER B; …; WONDERLAND SERVICE DISCOVERY; WONDERLAND API]
    [Arrow labels: update config and reload; locate containers; scrape metrics]
  22. Automatically generated recording rules:

    http_requests_total{instance="10.8.3.101:80"} = 53
    http_requests_total{instance="10.8.3.102:80"} = 81
    http_requests_total{instance="10.8.3.103:80"} = 2
    ...

    job:http_requests_total:sum = sum(http_requests_total) without (instance) = 136
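
The arithmetic behind the recording rule can be reproduced in plain Python: drop the `instance` label and sum the values of all series that become identical. The helper name `sum_without` is an assumption for illustration. Prometheus evaluates the rule itself, so no such code runs in Wonderland.

```python
from collections import defaultdict


def sum_without(samples, dropped_label):
    """Sum sample values after removing one label, like `sum ... without (label)`."""
    totals = defaultdict(float)
    for labels, value in samples:
        # Series that differ only in the dropped label share the same key.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != dropped_label))
        totals[key] += value
    return dict(totals)


# The three per-instance series shown on the slide.
samples = [
    ({"instance": "10.8.3.101:80"}, 53),
    ({"instance": "10.8.3.102:80"}, 81),
    ({"instance": "10.8.3.103:80"}, 2),
]
result = sum_without(samples, "instance")  # result == {(): 136.0}
```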

  23. FEDERATION

    [Diagram: LONG-TERM PROMETHEUS (32 DAYS) scrapes filtered metrics from SHORT-TERM PROMETHEUS (30 MIN)]

    'match[]':
      - '{job="application_metrics", instance=""}'
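
The effect of the `match[]` filter can be sketched as follows. Prometheus exposes federation at the `/federate` endpoint with `match[]` query parameters, and a matcher such as `instance=""` selects series that have no `instance` label, which here means the aggregated series. The host name `short-term:9090` and both helper names are assumptions for illustration.

```python
from urllib.parse import urlencode


def federate_url(base_url, matchers):
    """Build the URL a federating Prometheus would scrape."""
    query = urlencode([("match[]", m) for m in matchers])
    return f"{base_url}/federate?{query}"


def is_federated(labels, job):
    """Mimic the selector {job="<job>", instance=""}.

    In a Prometheus selector, an empty-string matcher also matches
    series where the label is absent.
    """
    return labels.get("job") == job and labels.get("instance", "") == ""


matcher = '{job="application_metrics", instance=""}'
url = federate_url("http://short-term:9090", [matcher])  # hypothetical host
```

With this filter, per-instance series stay on the short-term server and only the pre-aggregated series are copied to long-term storage.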
  24. [Diagram: LONG-TERM PROMETHEUS scrapes filtered metrics from SHORT-TERM PROMETHEUS]

    SHORT-TERM PROMETHEUS:
    http_requests_total{instance="10.8.3.101:80"}
    http_requests_total{instance="10.8.3.102:80"}
    http_requests_total{instance="10.8.3.103:80"}
    ...
    job:http_requests_total:sum{}

    LONG-TERM PROMETHEUS:
    job:http_requests_total:sum{}
  25. SERVICE DASHBOARD
  26. LET’S TAKE A LOOK AT THE LOGS
  27. CENTRALISED LOGGING

    • Centralised logging is a must-have in a distributed system
    • It should be very easy to gather all information that concerns a service
  28. CENTRALISED LOGGING

    • Output of all services running on Wonderland is stored centrally
    • Optionally, logs are parsed with configurable formats

    $ cat wonderland.yaml
    ---
    components:
    - name
      image: my-nginx-image
      logging:
        types:
        - access_log
        - error_log_nginx
  29. CENTRALISED LOGGING

    [Diagram: DOCKER → (fluentd protocol) → LOGBEAT → (lumberjack protocol) → LOGZ.IO]

    Wonderland Logbeat
    • receives logs via fluent protocol,
    • parses them,
    • adds metadata,
    • and streams them to our logging provider logz.io
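
The parse, enrich, and stream steps can be sketched as a small pipeline. This is not Logbeat's code: the regular expression covers only a common access-log prefix, the metadata fields `service` and `component` are hypothetical, and the `sink` callable stands in for the connection to the logging provider.

```python
import json
import re

# Assumed format: the leading fields of a typical web server access log line.
ACCESS_LOG = re.compile(
    r'(?P<remote_addr>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse(line, log_type):
    """Parse a line according to its configured type; fall back to raw text."""
    if log_type == "access_log":
        match = ACCESS_LOG.match(line)
        if match:
            return match.groupdict()
    return {"message": line}


def enrich(event, service, component):
    """Attach metadata that identifies where the log line came from."""
    return {**event, "service": service, "component": component}


def process(lines, log_type, service, component, sink):
    """Parse, enrich, and forward each line as one JSON document."""
    for line in lines:
        event = enrich(parse(line.rstrip("\n"), log_type), service, component)
        sink(json.dumps(event, sort_keys=True))
```

Lines that do not match the configured format are forwarded unparsed rather than dropped, so no output is lost when a format is misconfigured.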
  30. 4:17 AM

    You find this log message of the service autoscaler:

    Unable to scale-out service “web-delivery”. Configured maximum number of instances reached.
  31. 4:17 AM

    You increase the maximum number of instances:

    $ cat wonderland.yaml
    […]
    auto-scaling:
      min-instances: 60
      max-instances: 150
  32. 2:00 PM

    In the PMA for last night’s incident, you create the action item to

    Monitor the number of instances of web-delivery to detect potential breaches of auto-scaling limits before they affect the system’s health
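
The action item could translate into a check like the one below, which warns before the instance count reaches the configured maximum. The function name and the 90% warning ratio are assumptions. In Wonderland such a rule would more plausibly be a Prometheus alert than application code.

```python
def scaling_headroom_alert(current_instances, max_instances, warn_ratio=0.9):
    """Return True when the running instance count is close to the configured limit.

    warn_ratio is an assumed threshold: 0.9 means "alert at 90% of the maximum".
    """
    if max_instances <= 0:
        raise ValueError("max_instances must be positive")
    return current_instances >= warn_ratio * max_instances
```

With `max-instances: 150` from the earlier slide, 140 running instances would trigger the warning while 100 would not.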
  33. FURTHER READING / SOURCES

    • Beyer, Jones, Petoff & Murphy: Site Reliability Engineering
    • Susan Fowler: Production-Ready Microservices
    • Sam Newman: Building Microservices
    • Stripe / Increment: On-Call (https://increment.com/on-call/)
    • Mathias Lafeldt & Paul Seiffert: A Journey Through Wonderland (https://speakerdeck.com/mlafeldt/a-journey-through-wonderland)
  34. PHOTOS

    • Marcel Stockmann: https://www.flickr.com/photos/marcelstockmann/33068471286
    • Michael Theis: https://www.flickr.com/photos/huskyte/6931056896