Building a Global Cross-Cloud Monitoring Platform

KubeCon + CloudNativeCon | Open Source Summit China 2019

How Improbable have leveraged Open Source to build a Global Cross-Cloud monitoring platform.

Dominic Green

June 25, 2019

Transcript

  1. Building a Global Cross-Cloud Monitoring Platform
    Dominic Green, Software Engineer, [email protected], @domgreen
    Yifan Zhao, Co-founder, Improbable China, [email protected]
    25th June 2019, Shanghai, China
  2. Yifan Zhao
    ▪ Co-founder of Improbable China
      ◦ Heads up the Engineering Division
    ▪ Part of Improbable’s Founding Team
      ◦ Core contributor to the SpatialOS platform
      ◦ Led the build-out of the infrastructure team in London
  3. If a tree falls in a forest and no one is around to hear it, does it make a sound?
  4. Founded: 2012
    Employees: 370
    Games in Development: 20+
    Our Mission: Make Impossible Games Possible
    "Improbable’s platform, SpatialOS, is designed to let anyone build massive simulations, running in the cloud: imagine Minecraft with thousands of players in the same space…. Its ultimate goal: to create totally immersive, persistent virtual worlds." - WIRED, May 2017
  5. Dominic Green
    ▪ Software Engineer @ Improbable
    ▪ Observability Team
    ▪ OSS Contributor
      ◦ Thanos
      ◦ go-grpc-middleware
      ◦ go-httpwares
    ▪ Meetup Organiser
      ◦ Prometheus London
      ◦ London Gophers
  6. Define: Monitoring
    “Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes.”
    https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/
  7. Prometheus /metrics
    # TYPE app_request_total counter
    app_request_total 1337
    # TYPE app_request_in_flight_total gauge
    app_request_in_flight_total 3
    # TYPE app_request_duration_ms histogram
    app_request_duration_ms_bucket{le="0.005"} 500
    app_request_duration_ms_bucket{le="0.01"} 213
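As a rough illustration of the text format on this slide, the following sketch renders a counter, a gauge and a histogram by hand. It is not the Prometheus client library; the helper names `render_scalar` and `render_histogram` and the sample observations are assumptions made for the example. It also shows that histogram buckets are cumulative: each `le` bucket counts every observation less than or equal to its bound.

```python
from bisect import bisect_left


def render_scalar(name, metric_type, value):
    """Render a counter or gauge as a TYPE line followed by one sample line."""
    return [f"# TYPE {name} {metric_type}", f"{name} {value}"]


def render_histogram(name, upper_bounds, observations):
    """Render a histogram with cumulative buckets plus _sum and _count."""
    bounds = sorted(upper_bounds)
    counts = [0] * (len(bounds) + 1)  # the last slot is the +Inf bucket
    for obs in observations:
        # Index of the first bound that is >= obs, i.e. the "le" bucket.
        counts[bisect_left(bounds, obs)] += 1

    lines = [f"# TYPE {name} histogram"]
    running = 0
    for bound, count in zip(bounds + ["+Inf"], counts):
        running += count  # buckets are cumulative
        lines.append(f'{name}_bucket{{le="{bound}"}} {running}')
    lines.append(f"{name}_sum {sum(observations)}")
    lines.append(f"{name}_count {len(observations)}")
    return lines


if __name__ == "__main__":
    output = (
        render_scalar("app_request_total", "counter", 1337)
        + render_scalar("app_request_in_flight_total", "gauge", 3)
        + render_histogram(
            "app_request_duration_ms", [0.005, 0.01], [0.001, 0.005, 0.007, 0.5]
        )
    )
    print("\n".join(output))
```

In practice a client library would produce this output for the `/metrics` endpoint; the sketch only makes the line structure visible.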
  8. Prometheus
    [Diagram: Prometheus scrapes /metrics from services every 15s. Components: Scrape Engine, Query Engine, Rule & Alert Engine, Compactor, Local storage. Grafana uses the HTTP Query API; alerts go to Alertmanager.]
  9. Prometheus
    Pods: pod-a, pod-b, pod-c

    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8080"
        prometheus.io/scheme: http
        prometheus.io/scrape: "true"

    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs: ...

    [Diagram: Prometheus stores blocks on SSD]
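The `prometheus.io/*` annotations are a convention that the (elided) `relabel_configs` turn into a scrape address; Prometheus does not interpret them on its own. A minimal sketch of that mapping, assuming a hypothetical `scrape_target` helper and a made-up pod IP:

```python
def scrape_target(pod_ip, annotations):
    """Return the URL to scrape for a pod, or None if the pod has not opted in.

    Mirrors what typical relabel rules do with the prometheus.io/* annotations.
    """
    if annotations.get("prometheus.io/scrape") != "true":
        return None  # pods without the opt-in annotation are dropped
    scheme = annotations.get("prometheus.io/scheme", "http")
    path = annotations.get("prometheus.io/path", "/metrics")
    port = annotations.get("prometheus.io/port")
    host = f"{pod_ip}:{port}" if port else pod_ip
    return f"{scheme}://{host}{path}"


if __name__ == "__main__":
    annotations = {
        "prometheus.io/path": "/metrics",
        "prometheus.io/port": "8080",
        "prometheus.io/scheme": "http",
        "prometheus.io/scrape": "true",
    }
    # 10.0.0.7 is a hypothetical pod IP used only for illustration.
    print(scrape_target("10.0.0.7", annotations))
```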
  10. Single Cluster
    Kubernetes
      ▪ Basis for workload management
      ▪ Kubernetes Service Discovery for discovering workloads
      ▪ Mature tooling and automation
    Prometheus
      ▪ Collection of data from workloads
      ▪ Data queried directly from Prometheus Scrapers
      ▪ Fast becoming industry standard for metric collection
  11. Single Cluster
    • Pros
      ◦ Simple
      ◦ Easy to Monitor
    • Cons
      ◦ Redundancy
      ◦ Latency
  12. Multi-Cluster - Prometheus
    [Diagram: US Games Cluster (Prometheus); EU Games Cluster (Prometheus); Hub Cluster (EU || US) (Prometheus, Grafana)]
  13. Multi-Cluster - Federation
    [Diagram: US Games Cluster (Prometheus); EU Games Cluster (Prometheus); Hub Cluster (EU || US) (Prometheus, Grafana). The Hub Prometheus reads /federate from each games cluster.]
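With federation, the hub Prometheus scrapes the `/federate` endpoint of each downstream Prometheus and selects series with one or more `match[]` parameters. The sketch below only builds such a URL; the `federate_url` helper and the host name are assumptions for illustration.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a /federate URL with one match[] parameter per series selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    # Hypothetical address of the Prometheus in the US games cluster.
    print(
        federate_url(
            "http://prometheus.us-games.example:9090",
            ['{job="kubernetes-pods"}'],
        )
    )
```

Each selector pulls every matching series across the network on every scrape, which is one reason federation gets harder to operate as the number of clusters and series grows.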
  14. Thanos Project
    ▪ Global query view of Metrics.
    ▪ Unlimited retention of Metrics.
    ▪ High availability of components, including Prometheus.
    ▪ Downsampling of Metrics.
  15. Multi-Cluster - Thanos Sidecar
    [Diagram: Cluster 0 ... Cluster N (US Games Cluster, EU Games Cluster, ...), each with Prometheus (SSD) exposed over gRPC (Store API)]
  16. Multi-Cluster - Thanos Query
    [Diagram: Cluster 0 ... Cluster N (US Games Cluster, EU Games Cluster, ...), each with Prometheus (SSD); Query serves HTTP and reaches the clusters over gRPC (Store API)]
  17. Multi-Cluster - Federation
    [Diagram: US Games Cluster (Prometheus); EU Games Cluster (Prometheus); Hub Cluster (EU || US) (Prometheus, Grafana). The Hub Prometheus reads /federate from each games cluster.]
  18. Multi-Cluster - Thanos Query
    [Diagram: Cluster 0 ... Cluster N, each with Prometheus (SSD) and Query; Hub Cluster (EU || US) with Grafana, Prometheus (SSD) and Query]
  19. Multi-Cluster - Thanos Sidecar
    [Diagram: within a cluster, Prometheus blocks on SSD are uploaded to Object Storage]
    Currently Supported:
      - Google Cloud Storage
      - S3
      - Azure Blob Storage
      - Tencent
      - Aliyun OSS (soon)
  20. Multi-Cluster - Storage
    [Diagram: Cluster 0 ... Cluster N, each with Prometheus (SSD) and Query; Hub Cluster (EU || US) with Grafana, Prometheus (SSD) and Query; Object Storage]
  21. Multi-Cluster - Storage
    [Diagram: Cluster 0 ... Cluster N, each with Prometheus (SSD) and Query; Hub Cluster (EU || US) with Grafana, Prometheus (SSD) and Query; Object Storage; Store, reached over gRPC (Store API)]
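Because the sidecars and the Store component sit behind the same Store API, a querier can treat recent data on SSD and older data in object storage alike. One simplified way to picture the fan-out is that each store advertises the time range it holds and the querier only contacts stores whose range overlaps the query. The `Store` class, `select_stores` helper, names and time values below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Store:
    """A Store API endpoint and the time range (in hours) it can serve."""

    name: str
    min_time: int
    max_time: int

    def overlaps(self, start, end):
        return self.min_time <= end and start <= self.max_time


def select_stores(stores, start, end):
    """Return the names of stores worth querying for the range [start, end]."""
    return [s.name for s in stores if s.overlaps(start, end)]


if __name__ == "__main__":
    now = 1000  # hypothetical "current hour"
    stores = [
        Store("sidecar-cluster-0", now - 2, now),  # recent data on SSD
        Store("sidecar-cluster-n", now - 2, now),
        Store("store-gateway", 0, now - 2),        # blocks in object storage
    ]
    print(select_stores(stores, now - 1, now))   # dashboard for the last hour
    print(select_stores(stores, now - 48, now))  # two-day query
```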
  22. Multi-Cluster - High Availability
    [Diagram: each Cluster N and the Hub Cluster (EU || US) run a pair of Prometheus instances (SSD), labelled “replica”: thanos-0 and “replica”: thanos-1, plus Query; Grafana in the Hub Cluster; Object Storage; Store]
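The “replica” labels on this slide are what allow a querier to treat the two Prometheus instances as one logical source. The sketch below is a deliberately simplified version of that idea: it drops the replica label and merges samples by timestamp, assuming both replicas scrape at identical timestamps. Real replicas scrape at slightly different times and Thanos handles that with a more involved algorithm, so treat the `deduplicate` helper as an illustration only.

```python
def deduplicate(series_list, replica_label="replica"):
    """Merge series that differ only in the replica label.

    series_list holds (labels, samples) pairs, where labels is a dict and
    samples is a list of (timestamp, value) tuples.
    """
    merged = {}
    for labels, samples in series_list:
        # Series identity once the replica label is ignored.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        by_timestamp = merged.setdefault(key, {})
        for ts, value in samples:
            by_timestamp.setdefault(ts, value)  # first replica seen wins
    return {key: sorted(by_ts.items()) for key, by_ts in merged.items()}


if __name__ == "__main__":
    base = {"__name__": "up", "cluster": "eu"}
    replica_0 = ({**base, "replica": "thanos-0"}, [(0, 1.0), (15, 1.0), (45, 1.0)])
    replica_1 = ({**base, "replica": "thanos-1"}, [(0, 1.0), (15, 1.0), (30, 1.0)])
    # thanos-0 missed the scrape at t=30; thanos-1 fills the gap.
    for key, samples in deduplicate([replica_0, replica_1]).items():
        print(dict(key), samples)
```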
  23. Multi-Cluster - Compaction
    [Diagram: each Cluster N with two Prometheus instances (SSD) and Query; Hub Cluster (EU || US) with two Prometheus instances (SSD), Grafana and Query; Object Storage; Store; Compact]
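Besides compacting blocks in object storage, the Compact component is where Thanos produces downsampled data (5 minute and 1 hour resolutions). As a rough sketch of what downsampling means, the function below folds raw samples into fixed windows and keeps a few aggregates per window. The `downsample` helper and the choice of aggregates are assumptions for illustration, not the Thanos block format.

```python
def downsample(samples, window_ms=300_000):
    """Aggregate (timestamp_ms, value) samples into fixed-size windows.

    Returns a dict keyed by window start time with count, sum, min and max,
    so averages and extremes can still be computed at the lower resolution.
    """
    windows = {}
    for ts, value in samples:
        start = ts - ts % window_ms  # start of the window this sample falls in
        agg = windows.setdefault(
            start, {"count": 0, "sum": 0.0, "min": value, "max": value}
        )
        agg["count"] += 1
        agg["sum"] += value
        agg["min"] = min(agg["min"], value)
        agg["max"] = max(agg["max"], value)
    return dict(sorted(windows.items()))


if __name__ == "__main__":
    raw = [(0, 1.0), (15_000, 3.0), (300_000, 5.0)]  # 15s scrapes, in ms
    for start, agg in downsample(raw).items():
        print(start, agg)
```

Keeping aggregates rather than a single averaged value is what lets long-range queries stay cheap without hiding spikes.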
  24. Multi-Cluster
    [Diagram: each Cluster N with two Prometheus instances (SSD) and Query; Hub Cluster (EU || US) with two Prometheus instances (SSD), Grafana and Query; Object Storage; Store; Compact]
  25. Multi-Cluster
    Thanos
      ▪ Global View, Retention, HA, Downsampling
      ▪ Pulls Metrics from Object Storage or Thanos Sidecar
      ▪ Builds on existing Prometheus infrastructure
      ▪ Consistent approach in all clusters
    Kubernetes
      ▪ Kubernetes Service Discovery for discovering workloads
      ▪ Mature tooling and automation
    Prometheus
      ▪ Collection of data from workloads
      ▪ Federation can be problematic
  26. Multi-Cluster
    • Pros
      ◦ Reduced Latency
      ◦ High Availability
        ▪ Cluster Level
        ▪ Workload level
      ◦ Global Query
      ◦ Long Term Metrics
    • Cons
      ◦ Observability is harder
      ◦ Increased Complexity
      ◦ Automation?
      ◦ Tooling?
  27. Multi-Cluster - Networking
    [Diagram: each Cluster N with two Prometheus instances (SSD) and Query; Hub Cluster (EU || US) with two Prometheus instances (SSD), Grafana and Query; Object Storage; Store; Compact]
  28. Multi-Cluster - Networking
    [Diagram: same components as slide 27]
  29. Multi-Cluster - Networking
    [Diagram: same components as slide 27]
  30. Multi-Cloud - Networking
    [Diagram: same components as slide 27]
  31. Multi-Cloud - Envoy
    [Diagram: each Cluster N with two Prometheus instances (SSD), Query and Envoy; Hub Cluster (EU || US) with two Prometheus instances (SSD), Grafana, Query and Envoy; Object Storage; Store; Compact]
  32. Multi-Cloud - Envoy
    [Diagram: each Cluster N with two Prometheus instances (SSD), Query and Envoy; Hub Cluster (EU || US) with two Prometheus instances (SSD), Grafana, Query and Envoy; Object Storage; Store; Compact]
    Envoy endpoints:
      https://envoy.gcp.i8e.io
      https://envoy.aws.i8e.io
      https://envoy.az.i8e.io
      https://envoy.ali.i8e.io
  33. Multi-Cloud - Persistence
    [Diagram: each Cluster N with two Prometheus instances (SSD), Query and Envoy; Hub Cluster (EU || US) with two Prometheus instances (SSD), Grafana, Query and Envoy; Object Storage; Store; Compact]
  34. Multi-Cloud - Querying
    [Diagram: same components as slide 33]
  35. Multi-Cloud - Thanos Receive
    [Diagram: each Cluster N with two Prometheus instances (SSD) that Remote Write to Receive; Hub Cluster (EU || US) with two Prometheus instances (SSD), Grafana, Query, Store and Compact; Object Storage]
  36. Multi-Cloud
    Thanos
      ▪ Global View, Retention, HA, Downsampling
      ▪ Flexible StoreAPI allows for different usage scenarios
      ▪ Consistent approach in all clouds & clusters
    Kubernetes
      ▪ Kubernetes Service Discovery for discovering workloads
    Prometheus
      ▪ Collection of metrics from workloads
      ▪ TSDB Storage format
    Envoy
      ▪ Edge Proxy giving the same approach to cross-cluster and cross-cloud communication
  37. Multi-Cloud
    • Pros
      ◦ Reduced Latency
      ◦ High Availability
        ▪ Cluster Level
        ▪ Workload level
      ◦ Global Query
      ◦ Long Term Metrics
    • Cons
      ◦ Observability is harder
      ◦ Increased Complexity
      ◦ Automation??
      ◦ Tooling??
  38. Learn More
    • High Available + Scalable Prometheus with Thanos in Alibaba
      ◦ Guo'an Qin, Alibaba & Tao Li, Alibaba
    • Metric monitoring architecture at Improbable using Thanos
      ◦ Bartłomiej Płotka & Dominic Green
      ◦ https://improbable.io/blog/thanos-architecture-at-improbable
    • Autoscaling Multi-Cluster Observability with Thanos and Linkerd
      ◦ Andrew Seigner & Frederic Branczyk
      ◦ https://www.youtube.com/watch?v=qTxunwzYO0g
    • Thanos - Transforming Prometheus to a Global Scale in a Seven Simple Steps
      ◦ Bartłomiej Płotka
      ◦ https://www.youtube.com/watch?v=Iuo1EjCN5i4