Cloud Run Reliability/Observability at ソウゾウ

Ryuzo Yamamoto

April 19, 2023

Transcript

  1. Agenda
     • Architecture, Tech Stack
     • Observability
       ◦ Logs, Metrics, Traces
     • Reliability
       ◦ SLOs & Monitors as Code
  2. Architecture (diagram)
     • Cloud Load Balancing
     • Cloud Run (70+ services): Next.js, GraphQL, imgproxy, microservice(s)
     • Cloud SQL, Memorystore, Cloud Storage
  3. Tech Stack
     • Monorepo
       ◦ Go, TypeScript, Python, Java
       ◦ 70+ microservices
     • Bazel, Turborepo
     • GraphQL / gRPC
     • Serverless (Cloud Run)
     • PostgreSQL, Redis
     • Cloud Pub/Sub, Tasks, Workflows, Scheduler, Vertex AI
  4. Agenda
     • Architecture, Tech Stack
     • Observability
       ◦ Logs, Metrics, Traces
     • Reliability
       ◦ SLOs & Monitors as Code
  5. Observability
     • Logs
       ◦ JSON structured logging
       ◦ Cloud Logging -> BigQuery
     • Metrics
       ◦ Log-based Metrics
       ◦ Cloud Monitoring -> Datadog
     • Traces
       ◦ OpenTelemetry -> Datadog
  6. Observability - Logs
     Flow (diagram): microservice (Cloud Run container) -> log to STDOUT / STDERR -> Logging -> Sink -> BigQuery
     Example log entry:
       {
         "message": "failed to say hello",
         "something_id": "xxxxxxxx",
         "serviceContext": {
           "version": "1.0.1",
           "service": "echo"
         },
         "metadata": {
           "user-agent": "graphql/1.0.0 grpc-node-js/1.7.3"
         }
       }
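The log entry on this slide can be produced with nothing more than the standard library. The following is a minimal sketch, not the team's actual logger: the `Entry` type, the `formatEntry` helper, and the `severity` field are assumptions added for illustration (Cloud Logging does recognize `severity` and `message` in JSON written to STDOUT / STDERR, but the slide does not show it).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ServiceContext mirrors the "serviceContext" object shown on the slide.
type ServiceContext struct {
	Version string `json:"version"`
	Service string `json:"service"`
}

// Entry is a hypothetical log record. Only message, something_id,
// serviceContext, and metadata come from the slide; severity is an assumption.
type Entry struct {
	Severity       string            `json:"severity"`
	Message        string            `json:"message"`
	SomethingID    string            `json:"something_id,omitempty"`
	ServiceContext ServiceContext    `json:"serviceContext"`
	Metadata       map[string]string `json:"metadata,omitempty"`
}

// formatEntry encodes one entry as a single-line JSON string.
func formatEntry(e Entry) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	line, err := formatEntry(Entry{
		Severity:       "ERROR",
		Message:        "failed to say hello",
		SomethingID:    "xxxxxxxx",
		ServiceContext: ServiceContext{Version: "1.0.1", Service: "echo"},
		Metadata:       map[string]string{"user-agent": "graphql/1.0.0 grpc-node-js/1.7.3"},
	})
	if err != nil {
		panic(err)
	}
	// One JSON object per line on STDOUT is what the container runtime collects.
	fmt.Println(line)
}
```

Keeping each record on one line matters here, because the platform treats every line written by the container as a separate log entry.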
  7. Observability - Metrics
     Flow (diagram): microservice (Cloud Run container, gRPC interceptor) -> log to STDOUT / STDERR -> Logging -> Log-based Metrics -> Monitoring (Log-based Metrics + Other GCP Metrics)
     Example log entry written by the gRPC interceptor:
       {
         "message": "grpc: finished server unary /echo.EchoService/Hello",
         "grpc": {
           "type": "unary",
           "kind": "server",
           "latency": 0.002360152,
           "code": "OK",
           "method": "Hello",
           "service": "echo.EchoService"
         },
         "serviceContext": {
           "version": "1.0.1",
           "service": "echo"
         },
         "metadata": {
           "user-agent": "graphql/1.0.0 grpc-node-js/1.7.3"
         }
       }
  8. Observability - Traces
     Flow (diagram): Next.js, GraphQL, and microservice(s) (each on Cloud Run) -> OTLP (gRPC) -> datadog-agent (Cloud Run)
  9. Agenda
     • Architecture, Tech Stack
     • Observability
       ◦ Logs, Metrics, Traces
     • Reliability
       ◦ SLOs & Monitors as Code
  10. Reliability - SLOs & Monitors as Code
      • Set SLOs for every gRPC method
        ◦ Availability (e.g. 99.9% / 30 days)
        ◦ Latency (e.g. p95 100ms)
      • Automated configuration
        ◦ protobuf plugin + Terraform module
      • Multiwindow, Multi-Burn-Rate Alerts
        ◦ https://docs.datadoghq.com/monitors/service_level_objectives/burn_rate/
        ◦ https://sre.google/workbook/alerting-on-slos/#6-multiwindow-multi-burn-rate-alerts
  11. Reliability - SLOs & Monitors as Code
      ...
      rpc Hello(HelloRequest) returns (HelloResponse) {
        option (extension.v2.method_monitoring) = {
          availability: { goal: 99.5 }
          latency: { threshold_ms: 100 percentile: 95 }
        };
      }
      ...
      Flow (diagram): protoc -> Terraform configuration -> apply by CI (GitHub Actions) -> SLO monitors
  12. Wrap Up
      • Architecture, Tech Stack
      • Observability
        ◦ Logs, Metrics, Traces
      • Reliability
        ◦ SLOs & Monitors as Code