
Cloud Native Observability

Traces, Logs and Metrics in the time of Containers

Joy Bhattacherjee

May 26, 2018

Transcript

  1. The Correctness Pyramid: Unit Tests, Integration Tests, Smoke Tests, Dark Prod, Canary Prod,
     Chaos and Fuzz testing for unpredictable input and undesired state.
  2. The Availability Equation
     Ao = Tm / (Tm + Td)
     Ap = Tm / (Tm + Tp)
     Tp = (MTTR + MLDT + MAMDT) / MTBF
     Ao = Operational Availability, Ap = Predictive Availability, Tm = Task Duration,
     Td = Downtime, Tp = Predictive Downtime, MTTR = Mean Time to Recover,
     MLDT = Mean Logistics Delay Time, MAMDT = Mean Active Maintenance Downtime,
     MTBF = Mean Time Between Failure
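     A quick sanity check with hypothetical numbers (not from the slides): a task running for
     Tm = 720 hours with Td = 8 hours of downtime gives

       % illustrative arithmetic only; 720 h and 8 h are made-up figures
       A_o = \frac{T_m}{T_m + T_d} = \frac{720}{720 + 8} \approx 0.989

     i.e. roughly 98.9% operational availability.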
  3. Look Cap’n! It’s an Iceberg: symptoms above the waterline, root cause below it.
     Symptoms: Request Rate, Errors, Duration; Utilization, Saturation, Errors; health checks, uptime, latency.
     Root cause: metrics, logs, traces, profiler, debugger, dependency analyzer.
  4. Observability: the effort to increase the intrinsic predictability of a system, for arbitrarily
     many input states and across endless mutations of the underlying platform... ...by recording
     granular system state continuously and feeding it back into the system as architectural changes
     and instrumentation that further increase visibility.
     The loop: System -> Visibility -> Understanding -> Architecture / Instrumentation -> System.
  5. The Three Pillars, a Taxonomy
     Logs (plaintext, structured, binary): exception handling, debugging, RCA, audit.
     Metrics (RED, USE, SLI/SLO): violation alerting, playbooks, recovery, anomaly detection, capacity.
     Traces: tracing, profiling, RCA.
  6. Cloud Native Systems Are Complex! The diagram spans VPCs (VPC0, VPC1, VPC2), subnets, route
     tables, an IGW, ELB, ASG, NAT, VPN and peering, then descends through EC2, KVM, OS, runtime and
     containers, plus caches, queues, datastores, DR and distributed computing.
  7. Lifecycle of a Request in K8S: ELB (logs, request count) → WAF (attack prevention) →
     API Gateway (rate limiting, service discovery) → Ingress (HA, load balancing) →
     Service (SOA abstraction) → Deployment (orchestration, health) → Pod (code execution).
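     To make the Ingress → Service → Pod hops concrete, here is a minimal sketch in current
     Kubernetes API terms; the names (web, web.example.com) and ports are illustrative, not from the talk:

       apiVersion: networking.k8s.io/v1
       kind: Ingress
       metadata:
         name: web
       spec:
         rules:
           - host: web.example.com              # hypothetical host
             http:
               paths:
                 - path: /
                   pathType: Prefix
                   backend:
                     service:
                       name: web                # routes to the Service below
                       port:
                         number: 80
       ---
       apiVersion: v1
       kind: Service
       metadata:
         name: web
       spec:
         selector:
           app: web                             # Pods created by a Deployment labelled app=web
         ports:
           - port: 80
             targetPort: 8080                   # container port inside the Pod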
  8. Cloud Native + Observability
     Out-of-the-box: health checks, load balancing, failed-service rotation, service discovery,
     reduction in catastrophic failures.
     To-do: small but numerous failures from a myriad of moving parts; centralized logging across
     IaaS, PaaS, OS and application; metric isolation and aggregation across multiple abstraction and
     virtualization layers; capacity planning across 3 to 4 levels of virtualization; distributed
     tracing across 10 to 100s of microservices.
  9. Observation Quality
     Observation Quality = f(System Grain, System Context)
     • Node-level metrics for Cluster AutoScaling ◦ Node Exporter
     • Pod-level metrics for Horizontal Pod AutoScaling ◦ CAdvisor ◦ Metrics Server (see the HPA sketch below)
     • Kubernetes platform metrics to determine the health of the orchestration layer ◦ Kube-state-metrics
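     As a sketch of the pod-level case, a HorizontalPodAutoscaler can scale on CPU utilization served
     by Metrics Server; the target Deployment name and thresholds below are hypothetical:

       apiVersion: autoscaling/v2
       kind: HorizontalPodAutoscaler
       metadata:
         name: web
       spec:
         scaleTargetRef:
           apiVersion: apps/v1
           kind: Deployment
           name: web                    # hypothetical Deployment to scale
         minReplicas: 2
         maxReplicas: 10
         metrics:
           - type: Resource
             resource:
               name: cpu
               target:
                 type: Utilization
                 averageUtilization: 70  # scale out above 70% average CPU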
  10. Logging in K8S with EFK: application Pods (app namespace) write text logs to /var/log/<app>/<pod>
      and stdout logs to /var/lib/docker/containers on the CoreOS VM; a fluentbit DaemonSet (log
      namespace) tails both paths, applies input parsers, and outputs to an ES host, whose indexes back
      Kibana search and dashboards.
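      The collector side of that picture is typically a Fluent Bit DaemonSet mounting the host log
      paths read-only; this is a rough sketch, not the exact manifest from the talk, and the namespace
      and image tag are illustrative:

        apiVersion: apps/v1
        kind: DaemonSet
        metadata:
          name: fluent-bit
          namespace: logging                  # hypothetical "log namespace"
        spec:
          selector:
            matchLabels:
              app: fluent-bit
          template:
            metadata:
              labels:
                app: fluent-bit
            spec:
              containers:
                - name: fluent-bit
                  image: fluent/fluent-bit:1.9      # illustrative tag
                  volumeMounts:
                    - name: varlog
                      mountPath: /var/log
                      readOnly: true
                    - name: containers
                      mountPath: /var/lib/docker/containers
                      readOnly: true
              volumes:
                - name: varlog
                  hostPath:
                    path: /var/log
                - name: containers
                  hostPath:
                    path: /var/lib/docker/containers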
  11. Monitoring in K8S with Prometheus: a node-exporter DaemonSet reads /proc and /sys on each CoreOS
      VM; kube-state-metrics (KSM) and CAdvisor live in the kube-system namespace; Prometheus scrapers
      in the monitoring namespace pull from all of them, and Grafana dashboards are exposed through an
      Ingress.
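      A minimal node-exporter DaemonSet in that spirit (a sketch; the namespace, image tag and flags
      shown are illustrative) mounts /proc and /sys from the host:

        apiVersion: apps/v1
        kind: DaemonSet
        metadata:
          name: node-exporter
          namespace: monitoring
        spec:
          selector:
            matchLabels:
              app: node-exporter
          template:
            metadata:
              labels:
                app: node-exporter
            spec:
              hostNetwork: true                       # expose :9100 on the node itself
              containers:
                - name: node-exporter
                  image: prom/node-exporter:v1.3.1    # illustrative tag
                  args:
                    - --path.procfs=/host/proc
                    - --path.sysfs=/host/sys
                  ports:
                    - containerPort: 9100
                  volumeMounts:
                    - name: proc
                      mountPath: /host/proc
                      readOnly: true
                    - name: sys
                      mountPath: /host/sys
                      readOnly: true
              volumes:
                - name: proc
                  hostPath:
                    path: /proc
                - name: sys
                  hostPath:
                    path: /sys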
  12. Prometheus Scrape jobs

      - job_name: 'traefik'
        metrics_path: "/metrics"
        ec2_sd_configs:
          - region: ap-south-1
            port: 8080
        relabel_configs:
          - source_labels: [__meta_ec2_tag_edge]
            regex: true
            action: keep
          - source_labels: [__meta_ec2_tag_environment]
            regex: stage
            action: keep
  13. K8S: Resource Quotas

      resources:
        limits:            # hard upper bound
          cpu: 500m
          memory: 2500Mi
        requests:          # soft upper bound
          cpu: 100m
          memory: 100Mi
  14. Tracing: Logging + Context
      • Assign a UUID to each request
      • Context = UUID + metadata
      • Next request = payload + context
      • Baggage = Set(K1:V1, K2:V2, ...)
      • Async capture: ◦ timing ◦ events ◦ tags
      • Re-create the call tree from the store, e.g. A calls B and C, and C calls D and E; each hop
        carries the accumulated context (service=A; service=A,B; service=A,C; service=A,C,D;
        service=A,C,E). One concrete propagation format is sketched below.
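      One way to carry that context between services is Zipkin-style B3 headers (an assumption here,
      since the slide does not name a format; the IDs are made up): every outgoing call reuses the
      trace ID and mints a new span ID.

        # Hop A -> C (IDs are illustrative)
        X-B3-TraceId: 463ac35c9f6413ad          # the per-request UUID, constant across hops
        X-B3-SpanId: a2fb4a1d1a96d312           # new span for this hop
        X-B3-ParentSpanId: 0020000000000001     # the span that made the call
        X-B3-Sampled: "1"
        # Hop C -> D reuses the TraceId, gets a fresh SpanId, and its ParentSpanId points at C's
        # span, which is how the call tree is re-created from the store.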
  15. Trade-offs
      Ease of generation: Logs = very easy (RELP); Metrics = easy; Traces = difficult
      Processing overhead: Logs = high by default, needs sampling; Metrics = V3-invariant, time-variant; Traces = benchmarks vary
      Ease of query: Logs = moderate; Metrics = very easy; Traces = easy
      Information quality: Logs = rich, system scope; Metrics = very rich, system scope; Traces = rich, request scope
      Cost effectiveness: Logs = V3-variant, low; Metrics = high (alerts!); Traces = higher cost than logging
      (V3 = Volume, Variety, Velocity)
  16. Cloud Native Decisions
      • Choose a logging provider that caters to your Volume, Variety and Velocity
        ◦ Logging is an OLAP problem
        ◦ Ensure the provider abides by RELP (Reliable Event Logging Protocol)
        ◦ ES has indexing overheads that delay log delivery
      • Prometheus is much better than other TSDBs like Graphite
        ◦ Taggable metrics
        ◦ However, not for long-term storage
        ◦ Export to LTS solutions from Prometheus
        ◦ For short-lived / scheduled jobs, use the PushGateway (see the scrape sketch below)
      • Don’t use Distributed Tracing unless you have 15-20+ microservices
        ◦ You don’t really need a Service Mesh / ESB when you have a set of 5 services
        ◦ Use exception trackers like Sentry before you think ‘tracing’
        ◦ Go with OpenTracing / Zipkin before paying for a SaaS solution
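      A hypothetical Prometheus scrape job for that Pushgateway (the job name and address are
      illustrative); honor_labels keeps the labels pushed by the short-lived job instead of
      overwriting them:

        - job_name: 'pushgateway'
          honor_labels: true                                  # keep labels pushed by batch jobs
          static_configs:
            - targets: ['pushgateway.monitoring.svc:9091']    # illustrative address, default port 9091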
  17. Cloud Native Learnings
      • Complete automation of Observability isn’t a panacea
        ◦ You need humans to debug and architect
        ◦ But:
          ▪ n(team) != n(services)
          ▪ Complex systems evolve with high velocity
          ▪ Knowledge of complex systems evolves amongst practitioners
          ▪ More humans -> more ambiguity -> more errors introduced
      • Automate playbook and dashboard generation (see the provisioning sketch below)
        ◦ Link playbooks to dashboards to reduce MLDT
      • Create a uniform logging / monitoring framework across your services
        ◦ Make each app log the same way and track the same metrics
        ◦ Triaging becomes uniform across the org
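      One possible way to automate the dashboard half (an assumption about tooling, not something the
      talk prescribes) is Grafana's file-based provisioning, which loads generated dashboard JSON from
      a directory at startup:

        # e.g. /etc/grafana/provisioning/dashboards/dashboards.yaml
        apiVersion: 1
        providers:
          - name: 'generated-dashboards'          # illustrative provider name
            folder: 'Services'
            type: file
            options:
              path: /var/lib/grafana/dashboards   # drop generated JSON dashboards here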