
Building a Multi-Tenant Observability Pipeline with OpenTelemetry

A walkthrough of a multi-tenant observability pipeline using OpenTelemetry, Jaeger, Prometheus, Kafka, and Cassandra

Joy Bhattacherjee

November 12, 2020



  1. The Three Pillars, a Taxonomy
     • Logs: Plaintext, Structured, Binary
     • Metrics: RED, USE, SLI/SLO violation → Alerting, Playbooks, Recovery
     • Traces: Exception Handling, Debugging, Profiling, RCA, Audit, Anomaly, Capacity
  2. Distributed Tracing: Logging + Context
     • Assign a UUID to each request
     • Context = UUID + metadata
     • Next request = Payload + Context
     • Baggage = Set(K1:V1, K2:V2, ...)
     • Async capture: Timing, Events, Tags
     • Re-create the call tree from the store
     [Call-tree diagram: services A–E, each span carrying the accumulated service path, e.g. service = A; service = A, C; service = A, C, D; service = A, C, E]
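The propagation scheme above can be sketched in plain Python, with no tracing library involved (`new_context`, `call`, `call_tree`, and the in-memory `SPAN_STORE` are illustrative names, not a real API):

```python
import uuid

SPAN_STORE = []  # stands in for the tracing backend's span store

def new_context(metadata=None):
    # Assign a UUID to the incoming request; context = UUID + metadata + baggage.
    return {"trace_id": str(uuid.uuid4()), "metadata": metadata or {}, "baggage": {}}

def call(service, ctx, parent=None):
    # Each downstream request carries the payload plus the same context,
    # so every captured span shares the request's trace_id.
    span = {"service": service, "trace_id": ctx["trace_id"], "parent": parent}
    SPAN_STORE.append(span)
    return span

def call_tree(trace_id):
    # Re-create the call tree for one request from the span store:
    # map each service to the service that called it.
    spans = [s for s in SPAN_STORE if s["trace_id"] == trace_id]
    return {s["service"]: s["parent"] for s in spans}

ctx = new_context({"user": "alice"})
a = call("A", ctx)
c = call("C", ctx, parent="A")
call("D", ctx, parent="C")
print(call_tree(ctx["trace_id"]))  # {'A': None, 'C': 'A', 'D': 'C'}
```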
  3. Why We Chose OpenTelemetry
     • Three Pillars under one roof
     • Vendor-neutral data format
     • Easy interoperability
     • Plugin system
     • Ability to do arbitrary processing of data without touching other components
       ◦ e.g. a custom trace processor to generate metrics
  4. Collector components (diagram legend: green = core components, red = contrib components)
     • The codebase needs to be re-compiled if you want to include assorted contrib components
     • Only pre-compiled components can later be referenced in a pipeline
  5. • Stage 01:
     ◦ Client Data Emulation
       ▪ Load-generator
     ◦ Client-side Agent
       ▪ Otel-agent
     ◦ Server-side Consumer and Data Transformer
       ▪ Otel-collector
     ◦ Server-side Data Sink
       ▪ We’ll get to this...
  6. Pipeline definition

     receivers:
       opencensus:
       zipkin:
         endpoint:
       jaeger:
         protocols:
           thrift_http:
       prometheus:
         config:
           scrape_configs:
             - job_name: 'load_generator_app'
               scrape_interval: 3s
               static_configs:
                 - targets: ['load-generator:9001']
     exporters:
       opencensus:
         endpoint: "otel-collector:55678"
         insecure: true
       logging:
         loglevel: debug
     processors:
       batch:
       queued_retry:
     service:
       pipelines:
         traces:
           receivers: [opencensus, jaeger, zipkin]
           processors: [batch, queued_retry]
           exporters: [opencensus, logging]
         metrics:
           receivers: [opencensus, prometheus]
           exporters: [logging, opencensus]
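In the service section, every receiver in a pipeline feeds the processor chain in order, and the result fans out to every exporter. A toy model of that dispatch (the `Pipeline` class and no-op components below are illustrative, not collector APIs):

```python
class Pipeline:
    """Toy model of a collector pipeline: receivers -> processors -> exporters."""

    def __init__(self, receivers, processors, exporters):
        self.receivers = receivers
        self.processors = processors
        self.exporters = exporters

    def emit(self, receiver, item):
        # Only configured (pre-compiled) receivers may feed the pipeline.
        if receiver not in self.receivers:
            raise ValueError(f"receiver {receiver!r} not in this pipeline")
        # Processors run in the order they are listed.
        for proc in self.processors:
            item = proc(item)
        # Every exporter gets the processed item.
        return [(exp.__name__, item) for exp in self.exporters]

def batch(item):             # stand-in for the `batch` processor
    return item

def logging_exporter(item):  # stand-in for the `logging` exporter
    return item

traces = Pipeline(["opencensus", "jaeger", "zipkin"], [batch], [logging_exporter])
print(traces.emit("jaeger", {"span": "GET /"}))
```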
  7. Tail Sampling

     processors:
       tail_sampling:
         decision_wait: 10s
         num_traces: 100
         expected_new_traces_per_sec: 10
         policies:
           [
             {name: sampleNoErrors, type: numeric_attribute,
              numeric_attribute: {key: status.code, min_value: 0, max_value: 0}},
             {name: sample200, type: string_attribute,
              string_attribute: {key: http.status_code, values: ["200"]}},
             {name: ratelimit35, type: rate_limiting,
              rate_limiting: {spans_per_second: 35}}
           ]
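Tail sampling buffers each trace's spans, waits `decision_wait` so the trace is complete, then applies the policies. A simplified decision function for the attribute policies above (the policy dict shapes mirror the config; the Python itself is an illustrative sketch, not the collector's implementation):

```python
def sample(trace_attrs, policies):
    """Return the name of the first matching policy, or None to drop the trace."""
    for p in policies:
        if p["type"] == "numeric_attribute":
            a = p["numeric_attribute"]
            v = trace_attrs.get(a["key"])
            # Keep the trace if the attribute is inside [min_value, max_value].
            if v is not None and a["min_value"] <= v <= a["max_value"]:
                return p["name"]
        elif p["type"] == "string_attribute":
            a = p["string_attribute"]
            # Keep the trace if the attribute matches one of the listed values.
            if str(trace_attrs.get(a["key"])) in a["values"]:
                return p["name"]
    return None

policies = [
    {"name": "sampleNoErrors", "type": "numeric_attribute",
     "numeric_attribute": {"key": "status.code", "min_value": 0, "max_value": 0}},
    {"name": "sample200", "type": "string_attribute",
     "string_attribute": {"key": "http.status_code", "values": ["200"]}},
]

print(sample({"status.code": 0}, policies))         # sampleNoErrors
print(sample({"http.status_code": 200}, policies))  # sample200
print(sample({"status.code": 2}, policies))         # None
```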
  8. • Stage 02:
     ◦ Per-tenant data sinks
       ▪ Metrics: Prometheus CR
       ▪ Traces: Jaeger CR, streaming mode with Kafka
     ◦ Long-term storage (common)
       ▪ Metrics: Cortex, backed by Cassandra
       ▪ Traces: Jaeger streaming with Cassandra backend
     ◦ Unified data sink
       ▪ Cassandra keyspaces: tenant-N-trace, tenant-N-metrics
       ▪ Now, we can build a query API layer
     • Requirements
       ◦ Prometheus Operator
       ◦ Jaeger Operator
       ◦ Kafka Operator
       ◦ Cassandra: bitnami-cassandra or Scylla-operator
       ◦ Cortex: Consul
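The per-tenant keyspaces could be provisioned along these lines (a sketch; the replication settings are assumptions, and note that unquoted Cassandra identifiers can't contain hyphens, so the `tenant-N-trace` naming from the slide becomes underscores here):

```sql
-- Hypothetical per-tenant keyspaces; replication strategy/factor are assumptions.
CREATE KEYSPACE IF NOT EXISTS tenant_1_trace
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
CREATE KEYSPACE IF NOT EXISTS tenant_1_metrics
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
```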
  9. • Kubernetes Namespace-Level Multi-tenancy
     ◦ Each client’s data flows through their own namespace
     ◦ Only the final data-sink clusters are shared
       ▪ Isolation based on separate data stores that can only be accessed with auth headers
     • Isolation implementation
     ◦ Keep the data consumer and exporter data sinks in one isolated tenant namespace per client
     ◦ Secure with Kubernetes RBAC and Pod Security Policies
     ◦ Ensure cross-namespace data scrapes don’t happen on Prometheus, using tenant labels and Network Policies
     [Diagram: per-tenant namespace (RBAC: {SA, Role, RoleBinding}, PSP, NetPols) containing Prometheus, Otel collector, Jaeger, and a Kafka topic; a common data-pipeline namespace with Kafka, Cortex, and Cassandra holding per-tenant metrics and traces keyspaces]
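The cross-namespace scrape restriction could look like the NetworkPolicy below (a sketch; the namespace `tenant-acme`, the labels, and the port are hypothetical names, not taken from the deck):

```yaml
# Hypothetical policy for one tenant namespace: only pods in the same
# namespace may reach that tenant's Prometheus on its scrape/query port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-prometheus-access
  namespace: tenant-acme
spec:
  podSelector:
    matchLabels:
      app: prometheus
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector: {}   # empty selector = all pods, but same namespace only
      ports:
        - port: 9090
          protocol: TCP
```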
  10. References
      • https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36356.pdf
      • https://opensource.googleblog.com/2018/01/opencensus.html
      • https://blog.twitter.com/engineering/en_us/a/2012/distributed-systems-tracing-with-zipkin.html
      • https://eng.uber.com/distributed-tracing/
      • https://medium.com/opentracing/towards-turnkey-distributed-tracing-5f4297d1736
      • https://medium.com/@AloisReitbauer/trace-context-and-the-road-toward-trace-tool-interoperability-d4d56932369c
      • https://medium.com/opentracing/merging-opentracing-and-opencensus-f0fe9c7ca6f0
      • https://github.com/open-telemetry/opentelemetry-collector
      • https://github.com/open-telemetry/opentelemetry-specification
      • https://pastebin.com/8dYNk0sR