[Kubecon Europe 2023] Ingesting 6.5 Tb of Telemetry Data Daily Through Open Telemetry Protocol and Collectors

Presentation video: https://www.youtube.com/watch?v=aDysORX1zIs

This presentation shares how the VTEX observability team moved from a single vendor to a full OpenTelemetry protocol solution that handles 6.5 terabytes of telemetry data per day (logs, system metrics, business metrics, traces, and audit logs). With the CNCF community in mind, this talk shows the entire architecture and its tradeoffs: how to instrument every application inside the company, how to manage OTel Collectors at scale, how to centralize visualization, how to extend collector code, and how to guarantee resiliency. OpenTelemetry allowed VTEX to completely modernize its observability stack with a horizon of at least five years ahead, without requiring any migration of VTEX's applications. With the architecture this talk presents, VTEX can switch backend vendors without impacting instrumented code, allowing the engineering organization to move faster. Last but not least, this solution cut VTEX's observability costs by 40% while enabling a modern, safer, and more efficient way for engineers to observe their applications at scale. If these topics interest you, please come to this presentation. The idea is to give back to the CNCF community what it gave to us: knowledge and cutting-edge solutions.

KubeCon Europe 2023
Tuesday April 18, 2023 11:55 - 12:20 CEST
Hall 7, Room E | Ground Floor | Europe Complex
Observability Day, Project-specific: Observability + Prometheus + OpenMetrics + OpenTelemetry + Fluent Bit

Gustavo Pantuza

April 18, 2023

Transcript

  1. Ingesting 6.5 Tb of Telemetry
    Data Daily Through Open
    Telemetry Protocol and
    Collectors
    Gustavo Pantuza

  2. [image-only slide]

  3. About me
    Personal blog on Computer Science [pt-BR]
    https://blog.pantuza.com
    @gpantuza pantuza

  4. Agenda
    About VTEX platform
    Problem
    Solution + Outcomes
    Architecture
    Resilience

  5. About VTEX
    What kind of system are we observing?

  6. VTEX context
    The Enterprise Digital Commerce Platform
    Build, manage, and deliver profitable ecommerce businesses with more agility and less risk.

  7. VTEX at a glance
    3,200+ active online stores
    38 countries with active customers
    Publicly listed company

  8. Where we are: 18 locations across the globe (Global Platform)
    London, Barcelona, București, Buenos Aires, Santiago, Bogotá, Ciudad de México, New York, João Pessoa, Medellín, Milan, Rio de Janeiro, São Paulo, Lima, Lisbon, Singapore, Recife, Paris
    1.3k employees
    As of Q2/21, ended on June 30th, 2021

  9. Problem
    Where is the problem within this context?

  10. Problem

  11. Problem
    Inefficient ingestion control
    HTTP 1.1 with no encryption
    Unstructured telemetry data
    No common fields by design
    Too many library implementations
    Telemetry data governance
    Inefficient way to store KPIs
    Single vendor for different telemetry signals

  12. Problem
    How to evolve to a long-term o11y stack, without vendor lock-in, while improving o11y efficiency?

  13. Solution + Outcomes
    How did we solve the problem and what are the outcomes?

  14. Solution + Outcomes

  15. Solution + Outcomes
    Open Telemetry Protocol on every possible layer
    Common libraries with the same interfaces
    Open Telemetry Collectors as telemetry ingestors
    Different data sinks for different telemetry signals
    Sharded architecture by telemetry signal

  16. Solution + Outcomes
    41% reduction in Observability investments
    Long Term: an architecture that does not require developers to migrate their applications if we change o11y vendors
    6.5 Tb of telemetry data ingested per day, with control knobs such as dynamic sampling
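    For a sense of scale, the daily volume implies a sustained ingest rate of roughly 75 MB/s. A quick back-of-the-envelope check, assuming decimal terabytes and perfectly uniform traffic (which real traffic is not):

```python
# Rough sustained ingest rate implied by 6.5 TB/day.
# Assumes TB = 10^12 bytes and uniform load (an idealization).
bytes_per_day = 6.5e12
mb_per_second = bytes_per_day / 86_400 / 1e6  # seconds per day, bytes per MB
print(f"{mb_per_second:.0f} MB/s sustained")  # → 75 MB/s sustained
```

    Peaks will sit well above this average, which is why the later slides on autoscaling and sampling matter.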

  17. Architecture
    What is the
    project design?
    01

  18. Architecture
    Shared library (Diagnostics)
    Resiliency control
    Asynchronous requests
    Head-based sampling
    Facilitates the migration

  19. Architecture
    Shared library (Diagnostics)
    gRPC communication
    Encryption
    Common methods interface

    /**
     * VTEX's Telemetry methods
     */
    service Telemetry {
      /* Logs related methods */
      rpc Info(LogRequest) returns (google.protobuf.Empty);
      rpc Warn(LogRequest) returns (google.protobuf.Empty);
      rpc Error(LogRequest) returns (google.protobuf.Empty);
      rpc Debug(LogRequest) returns (google.protobuf.Empty);

      /* Metrics related methods */
      rpc SystemMetric(Metric) returns (google.protobuf.Empty);
      rpc BusinessMetric(Metric) returns (google.protobuf.Empty);

      /* Traces related methods */
      rpc Trace(Trace) returns (google.protobuf.Empty);
    }
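    To make the interface concrete, here is a minimal in-memory stand-in for the service above, as a Python sketch. The method names mirror the slide's proto; the dataclass and the list-backed sink are illustrative stand-ins for the real gRPC messages and transport:

```python
# In-memory stand-in for the Telemetry service sketched in the proto.
# Method names (Info, Warn, Error, Debug, SystemMetric, BusinessMetric,
# Trace) come from the slide; everything else is hypothetical.
from dataclasses import dataclass, field

@dataclass
class LogRequest:
    message: str
    extra_fields: dict = field(default_factory=dict)

class InMemoryTelemetry:
    """Records every call instead of shipping it over gRPC."""
    def __init__(self):
        self.records = []

    def _record(self, kind, payload):
        self.records.append((kind, payload))

    # Logs related methods
    def Info(self, req): self._record("info", req)
    def Warn(self, req): self._record("warn", req)
    def Error(self, req): self._record("error", req)
    def Debug(self, req): self._record("debug", req)
    # Metrics related methods
    def SystemMetric(self, metric): self._record("system_metric", metric)
    def BusinessMetric(self, metric): self._record("business_metric", metric)
    # Traces related methods
    def Trace(self, trace): self._record("trace", trace)

client = InMemoryTelemetry()
client.Info(LogRequest("checkout started", {"order_id": "123"}))
client.SystemMetric({"name": "cpu_usage", "value": 0.42})
```

    The value of a single common interface is that every application talks to the same shared-library surface, regardless of which backend the collectors export to.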

  20. Architecture
    Shared library (Diagnostics)
    Structured logs by design
    Common fields

    /**
     * Common Fields on telemetry data
     */
    message Common {
      ...
      /* Name of the service that is sending telemetry data */
      string service_name = 1;

      /* Instance hostname */
      string instance_id = 2;

      /* Instance Availability Zone */
      string az = 3;

      /* Instance region */
      string region = 4;

      /* Optional hash table allowing users to send extra fields */
      map<string, string> extra_fields = 5;
      ...
    }
    Instrumented with Open
    Telemetry official libraries
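    The shared library's job here can be sketched as a small enrichment step that stamps the common fields above onto every outgoing record. Field names follow the slide's proto; the `enrich` helper and the example values are hypothetical:

```python
# Sketch: stamp the common fields onto a telemetry payload before export.
# Field names mirror the Common message on the slide; enrich() is illustrative.
def enrich(payload, service_name, instance_id, az, region, extra_fields=None):
    common = {
        "service_name": service_name,
        "instance_id": instance_id,
        "az": az,
        "region": region,
        "extra_fields": extra_fields or {},
    }
    # Common fields win over caller-supplied keys: that is what
    # "structured logs by design" guarantees downstream.
    return {**payload, **common}

record = enrich({"message": "payment approved"},
                service_name="checkout", instance_id="i-0abc",
                az="us-east-1a", region="us-east-1")
```

    Because every record carries the same fields, downstream dashboards and alerts can be written once and reused across services.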

  21. Architecture
    Single telemetry signal architecture

  22. Architecture
    Custom Collectors
    Built internally with the OpenTelemetry Collector Builder (ocb)
    Extended with our modules
    Different configurations per telemetry signal

  23. Architecture
    Custom Collectors
    Built internally
    Extended with our modules
    Different configurations per telemetry signal

    builder.yaml:

    dist:
      name: otelcol
      description: OpenTelemetry Collector
      version: 0.xx.y
      otelcol_version: 0.xx.y
    receivers:
      - ...
    exporters:
      - ...
    extensions:
      - ...
    processors:
      - ...

  24. Architecture
    Telemetry data flow

  25. Architecture
    Data visualization
    Central place for dashboards
    Reusable dashboards
    Specialized visualizations on each data sink
    Teams governance

  26. Architecture

  27. Architecture
    4 terabytes of logs per day
    150 million active time series
    2.15 billion events ingested per day

  28. Resilience
    How do we keep this system up and running?

  29. “Fail locally,
    not globally”
    Resilience

  30. Resilience

  31. Resilience
    Horizontal Pod auto-scaling
    Burst requests on a single deployment can significantly increase collector load.

    One example for a single Deployment:

    autoscaling:
      enabled: true
      minReplicas: 10
      maxReplicas: 20
      targetMemoryUtilizationPercentage: 90
      targetCPUUtilizationPercentage: 60
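    A short sketch of the scaling rule Kubernetes' HPA applies to a deployment like this one (min 10, max 20 replicas, 60% CPU target). The formula is the standard HPA one, desired = ceil(current × usage/target), clamped to the replica bounds; the usage numbers below are illustrative:

```python
# Standard Kubernetes HPA scaling rule, clamped to the slide's bounds.
import math

def desired_replicas(current, usage_pct, target_pct, min_r=10, max_r=20):
    # desired = ceil(currentReplicas * currentMetric / targetMetric)
    desired = math.ceil(current * usage_pct / target_pct)
    return max(min_r, min(max_r, desired))

# A burst pushing average CPU to 120% of requests on 10 pods:
print(desired_replicas(10, 120, 60))  # → 20 (hits maxReplicas)
```

    The memory target (90%) works the same way; the HPA takes the larger of the two recommendations.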

  32. Resilience
    Sharding strategy
    Sharded environments by business criteria, such as core systems vs internal services
    [Grid: Logs, Metrics, and Traces pipelines across Shard 0 to Shard 3]
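    The routing implied by that grid can be sketched as a small lookup: each (signal, business tier) pair maps to its own shard, so an outage stays local to one cell. The shard numbers and routing table below are hypothetical:

```python
# Illustrative shard router: pipelines split per telemetry signal,
# environments sharded by business criteria (core vs internal).
SHARDS = {
    ("logs", "core"): 0,
    ("logs", "internal"): 1,
    ("metrics", "core"): 2,
    ("traces", "core"): 3,
}

def route(signal, tier):
    """Return the shard id for a signal/tier pair, failing loudly otherwise."""
    try:
        return SHARDS[(signal, tier)]
    except KeyError:
        raise ValueError(f"no shard for {signal}/{tier}")

shard = route("logs", "core")  # core-system logs land on shard 0
```

    This is the "fail locally, not globally" idea in data form: losing one shard degrades one signal for one slice of the business, not the whole platform.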

  33. Resilience
    Custom Open Telemetry Collector Processor
    Default sampling
    Sampling % per service
    Skip sampling
    Real-time sampling (tail-based sampling)

    settings:
      default_sampling_percentage: 25
      skip_sampling_field: debug
    services_config:
      sampling:
        - name: service_0
          index: service_0
          percentage: 75
        - name: service_1
          index: service_1
          percentage: 0
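    A sketch of the sampling decision such a processor could derive from a config like the one above: a default percentage, per-service overrides, and a field that bypasses sampling. The keys mirror the slide's YAML; the CRC32 bucketing scheme is an assumption, chosen to keep decisions deterministic per trace id:

```python
# Sampling decision sketch for a custom collector processor.
# Config keys mirror the slide; the hashing scheme is an assumption.
import zlib

CONFIG = {
    "default_sampling_percentage": 25,
    "skip_sampling_field": "debug",
    "services": {"service_0": 75, "service_1": 0},
}

def keep(record, config=CONFIG):
    # Records carrying the skip field bypass sampling entirely.
    if record.get(config["skip_sampling_field"]):
        return True
    pct = config["services"].get(record["service"],
                                 config["default_sampling_percentage"])
    # Deterministic bucket in [0, 100) derived from the trace id, so all
    # spans of one trace get the same keep/drop decision.
    bucket = zlib.crc32(record["trace_id"].encode()) % 100
    return bucket < pct

dropped = keep({"service": "service_1", "trace_id": "a1b2"})  # 0% → dropped
```

    Because the percentages live in collector config rather than application code, sampling can be tuned in real time without redeploying services.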

  34. Resilience
    Custom OpenSearch exporter
    Write Ahead Log (WAL)
    Data backfill
    One-week data expiration
    AWS S3 + Lambda functions

    wal:
      s3_region: some-aws-region
      s3_bucket: "wal-bucket-name"
      cluster: shard0
      flush:
        bytes: 10000000
        interval: 120
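    A minimal sketch of the flush policy this config describes: buffer writes and flush once either the byte threshold (10 MB) or the time interval (120 s) is crossed. The class and method names are hypothetical, and the in-memory list stands in for the real exporter's PUT to the configured S3 bucket:

```python
# WAL buffer sketch: flush on byte threshold or elapsed interval,
# whichever comes first. The S3 upload is stubbed as a list append.
import time

class WalBuffer:
    def __init__(self, flush_bytes=10_000_000, flush_interval=120,
                 clock=time.monotonic):
        self.flush_bytes = flush_bytes
        self.flush_interval = flush_interval
        self.clock = clock
        self.buffer = []
        self.size = 0
        self.last_flush = clock()
        self.flushed = []  # stands in for objects uploaded to s3_bucket

    def append(self, record):
        self.buffer.append(record)
        self.size += len(record)
        if (self.size >= self.flush_bytes
                or self.clock() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flushed.append(b"".join(self.buffer))  # would PUT to S3
        self.buffer, self.size = [], 0
        self.last_flush = self.clock()

wal = WalBuffer(flush_bytes=10)   # tiny threshold for illustration
wal.append(b"12345")              # below both thresholds, stays buffered
wal.append(b"678901")             # crosses 10 bytes, triggers a flush
```

    With the batches durable in S3, the Lambda-driven backfill can replay them into OpenSearch after an outage, within the one-week expiration window.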

  35. Resilience
    Alerting
    We monitor all collectors from all pipelines and shards. Based on this monitoring, we trigger alerts that page the team's on-call engineer.

  36. Migration tips
    RFC-like process to engage engineering teams
    C-level buy-in on the project
    Understand your client (speak to your teams)
    Engage your vendors on the resiliency plan
    Find early adopters; one step at a time

  37. Recap
    VTEX Context: we saw the multi-tenant architecture
    Problem: we understood the problem of using a single vendor for different telemetry data
    Solution + Outcomes: before jumping into details, we saw the overall approach and direct outcomes
    Architecture: then we dove into the details of the long-term architecture and system design
    Resilience: finally, we discussed strategies to avoid global outages and how to monitor the entire architecture

  38. The Open Telemetry ecosystem enabled VTEX to innovate fast and efficiently

  39. Personal blog on Computer Science [pt-BR]
    https://blog.pantuza.com
    @gpantuza pantuza
