Slide 1

Slide 1 text

Ingesting 6.5 TB of Telemetry Data Daily Through the OpenTelemetry Protocol and Collectors
Gustavo Pantuza

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

About me
Personal blog on Computer Science [pt-BR]: https://blog.pantuza.com
@gpantuza
pantuza

Slide 4

Slide 4 text

Agenda
About the VTEX platform
Problem
Solution + Outcomes
Architecture
Resilience

Slide 5

Slide 5 text

About VTEX
What kind of system are we observing?

Slide 6

Slide 6 text

VTEX context
The Enterprise Digital Commerce Platform: build, manage, and deliver profitable ecommerce businesses with more agility and less risk.

Slide 7

Slide 7 text

VTEX at a glance
3,200+ active online stores
38 countries with active customers
Publicly listed company

Slide 8

Slide 8 text

Where we are
Global platform with 18 locations across the globe and 1.3k employees (as of Q2/21, ended on June 30th, 2021).
[Map: London, Barcelona, București, Paris, Lisbon, Milan, New York, Ciudad de México, Bogotá, Medellín, Lima, Santiago, Buenos Aires, São Paulo, Rio de Janeiro, João Pessoa, Recife, Singapore]

Slide 9

Slide 9 text

Problem
Where is the problem within this context?

Slide 10

Slide 10 text

Problem

Slide 11

Slide 11 text

Problem
Inefficient ingestion control
HTTP 1.1 with no encryption
Unstructured telemetry data
No common fields by design
Too many library implementations
Telemetry data governance
Inefficient way to store KPIs
Single vendor for different telemetry signals

Slide 12

Slide 12 text

Problem
How to evolve to a long-term o11y stack without vendor lock-in while improving o11y efficiency?

Slide 13

Slide 13 text

Solution + Outcomes
How did we solve the problem and what are the outcomes?

Slide 14

Slide 14 text

Solution + Outcomes

Slide 15

Slide 15 text

Solution + Outcomes
OpenTelemetry Protocol on every possible layer
Common libraries with the same interfaces
OpenTelemetry Collectors as the telemetry ingestor
Different data sinks for different telemetry signals (sketched below)
Sharded architecture by telemetry signal
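To make "different data sinks for different telemetry signals" concrete, below is a minimal OpenTelemetry Collector configuration sketch: OTLP over gRPC in, one pipeline and one exporter per signal. The otlp receiver, batch processor, and the otlphttp and prometheusremotewrite exporters are standard collector components, but the endpoints are illustrative and this is not VTEX's actual configuration (the deck later describes a custom OpenSearch exporter for logs).

# Minimal sketch: one pipeline and one sink per telemetry signal
# (endpoints are illustrative placeholders)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  otlphttp/traces:
    endpoint: https://traces-sink.example.com
  otlphttp/logs:
    endpoint: https://logs-sink.example.com
  prometheusremotewrite:
    endpoint: https://metrics-sink.example.com/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/traces]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/logs]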

Slide 16

Slide 16 text

Solution + Outcomes
41% reduction in observability investments
Long-term architecture that does not require developers to migrate their applications if we change o11y vendors
6.5 TB of telemetry data ingested per day, with control knobs such as dynamic sampling

Slide 17

Slide 17 text

Architecture
What is the project design?

Slide 18

Slide 18 text

Architecture
Shared library (Diagnostics)
Resiliency control
Asynchronous requests
Head-based sampling (see the sketch below)
Facilitates the migration
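As a rough illustration of the head-based sampling knob, the snippet below uses the standard OpenTelemetry SDK environment variables for a parent-based, trace-ID-ratio sampler. This is a generic sketch of the concept, not the Diagnostics library's actual interface; the container name is hypothetical.

# Hypothetical Kubernetes container spec fragment; OTEL_TRACES_SAMPLER and
# OTEL_TRACES_SAMPLER_ARG are the standard SDK knobs (keep ~25% of new traces at the head)
containers:
  - name: my-service
    env:
      - name: OTEL_TRACES_SAMPLER
        value: parentbased_traceidratio
      - name: OTEL_TRACES_SAMPLER_ARG
        value: "0.25"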

Slide 19

Slide 19 text

Architecture
Shared library (Diagnostics)
gRPC communication
Encryption
Common methods interface

/**
 * VTEX's Telemetry methods
 */
service Telemetry {

  /* Logs related methods */
  rpc Info(LogRequest) returns (google.protobuf.Empty);
  rpc Warn(LogRequest) returns (google.protobuf.Empty);
  rpc Error(LogRequest) returns (google.protobuf.Empty);
  rpc Debug(LogRequest) returns (google.protobuf.Empty);

  /* Metrics related methods */
  rpc SystemMetric(Metric) returns (google.protobuf.Empty);
  rpc BusinessMetric(Metric) returns (google.protobuf.Empty);

  /* Traces related methods */
  rpc Trace(Trace) returns (google.protobuf.Empty);
}

Slide 20

Slide 20 text

Architecture
Shared library (Diagnostics)
Structured logs by design
Common fields
Instrumented with the official OpenTelemetry libraries

/**
 * Common fields on telemetry data
 */
message Common {
  ...

  /* Name of the service that is sending telemetry data */
  string service_name = 1;

  /* Instance hostname */
  string instance_id = 2;

  /* Instance availability zone */
  string az = 3;

  /* Instance region */
  string region = 4;

  /* Optional hash table allowing users to send extra fields */
  map<string, string> extra_fields = 5;

  ...
}

Slide 21

Slide 21 text

Architecture Single telemetry signal architecture

Slide 22

Slide 22 text

Architecture
Custom Collectors
OpenTelemetry Collector Builder (ocb)
Extended with our modules
Built internally
Different configurations per telemetry signal

Slide 23

Slide 23 text

Architecture
Custom Collectors
builder.yaml
Extended with our modules
Built internally
Different configurations per telemetry signal

dist:
  name: otelcol
  description: OpenTelemetry Collector
  version: 0.xx.y
  otelcol_version: 0.xx.y

receivers:
  - ...

exporters:
  - ...

extensions:
  - ...

processors:
  - ...
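For reference, each component entry in an ocb manifest is a Go module reference. The lines below are a hedged sketch using well-known upstream module paths; the versions are intentionally left as the slide's 0.xx.y placeholder, and VTEX's internal modules would be listed the same way.

# Illustrative ocb component entries (module choices are assumptions, not VTEX's list)
receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.xx.y
processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.xx.y
exporters:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter v0.xx.y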

Slide 24

Slide 24 text

Architecture Telemetry data flow

Slide 25

Slide 25 text

Architecture
Data visualization
Reusable dashboards
Central place for dashboards
Specialized visualizations on each data sink
Teams governance

Slide 26

Slide 26 text

Architecture

Slide 27

Slide 27 text

Architecture
4 terabytes of logs per day
150 million active time series
2.15 billion events ingested per day

Slide 28

Slide 28 text

Resilience
How do we keep this system up and running?

Slide 29

Slide 29 text

Resilience
“Fail locally, not globally”

Slide 30

Slide 30 text

Resilience

Slide 31

Slide 31 text

Resilience
Horizontal Pod autoscaling
Burst requests on a single deployment can significantly increase collector load.
One example for a single Deployment:

autoscaling:
  enabled: true
  minReplicas: 10
  maxReplicas: 20
  targetMemoryUtilizationPercentage: 90
  targetCPUUtilizationPercentage: 60
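The block above reads like a Helm-chart autoscaling value; a roughly equivalent Kubernetes HorizontalPodAutoscaler manifest is sketched below. The otelcol-logs deployment name is hypothetical.

# Sketch of the equivalent autoscaling/v2 object (names are illustrative)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otelcol-logs
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otelcol-logs
  minReplicas: 10
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 90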

Slide 32

Slide 32 text

Resilience
Sharding strategy
Environments sharded by business criteria, such as core systems vs. internal services.
[Diagram: Shard 0 through Shard 3 across the Logs, Metrics, and Traces pipelines]
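One simple way to picture the sharding is from the application side: each service exports OTLP to the collector endpoint of its assigned shard. The snippet below is a hypothetical sketch using the standard OTEL_EXPORTER_OTLP_ENDPOINT variable; the service names, hostnames, and shard assignments are assumptions, not VTEX's actual setup.

# Hypothetical per-service env fragments: a core system goes to shard 0,
# an internal service goes to shard 3
core-checkout-service:
  env:
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: "http://otel-collector.shard-0.internal:4317"
internal-reporting-service:
  env:
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: "http://otel-collector.shard-3.internal:4317"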

Slide 33

Slide 33 text

Resilience
Custom OpenTelemetry Collector processor
Default sampling
Sampling % per service
Skip sampling
Real-time sampling (tail-based sampling)

settings:
  default_sampling_percentage: 25
  skip_sampling_field: debug
services_config:
  sampling:
    - name: service_0
      index: service_0
      percentage: 75
    - name: service_1
      index: service_1
      percentage: 0

Slide 34

Slide 34 text

Resilience
Custom OpenSearch exporter
Write Ahead Log (WAL)
Data backfill
AWS S3 + Lambda functions
One-week data expiration

wal:
  s3_region: some-aws-region
  s3_bucket: "wal-bucket-name"
  flush:
    bytes: 10000000
    interval: 120
  cluster: shard0

Slide 35

Slide 35 text

Resilience
Alerting
We monitor all collectors across all pipelines and shards. Based on this monitoring, we trigger alerts that page the team's on-call engineer.
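As an example of what such an alert could look like, below is a hedged Prometheus alerting-rule sketch on the collectors' own telemetry. It assumes the collectors' internal metrics (such as otelcol_exporter_send_failed_log_records) are scraped by Prometheus; the rule name, threshold, and labels are illustrative.

# Illustrative Prometheus rule: page when a collector keeps failing to export log records
groups:
  - name: otel-collector
    rules:
      - alert: CollectorExportFailures
        expr: rate(otelcol_exporter_send_failed_log_records[5m]) > 0
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "OpenTelemetry Collector is failing to export log records"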

Slide 36

Slide 36 text

Migration tips
RFC-like process to engage engineering teams
C-level buy-in on the project
Understand your client (speak to your teams)
Engage your vendors on the resiliency plan
Find early adopters
One step at a time

Slide 37

Slide 37 text

Recap
VTEX context: we saw the multi-tenant architecture
Problem: we understood the problem of using a single vendor for different telemetry data
Solution + Outcomes: before jumping into details, we saw the overall approach and direct outcomes
Architecture: then we jumped into the details of the long-term architecture and system design
Resilience: finally, we discussed strategies to avoid global outages and how to monitor the entire architecture

Slide 38

Slide 38 text

The OpenTelemetry ecosystem enabled VTEX to innovate fast and efficiently

Slide 39

Slide 39 text

Personal blog on Computer Science [pt-BR]: https://blog.pantuza.com
@gpantuza
pantuza