Handling High Cardinality in Observability

I gave this talk at the "Grafana and Friends" meetup in Bengaluru on 16th September.

Meetup link - https://www.meetup.com/grafana-and-friends-bengaluru/events/295400707/

Points discussed on handling high cardinality:

- Relabel
- Drop Labels
- Split Metrics
- Streaming Aggregations - https://docs.last9.io/docs/streaming-aggregations
- Scale-Out
- Moving high cardinality metrics to separate lake
- Better Workflows like isolation, controls

Learn more about how we solve the cardinality challenges in monitoring and observability systems at - https://last9.io/levitate-tsdb/

Last9's Managed Prometheus solution

What is High Cardinality

Prometheus Cardinality

Streaming Aggregations vs. Recording Rules

How we tame high cardinality in Levitate

How we tame high cardinality with Levitate

Prometheus Downsampling

Prathamesh Sonpatki

September 17, 2023

Transcript

  1. Handling High Cardinality
    in Observability
    Prathamesh Sonpatki
    Last9.io

  2. There is a company!
    Who haz customers
    Customers haz campaigns
    Each Campaign → sends notifications
    Each campaign haz multiple destinations
    Each campaign is deployed in a region

  3. 👔 Business Ask
    What is the performance of each campaign in the
    ap-south-1 region for the WhatsApp channel? 🤔

  4. 🙂 Customer Success Ask
    What is the performance for all campaigns of
    customer Acme Inc? 🤔

  5. Product Ask
    What is the performance across all campaigns
    in the ap-south-1 region? 🤔

  6. 🛠 Engineering Ask
    What is the API performance for top 50
    accounts? 🤔

  7. Differing Questions from different personas
    - Business cares about tenancy, campaigns, device types, geos and
    SLAs.
    - Product cares about tenants and channels.
    - Application developers care about services, SLOs and performance.
    - Infrastructure engineers care about infrastructure provisioning per instance
    and spend per tenant/campaign/channel.

  8. Differing Questions from different personas
    - Business
    - Product
    - Application
    - Infrastructure
    Data increases!

  9. But who can answer all of these questions?
    - Business
    - Product
    - Application
    - Infrastructure
    Monitoring Systems

  10. Who can answer all of the questions?
    Maybe a Time Series Database?

  11. Metrics Explained - As a Cube

  12. Metrics Explained - As a Row

  13. Anatomy of a Metric
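
A minimal sketch of what a single metric sample carries, assuming a Prometheus-style data model (a metric name, a set of key/value labels, a timestamp and a value). The `Sample` class, the metric name and the field values are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One data point of a time series (hypothetical illustration)."""
    name: str            # metric name
    labels: tuple        # (key, value) pairs that identify the series
    timestamp_ms: int    # when the value was observed
    value: float         # the measurement itself

    def series_id(self) -> str:
        """Name plus labels identify a series; timestamp and value do not."""
        body = ",".join(f'{k}="{v}"' for k, v in self.labels)
        return f"{self.name}{{{body}}}"


sample = Sample(
    name="notifications_sent_total",
    labels=(("channel", "whatsapp"), ("region", "ap-south-1")),
    timestamp_ms=1694850000000,
    value=42.0,
)
print(sample.series_id())
```

Every distinct combination of name and label values is a separate series, which is why labels drive cardinality.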

  14. Why Metrics
    - Aggregated
    - Cheaper
    - Can answer all the questions from Infra to Product to Business
    - Real Time
    - Monitoring instead of Debugging
    - Trend Analysis
    - High Level Overview of your subsystems

  15. What is the performance of each campaign in the
    ap-south-1 region for the WhatsApp channel? 🤔

  16. Labels needed
    - campaign_id
    - tenant_id
    - tenant_name
    - channel
    - region
    - …

  17. Cardinality
    - The number of unique combinations of values across campaign_id, tenant_id,
    tenant_name, channel and region
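
As a rough sketch of this definition, the snippet below counts the distinct label combinations in a handful of made-up samples and also computes the worst-case upper bound (the product of the distinct value counts of each label). The sample data and function names are hypothetical.

```python
from math import prod

# Hypothetical samples: each dict is the label set of one reported series.
samples = [
    {"campaign_id": "c1", "tenant_id": "t1", "tenant_name": "acme",
     "channel": "whatsapp", "region": "ap-south-1"},
    {"campaign_id": "c1", "tenant_id": "t1", "tenant_name": "acme",
     "channel": "sms", "region": "ap-south-1"},
    {"campaign_id": "c2", "tenant_id": "t1", "tenant_name": "acme",
     "channel": "whatsapp", "region": "us-east-1"},
    # Same label set as the first entry, so it does not add a new series.
    {"campaign_id": "c1", "tenant_id": "t1", "tenant_name": "acme",
     "channel": "whatsapp", "region": "ap-south-1"},
]


def cardinality(label_sets):
    """Number of distinct label combinations actually observed."""
    return len({tuple(sorted(ls.items())) for ls in label_sets})


def worst_case(label_sets):
    """Upper bound: product of the distinct value counts of each label."""
    values = {}
    for ls in label_sets:
        for key, value in ls.items():
            values.setdefault(key, set()).add(value)
    return prod(len(v) for v in values.values())


observed = cardinality(samples)
upper = worst_case(samples)
print(observed, upper)
```

The observed count is usually well below the upper bound because labels are correlated (a campaign belongs to one tenant), but adding one more label multiplies the bound.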

  18. High Cardinality
    - A metric's cardinality growing beyond a safe limit
    - Exploding labels
    - Each metric has its own cardinality

  19. High Cardinality
    - A metric's cardinality growing beyond a safe limit
    - Exploding labels
    - Each metric has its own cardinality
    That’s 69120 combinations!!
    Some call this Active Time Series, at every reporting interval.

  20. Why is Cardinality relevant?
    - Answers to questions from Business, Product, App, Infrastructure
    - More answers lead to more questions
    - Real Time Information
    - Labels can pack insights
    - Aggregation

  21. Can a TSDB support Infinite Cardinality?

  22. Cost of High Cardinality
    - 💸 Money
    - 😥 Toil
    - 📈 Increased Resources
    - 🔥 Burn
    - ❌ Lack of answers

  23. Cost of Cardinality is exponential

  24. Cardinality needs to be handled

  25. Cardinality needs to be handled
    - Legit growth
    - Cardinality spikes caused by an incorrect change

  26. Cardinality needs to be handled, but where?

  27. Cardinality needs to be handled, but where?
    [Diagram: phases Instrumentation, Ingestion, Storage, Querying]
    - Inflight Aggregation, Cardinality Limiters, Usage/Unused filters, Cardinality Isolation
    - Cardinality Lakes, Rollups, Scale Out, Retention, Data Tiering
    - Alerting, Dashboards, SLOs
    - Annotations: "Too Early & Too Involved", "Best phase to handle cardinality", "Too Late & Too expensive"

  28. Handling High Cardinality
    - Relabel
    - Drop Labels
    - Split Metrics

  29. Handling High Cardinality - Relabel
    - Can be used in case of legit growth
    - During instrumentation phase
    - Affects developers
    - Affects SREs
    - May 🔥 resources at agent
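
A minimal sketch of relabeling, in the spirit of regex-based relabel rules: a label value is rewritten so that several series collapse into one. The `relabel` helper, the availability-zone style values and the regex are hypothetical and are not taken from the talk.

```python
import re


def relabel(labels, source, regex, replacement):
    """Return a copy of labels with `source` rewritten when regex fully matches."""
    out = dict(labels)
    value = out.get(source)
    if value is not None:
        match = re.fullmatch(regex, value)
        if match:
            out[source] = match.expand(replacement)
    return out


series = [
    {"channel": "whatsapp", "region": "ap-south-1a"},
    {"channel": "whatsapp", "region": "ap-south-1b"},
    {"channel": "whatsapp", "region": "ap-south-1c"},
]

# Strip the trailing zone letter so all three map to the same region value.
rewritten = [relabel(s, "region", r"(.+-\d+)[a-z]", r"\1") for s in series]

before = len({tuple(sorted(s.items())) for s in series})
after = len({tuple(sorted(s.items())) for s in rewritten})
print(before, after)
```

Series that collapse onto the same label set still need their values combined, which this sketch leaves out; that extra work is one reason relabeling can cost resources at the agent.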

  30. Handling High Cardinality - Drop Labels
    - Can be used in case the growth is not legit
    - During instrumentation phase
    - Affects developers
    - Affects SREs
    - Removes ability to get answers 😨
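
A rough sketch of what dropping a label does: the label is removed from every series and the values of series that now share a label set are summed. The data and the `drop_label` helper are hypothetical, and summing assumes counter-like values where a sum is meaningful.

```python
from collections import defaultdict


def drop_label(samples, label):
    """Remove `label` from every series and sum values that now collide."""
    totals = defaultdict(float)
    for labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k != label))
        totals[kept] += value
    return dict(totals)


samples = [
    ({"campaign_id": "c1", "channel": "whatsapp"}, 10.0),
    ({"campaign_id": "c2", "channel": "whatsapp"}, 5.0),
    ({"campaign_id": "c3", "channel": "sms"}, 7.0),
]

reduced = drop_label(samples, "campaign_id")
print(len(samples), len(reduced))
```

Three series become two, but per-campaign questions can no longer be answered from the stored data.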

  31. Handling High Cardinality - Split Metrics
    - Can be used in case the growth is legit
    - During instrumentation phase
    - Affects developers
    - Affects SREs
    - Affects queries and dashboards 😥
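
A back-of-the-envelope sketch of why splitting helps: one metric carrying every label has a worst case equal to the product of all label value counts, while two narrower metrics add their smaller products. The counts below are hypothetical.

```python
from math import prod

# Hypothetical number of distinct values per label.
label_values = {"campaign_id": 100, "tenant_id": 20, "channel": 4, "region": 3}


def worst_case(labels):
    """Upper bound on series for a metric carrying the given labels."""
    return prod(label_values[name] for name in labels)


# One metric with every label attached.
combined = worst_case(["campaign_id", "tenant_id", "channel", "region"])

# Two metrics, each carrying only the labels its consumers ask about.
split = worst_case(["campaign_id", "channel"]) + worst_case(["tenant_id", "region"])

print(combined, split)
```

The trade-off is that a question spanning both halves (for example, per campaign per region) can no longer be answered from a single metric, so queries and dashboards have to change.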

  32. Controls and Workflows to handle High Cardinality
    [Diagram: phases Instrumentation, Ingestion, Storage, Querying]
    - Inflight Aggregation, Cardinality Limiters, Usage/Unused filters, Cardinality Isolation
    - Cardinality Lakes, Rollups, Scale Out, Retention, Data Tiering
    - Alerting, Dashboards, SLOs
    - Annotations: "Too Early & Too Involved", "Best phase to handle cardinality", "Too Late & Too expensive"

  33. Handling High Cardinality - Stream Agg
    - Can be used in case the growth is legit
    - Before Ingestion phase
    - No performance penalty
    - Native PromQL support is bonus
    - Timestamp Awareness is a big plus
    - Affects queries and dashboards
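
A simplified sketch of the idea behind streaming aggregation: samples are grouped by time window and by the labels worth keeping, and only the aggregate is stored. This is not Levitate's implementation; the `stream_aggregate` function, the 60-second window and the data are assumptions for illustration, and summing assumes values for which a sum is meaningful.

```python
from collections import defaultdict


def stream_aggregate(samples, keep, window_s=60):
    """Sum samples per (window, kept labels) before they reach storage.

    samples: iterable of (timestamp_s, labels, value)
    keep: label names that survive aggregation
    """
    out = defaultdict(float)
    for ts, labels, value in samples:
        # Bucket by the sample's own timestamp, not by arrival time.
        window = ts - ts % window_s
        key = tuple((name, labels[name]) for name in keep)
        out[(window, key)] += value
    return dict(out)


incoming = [
    (120, {"campaign_id": "c1", "channel": "whatsapp", "region": "ap-south-1"}, 3.0),
    (130, {"campaign_id": "c2", "channel": "whatsapp", "region": "ap-south-1"}, 4.0),
    (185, {"campaign_id": "c1", "channel": "whatsapp", "region": "ap-south-1"}, 1.0),
]

aggregated = stream_aggregate(incoming, keep=["channel", "region"])
print(aggregated)
```

The stored series no longer carry `campaign_id`, so queries and dashboards must target the aggregated metric.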

  34. Handling High Cardinality - Stream Agg

  35. Handling High Cardinality - Isolation
    - Can be used in case the growth is legit
    - On demand
    - No performance penalty
    - Keeps everything else going

  36. Handling High Cardinality - Separate Lakes
    - Can be used in case the growth is legit
    - On demand
    - No performance penalty
    - Move high cardinality metrics to new storage
    - No change to queries, dashboards or ingestion

  37. Handling High Cardinality
    - Relabel
    - Drop Labels
    - Split Metrics
    - Streaming Aggregations
    - Scale Out
    - Moving high cardinality metrics to separate lake
    - Better Workflows like isolation, controls

  38. Handling High Cardinality
    - Everything is a tradeoff.
    - Having control over which options to choose is better.
    - Handling legit growth is necessary.
    - Cost trumps all. No free lunch.
    - It is possible to give answers to product, business, app, infra teams from
    metrics monitoring systems.

  39. Once Cardinality is not a problem, once
    knowledge is not a problem,
    what would you do with it?

  40. But who can answer all of these questions?
    - Business
    - Product
    - Application
    - Infrastructure
    Monitoring Systems

  41. Prathamesh Sonpatki
    Last9.io
    Srestories.dev
    o11y.wiki