Blazin' Fast PromQL

Grafana
August 20, 2019


(Presented at the Prometheus London Meetup, August 2019)

PromQL, the Prometheus Query Language, is a concise, powerful and increasingly popular language for querying time series data. But PromQL queries can take a long time when they have to consider >100k series and months of data. Even with Prometheus’ compression, a 90-day query over 200k series can touch ~100GB of data.
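(A rough back-of-the-envelope: assuming a 15-second scrape interval and roughly one byte per sample after compression, 200k series over 90 days is 200,000 × 90 × 5,760 ≈ 10^11 samples, i.e. on the order of 100GB.)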

In this talk, we will present a series of techniques employed by Cortex (a CNCF project for clustered Prometheus) to accelerate PromQL queries: query results caching, time slice parallelisation, aggregation sharding, and automatic recording rule substitution.
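
To give a flavour of the time slice parallelisation, here is a minimal sketch in Go (Cortex's implementation language) of the two rewrites the query frontend applies to a range query before anything is executed: aligning the start and end to the step, then splitting the range into day-sized sub-queries. The Request type and its field names are made up for the example; this is not the actual Cortex code.

    package main

    import (
        "fmt"
        "time"
    )

    // Request is a simplified stand-in for a PromQL range-query request.
    type Request struct {
        Query      string
        Start, End int64 // Unix milliseconds
        Step       int64 // milliseconds, assumed > 0
    }

    // stepAlign rounds Start and End down to a multiple of Step, so that the
    // same dashboard query issued a few seconds apart produces identical
    // sub-queries and therefore cacheable results.
    func stepAlign(r Request) Request {
        r.Start = r.Start - (r.Start % r.Step)
        r.End = r.End - (r.End % r.Step)
        return r
    }

    // splitByDay cuts a long range query into day-sized sub-queries that can
    // be cached individually and executed in parallel.
    func splitByDay(r Request) []Request {
        const day = int64(24 * time.Hour / time.Millisecond)
        var reqs []Request
        for start := r.Start; start < r.End; {
            end := (start/day + 1) * day // end of the UTC day containing start
            if end > r.End {
                end = r.End
            }
            reqs = append(reqs, Request{Query: r.Query, Start: start, End: end, Step: r.Step})
            start = end
        }
        return reqs
    }

    func main() {
        now := time.Now().UnixNano() / int64(time.Millisecond)
        req := stepAlign(Request{
            Query: `rate(http_duration_seconds_count{job="shipping"}[1m])`,
            Start: now - 7*24*3600*1000,
            End:   now,
            Step:  60 * 1000,
        })
        for _, sub := range splitByDay(req) {
            fmt.Println(sub.Start, sub.End, sub.Query)
        }
    }

Caching day-sized sub-queries rather than whole results means a 7-day dashboard query that is re-issued every 30 seconds only has to recompute the current, still-changing day.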

But there's more: we will show how you can use the same technology to get these improvements with Thanos and with plain Prometheus.


Transcript

  1. Blazin’ Fast PromQL Prometheus London Meetup, August 2019 @tom_wilkie

  2. Cortex is a time-series store built on Prometheus that is:

    - Horizontally scalable
    - Highly available
    - Long-term storage
    - Multi-tenant

    Cortex: horizontally scalable Prometheus. Cortex gives you:

    - A global view of as many metrics as you need
    - With no gaps in the charts
    - On durable, long-term storage
    - Across multiple tenants

    Cortex is a CNCF Sandbox project: github.com/cortexproject/cortex
  3. [Architecture diagram, ">1 yr ago": a Querier (PromQL Engine, Chunk Store, Ingester Client) reading from the Ingesters, a NoSQL Index and a Blob Store]
  4. Caching: [The same architecture, with an Index Memcached in front of the NoSQL Index and a Chunk Memcached in front of the Blob Store]
  5. More Caching: [The same architecture, with a Query Frontend and a Results Memcached added in front of the Querier]
  6. The Query Frontend pipeline, illustrated with rate(http_duration_seconds_count{job="shipping"}[1m]):

    1. Step align
    2. Split by day
    3. Cache lookup (a cache-key sketch follows below)
    4. Queue & parallel dispatch
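
    A rough sketch of what the cache lookup in step 3 could look like (illustrative Go only, not the actual Cortex middleware; the Cache interface and key format here are assumptions):

        package frontendcache // illustrative only

        import "fmt"

        // Cache is a stand-in for the results memcached.
        type Cache interface {
            Get(key string) ([]byte, bool)
            Set(key string, value []byte)
        }

        // resultsCacheKey keys one day-sized sub-query by tenant, query, day and
        // step, so a dashboard re-issuing the same query hits the cache for every
        // complete day and only the most recent, still-changing day is recomputed.
        func resultsCacheKey(tenant, query string, dayStartMs, stepMs int64) string {
            return fmt.Sprintf("%s:%s:%d:%d", tenant, query, dayStartMs, stepMs)
        }

        // lookupOrRun consults the results cache before dispatching a sub-query
        // to the downstream querier, and fills the cache on a miss.
        func lookupOrRun(c Cache, key string, run func() []byte) []byte {
            if res, ok := c.Get(key); ok {
                return res
            }
            res := run()
            c.Set(key, res)
            return res
        }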
  7. But wait! One more thing...

  8. https://github.com/cortexproject/cortex/pull/1441

  9. $ ./cortex \
         -config.file=./docs/prometheus-frontend.yml \
         -frontend.downstream-url=http://demo.robustperception.io:9090
     ...

    Try this query over 7 days:

        histogram_quantile(0.50,
          sum by (job, le) (
            rate(prometheus_http_request_duration_seconds_bucket[1m])
          )
        )
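
    Once the frontend is up, anything that speaks the Prometheus HTTP API (Grafana, curl, client libraries) can be pointed at it instead of the downstream Prometheus, and repeated or overlapping range queries then benefit from the results cache and the per-day parallelism. A minimal Go example using only the standard library (the frontend listen address is an assumption; substitute whatever your config uses):

        package main

        import (
            "fmt"
            "io"
            "net/http"
            "net/url"
            "time"
        )

        func main() {
            // Assumes the query frontend serves the Prometheus-compatible API on
            // localhost:9091; adjust to match your configuration.
            base := "http://localhost:9091/api/v1/query_range"

            end := time.Now()
            start := end.Add(-7 * 24 * time.Hour) // the 7-day query from the slide

            params := url.Values{}
            params.Set("query", `histogram_quantile(0.50, sum by (job, le) (rate(prometheus_http_request_duration_seconds_bucket[1m])))`)
            params.Set("start", fmt.Sprintf("%d", start.Unix()))
            params.Set("end", fmt.Sprintf("%d", end.Unix()))
            params.Set("step", "60")

            resp, err := http.Get(base + "?" + params.Encode())
            if err != nil {
                panic(err)
            }
            defer resp.Body.Close()

            body, _ := io.ReadAll(resp.Body)
            fmt.Println(string(body))
        }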
  10. What does the future hold?

    - Start sharding aggregations by series to accelerate high-cardinality queries (design doc); see the sketch below
    - Automatically replace sub-expressions with recording rules where appropriate?
    - Embed this as a library in Thanos...
    - Handle gaps from HA pairs...
    - What do you want to see?
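
    To give a flavour of the aggregation sharding idea (purely illustrative; the shard label and the exact rewrite are assumptions here, not necessarily what the design doc proposes): a sum over a high-cardinality selector can be rewritten into partial sums over disjoint shards of series, evaluated in parallel, and then re-aggregated.

        package main

        import (
            "fmt"
            "strings"
        )

        // shardSum rewrites sum by (le) (rate(metric[1m])) into partial sums over
        // hypothetical series shards that can be evaluated in parallel. The
        // "__shard__" label is made up for illustration; the inner aggregations
        // keep it so the union via "or" retains one result per shard, and the
        // outer sum adds the partials back together.
        func shardSum(metric string, shards int) string {
            partials := make([]string, 0, shards)
            for i := 0; i < shards; i++ {
                partials = append(partials, fmt.Sprintf(
                    `sum by (le, __shard__) (rate(%s{__shard__="%d_of_%d"}[1m]))`, metric, i, shards))
            }
            return "sum without (__shard__) (\n  " + strings.Join(partials, "\n  or\n  ") + "\n)"
        }

        func main() {
            fmt.Println(shardSum("prometheus_http_request_duration_seconds_bucket", 4))
        }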
  11. Thank You! @tom_wilkie https://github.com/cortexproject/cortex

  12. How do we compare to Trickster? (https://github.com/Comcast/trickster)

    - Reusable: a set of HTTP middlewares, usable as a library
    - Memcached for the "external" cache (vs Redis for Trickster)
    - We are multi-tenant
    - We split by day and execute in parallel
    - We have some rudimentary QoS / queueing / scheduling

    However:

    - No "Fast Forward" like Trickster
    - Trickster is more widely used