Cortex: Horizontally Scalable, Highly Available Prometheus Monitoring

Slide 1

Slide 1 text

Cortex: Horizontally Scalable, Highly Available Prometheus
Tom Wilkie, Nov 2018
@tom_wilkie

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

Prometheus
• A monitoring & alerting system.
• Inspired by Google’s BorgMon
• Originally built by SoundCloud in 2012
• Open Source, now part of the CNCF
• Simple text-based metrics format
• Multidimensional datamodel
• Rich, concise query language

Slide 4

Slide 4 text

Cortex
• Horizontally scalable Prometheus
• Distributed, fault tolerant architecture
• Long term storage
• Multitenant
github.com/cortexproject/cortex

Slide 5

Slide 5 text

16/06/2016 First design doc
25/08/2016 PromCon 2016 talk
25/10/2016 Renamed to Cortex
23/01/2017 Support for Recording Rules & Alerts
13/07/2017 BigTable support added
18/08/2017 PromCon 2017 talk
08/02/2018 Cassandra support added
20/09/2018 Join CNCF Sandbox
http://goo.gl/prdUYV

Slide 6

Slide 6 text

• >2 million samples/s
• >100 million timeseries
[Logo groups: Adopters, Users]

Slide 7

Slide 7 text

Community
• Commits from 37 contributors, spanning ~6 companies.
• Apache 2 license.
• Community mailing list + ~fortnightly call since Feb 2018.
• Establishing governance based on CNI.

Slide 8

Slide 8 text

• Horizontally Scalable
• Highly Available
• Long Term Storage
• Multitenant

Slide 9

Slide 9 text

Horizontally Scalable

Slide 10

Slide 10 text

Prometheus Scaling
[Diagram labels: Your Jobs, Your Apps, Your Infra; Scale Up; Manually Shard]

Slide 11

Slide 11 text

Cortex Scaling: Distributed Hash Table
[Diagram labels: series s; Cortex Distributor; hash(s); ring positions 0, 16, 32, 48; four Cortex Ingesters]
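The ring lookup suggested by this slide can be sketched in a few lines. This is a minimal illustration, not Cortex's implementation: the `Ring` class, the ring size of 64, the use of SHA-256 and the ingester names are all assumptions chosen to match the 0/16/32/48 labels on the diagram.

```python
import bisect
import hashlib


class Ring:
    """Toy token ring (hypothetical): a hash is owned by the first token >= it."""

    def __init__(self, tokens, size=64):
        # tokens: dict mapping token position -> ingester name (assumed layout)
        self._tokens = sorted(tokens)
        self._owners = dict(tokens)
        self._size = size

    def series_hash(self, series):
        # SHA-256 is an assumption for the sketch; any stable hash would do.
        digest = hashlib.sha256(series.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % self._size

    def owner_of_hash(self, h):
        # Find the first token at or after h, wrapping around the ring.
        i = bisect.bisect_left(self._tokens, h)
        if i == len(self._tokens):
            i = 0
        return self._owners[self._tokens[i]]

    def lookup(self, series):
        return self.owner_of_hash(self.series_hash(series))


ring = Ring({0: "ingester-0", 16: "ingester-1", 32: "ingester-2", 48: "ingester-3"})
print(ring.lookup('http_requests_total{job="api"}'))
```

Because the mapping depends only on the hash and the token positions, any distributor holding the same ring sends a given series to the same ingester, which is what lets the write path scale out by adding ingesters and tokens.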

Slide 12

Slide 12 text

Global View
[Diagram labels: us-central1, eu-west2; Your Jobs, Your Apps]
Can configure multiple datasources in Grafana…
…but then only see data for one Prometheus at a time.

Slide 13

Slide 13 text

Global View II
[Diagram labels: us-central1, eu-west2; Your Jobs, Your Apps; “global” Prometheus]
Can configure a “global” Prometheus to federate samples from “local” Prometheus…
…but in practice only propagate aggregates, have to preconfigure rules, hard to scale etc.

Slide 14

Slide 14 text

Global View III
[Diagram labels: us-central1, eu-west2; Your Jobs, Your Apps; “global” Cortex]
Or can push all data to a central Cortex cluster.
Cortex horizontal scalability allows it to scale to handle all the raw samples.

Slide 15

Slide 15 text

Highly Available

Slide 16

Slide 16 text

Prometheus HA
[Diagram labels: Your Jobs, Your Apps; two Alertmanagers]

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Cortex HA: Dynamo-style replication
[Diagram labels: series s; Cortex Distributor; three Cortex Ingesters; Cortex Querier]
Distributor replicates samples on ingest. Waits for N/2 ACKs from ingesters to ensure consistency.
Querier de-dupes samples on read, again only waiting for N/2 responses.
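The replicate-on-write, de-dupe-on-read flow can be sketched as follows. This is a toy model under stated assumptions, not Cortex code: the `Ingester` class and the `replicate` and `query_deduped` helpers are hypothetical, and the threshold used here is a strict majority (`n // 2 + 1`), whereas the slide states it simply as N/2.

```python
class Ingester:
    """In-memory stand-in for an ingester (hypothetical)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.samples = {}  # series -> list of (timestamp, value)

    def push(self, series, ts, value):
        if not self.healthy:
            return False  # no ACK from an unavailable replica
        self.samples.setdefault(series, []).append((ts, value))
        return True

    def query(self, series):
        if not self.healthy:
            return None  # no response from an unavailable replica
        return list(self.samples.get(series, []))


def quorum(n):
    # Assumption: strict majority of the n replicas.
    return n // 2 + 1


def replicate(ingesters, series, ts, value):
    """Write to every replica; succeed once a quorum has acknowledged."""
    acks = sum(1 for ing in ingesters if ing.push(series, ts, value))
    return acks >= quorum(len(ingesters))


def query_deduped(ingesters, series):
    """Read from the replicas and merge duplicates by timestamp."""
    responses = [r for r in (ing.query(series) for ing in ingesters) if r is not None]
    if len(responses) < quorum(len(ingesters)):
        raise RuntimeError("not enough ingesters responded")
    merged = {}
    for resp in responses:
        for ts, value in resp:
            merged[ts] = value  # identical samples collapse to one entry
    return sorted(merged.items())


replicas = [Ingester("a"), Ingester("b"), Ingester("c", healthy=False)]
print(replicate(replicas, "up", 1, 1.0))
print(query_deduped(replicas, "up"))
```

With three replicas and one of them down, the write still succeeds and the read returns each sample once, which is the availability property the slide describes.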

Slide 19

Slide 19 text

Long Term Storage

Slide 20

Slide 20 text

durability /dʒɔːrəˈbɪlɪti/ noun
1. the ability to withstand wear, pressure, or damage.
“the reliability and durability of plastics”

Slide 21

Slide 21 text

Durability is hard…
• AWS DynamoDB
• Google Cloud Bigtable
• Apache Cassandra
…let someone else deal with it.

Slide 22

Slide 22 text

• Why not just write the samples straight to the NoSQL DB?
• By building & flushing chunks, Cortex acts as a “write deamplifier”, massively reducing cost.
• The NoSQL DBs also don’t necessarily support the right indexes for executing PromQL queries. Cortex adds these.
[Diagram labels: series s; 30k samples/s; 450k series; ~10 IOPs]
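The “write deamplifier” idea can be illustrated with a small buffer that collects samples per series and writes one chunk per batch instead of one row per sample. This is a sketch only: `ChunkBuffer`, the fixed `samples_per_chunk` threshold and the list standing in for the NoSQL DB are assumptions, and real chunks would also be flushed on age and compressed.

```python
class ChunkBuffer:
    """Buffers samples in memory and flushes whole chunks (hypothetical)."""

    def __init__(self, samples_per_chunk, store):
        self.samples_per_chunk = samples_per_chunk
        self.store = store  # a plain list standing in for the NoSQL DB
        self.open_chunks = {}  # series -> samples not yet written

    def append(self, series, ts, value):
        chunk = self.open_chunks.setdefault(series, [])
        chunk.append((ts, value))
        if len(chunk) >= self.samples_per_chunk:
            self.flush(series)

    def flush(self, series):
        # One store write per chunk, however many samples it holds.
        chunk = self.open_chunks.pop(series, None)
        if chunk:
            self.store.append((series, chunk))


store = []
buf = ChunkBuffer(samples_per_chunk=100, store=store)
for ts in range(1000):
    buf.append("up", ts, 1.0)
print(len(store))  # 10 writes for 1000 samples
```

Taken at face value, the figures on the slide (30k samples/s against roughly 10 IOPs) correspond to about 3,000 samples per storage operation.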

Slide 23

Slide 23 text

Multitenant

Slide 24

Slide 24 text

Pod-per-tenant
[Diagram labels: series s; Auth / Frontend; Automated Provisioning]
Multitenant
[Diagram labels: series s; Auth / Frontend]
Natively multitenant services handle different users within the same process.
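A natively multitenant service can be pictured as one process whose storage is keyed by tenant. The sketch below is illustrative only: `MultitenantStore` and its methods are hypothetical names, and it assumes the auth frontend has already resolved each request to a tenant ID.

```python
class MultitenantStore:
    """One shared process and map; every key is scoped by tenant (hypothetical)."""

    def __init__(self):
        self._data = {}  # (tenant_id, series) -> list of samples

    def push(self, tenant_id, series, sample):
        if not tenant_id:
            # Refuse unscoped writes so data cannot land in a shared namespace.
            raise ValueError("tenant_id is required")
        self._data.setdefault((tenant_id, series), []).append(sample)

    def query(self, tenant_id, series):
        # A tenant can only ever address keys under its own ID.
        return list(self._data.get((tenant_id, series), []))


store = MultitenantStore()
store.push("tenant-a", "up", (1, 1.0))
store.push("tenant-b", "up", (1, 0.0))
print(store.query("tenant-a", "up"))
```

Both tenants use the same series name and the same process, yet each query sees only its own samples; keeping that scoping correct everywhere is the "takes work" part of the trade-off.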

Slide 25

Slide 25 text

Pod-per-tenant
Pros
• No application modifications necessary.
• Effectively zero chance of “leakage” between tenants.
Cons
• Cattle-not-pets
• Provisioning automation hides a lot of complexity…

Multitenant
Pros
• Per-tenant marginal costs can be close to zero
• Can take advantage of statistical multiplexing.
• Reduced provisioning complexity can be traded for more “interesting” architecture.
Cons
• Takes work…

Slide 26

Slide 26 text

• Horizontally Scalable
• Highly Available
• Long Term Storage
• Multitenant

Slide 27

Slide 27 text

More Reading
• PromCon 2016 talk
• KubeCon 2016 talk
• PromCon 2017 talk
• Original design doc
• CNCF TOC Presentation
• Amazon’s Dynamo Paper

Slide 28

Slide 28 text

Get Involved!
github.com/cortexproject/cortex
#cortex on slack.cncf.io
@tom_wilkie

Slide 29

Slide 29 text

What is Grafana Cloud?
Grafana Cloud is a hosted and fully managed SaaS metrics platform that helps Ops and Dev teams using Grafana to understand the behavior of their applications and infrastructure. Grafana Cloud allows users to provision and manage the best open source observability tools, Grafana and Prometheus, all through a simple UI and single API.
Grafana Cloud: Your complete, fully managed, hosted metrics platform. Store, visualize and alert without the headache of scaling or managing your own monitoring stack.