Cortex: Horizontally Scalable, Highly Available Prometheus Monitoring

Grafana
November 07, 2018

In this talk we present Cortex - a horizontally scalable, highly available Prometheus implementation. Like Prometheus, Cortex is a CNCF (sandbox) project.

Cortex turns many of Prometheus's architectural assumptions on their head by marrying a scale-out PromQL query engine with a storage layer based on NoSQL databases such as Bigtable, DynamoDB and Cassandra. We have disaggregated the Prometheus binary into a microservices-style architecture, with separate services for query, ingest, alerting and recording rules. Because all of these services are designed as fungible replicas, the system scales out easily and the failure of any individual replica is handled gracefully.

Transcript

  1. Cortex: Horizontally Scalable,
    Highly Available Prometheus
    Tom Wilkie, Nov 2018

    @tom_wilkie

  3. Prometheus
    • A monitoring & alerting system.

    • Inspired by Google’s BorgMon

    • Originally built by SoundCloud in 2012

    • Open Source, now part of the CNCF

    • Simple text-based metrics format

    • Multidimensional data model

    • Rich, concise query language

  4. Cortex
    • Horizontally scalable Prometheus

    • Distributed, fault tolerant architecture

    • Long term storage

    • Multitenant

    github.com/cortexproject/cortex

  5. 16/06/2016 First design doc

    25/08/2016 PromCon 2016 talk

    25/10/2016 Renamed to Cortex

    23/01/2017 Support for Recording Rules & Alerts

    13/07/2017 BigTable support added

    18/08/2017 PromCon 2017 talk

    08/02/2018 Cassandra support added

    20/09/2018 Join CNCF Sandbox
    http://goo.gl/prdUYV

  6. >2 million samples/s

    >100 million timeseries
    Adopters Users

  7. Community
    • Commits from 37 contributors, spanning ~6 companies.

    • Apache 2 license.

    • Community mailing list + ~fortnightly call since Feb 2018.

    • Establishing governance based on CNI.

  8. Horizontally Scalable

    Highly Available

    Long Term Storage

    Multitenant

  9. Horizontally Scalable

  10. Prometheus Scaling
    [Diagram: two approaches, labelled “Scale Up” and “Manually Shard”, each showing groups of “Your Jobs” under “Your Apps” and “Your Infra”.]

  11. Cortex Scaling: Distributed Hash Table
    [Diagram: a Cortex Distributor computes hash(s) for an incoming sample s and routes it to one of four Cortex Ingesters, placed on a ring at tokens 0, 16, 32 and 48.]
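
One way to picture the ring on this slide is the minimal Python sketch below. It is not Cortex's actual code. Only the tokens 0, 16, 32 and 48 come from the slide; the 64-slot token space, the ingester names and the use of SHA-256 are assumptions made for illustration. The sketch also assumes a series is owned by the first token at or after its hash, wrapping around at the end of the ring.

```python
import bisect
import hashlib

RING_SIZE = 64  # assumed token space, chosen so the slide's tokens divide it evenly

# Tokens are the ones shown on the slide; the ingester names are hypothetical.
TOKENS = [0, 16, 32, 48]
OWNERS = {0: "ingester-0", 16: "ingester-1", 32: "ingester-2", 48: "ingester-3"}


def series_hash(labels: dict) -> int:
    """Hash a series' label set to a position on the ring (SHA-256 is an assumption)."""
    # Sort the labels so the same series always hashes to the same position.
    key = ",".join(f"{name}={value}" for name, value in sorted(labels.items()))
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % RING_SIZE


def owner_for(position: int) -> str:
    """Return the ingester holding the first token at or after position, wrapping around."""
    index = bisect.bisect_left(TOKENS, position) % len(TOKENS)
    return OWNERS[TOKENS[index]]


if __name__ == "__main__":
    labels = {"__name__": "http_requests_total", "job": "api"}
    position = series_hash(labels)
    print(f"hash(s) = {position}, routed to {owner_for(position)}")
```

Because every distributor computes the same hash for the same series, any distributor replica can route a sample without coordinating with the others, which is what lets the distributors themselves be scaled out.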

  12. Global View
    [Diagram: two regions, us-central1 and eu-west2, each with its own “Your Jobs” and “Your Apps”.]

    Can configure multiple data sources in Grafana…

    …but then only see data for one Prometheus at a time.

  13. Global View II
    [Diagram: two regions, us-central1 and eu-west2, each with its own “Your Jobs” and “Your Apps”, plus a “global” Prometheus.]

    Can configure a “global” Prometheus to federate samples from “local” Prometheus servers…

    …but in practice only propagate aggregates, have to preconfigure rules, hard to scale etc.

  14. Global View III
    [Diagram: two regions, us-central1 and eu-west2, each with its own “Your Jobs” and “Your Apps”, plus a “global” Cortex.]

    Or can push all data to a central Cortex cluster.

    Cortex's horizontal scalability lets it handle all the raw samples.

  15. Highly Available

  16. Prometheus HA
    [Diagram: “Your Jobs” and “Your Apps”, with two Alertmanager instances.]

  18. Cortex HA: Dynamo-style replication
    [Diagram: a Cortex Distributor writes sample s to three Cortex Ingesters; a Cortex Querier reads s back from them.]

    Distributor replicates samples on ingest. Waits for N/2 ACKs from ingesters to ensure consistency.

    Querier de-dupes samples on read - again, only waiting for N/2 responses.
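
The write and read paths described on this slide can be sketched roughly as follows. This is an illustrative Python model, not Cortex's implementation. The `Ingester` class and the `replicate` and `query_dedupe` helpers are hypothetical names. The slide says "N/2"; the sketch assumes this means a strict majority (N // 2 + 1), which is an interpretation rather than something the slide states. Requests are issued one after another here for simplicity, whereas a real system would send them concurrently.

```python
from typing import Dict, List, Optional, Tuple

Sample = Tuple[int, float]  # (timestamp in ms, value)


class Ingester:
    """Hypothetical in-memory ingester that can be marked unhealthy."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy
        self.samples: Dict[str, List[Sample]] = {}

    def push(self, series: str, sample: Sample) -> bool:
        """Store a sample and return True as the ACK; False if this replica is down."""
        if not self.healthy:
            return False
        self.samples.setdefault(series, []).append(sample)
        return True

    def query(self, series: str) -> Optional[List[Sample]]:
        """Return stored samples, or None if this replica is down."""
        if not self.healthy:
            return None
        return list(self.samples.get(series, []))


def quorum(n: int) -> int:
    """Assumed quorum size: a strict majority of the n replicas."""
    return n // 2 + 1


def replicate(ingesters: List[Ingester], series: str, sample: Sample) -> bool:
    """Distributor side: write to every replica, succeed once a quorum has ACKed."""
    acks = sum(1 for ingester in ingesters if ingester.push(series, sample))
    return acks >= quorum(len(ingesters))


def query_dedupe(ingesters: List[Ingester], series: str) -> List[Sample]:
    """Querier side: read from the replicas and collapse duplicate timestamps."""
    responses = [r for r in (i.query(series) for i in ingesters) if r is not None]
    if len(responses) < quorum(len(ingesters)):
        raise RuntimeError("not enough ingesters responded")
    merged: Dict[int, float] = {}
    for response in responses:
        for timestamp, value in response:
            merged[timestamp] = value  # replicas hold identical copies, so overwriting is safe
    return sorted(merged.items())
```

With three replicas, one ingester can be down on both the write and the read path and the sample is still accepted and returned exactly once.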

  19. Long Term Storage

  20. durability
    /dʒɔːrəˈbɪlɪti/
    noun
    1. the ability to withstand wear, pressure, or damage.
    “the reliability and durability of plastics”

  21. Durability is hard…
    AWS DynamoDB
    Google Cloud Bigtable
    Apache Cassandra

    …let someone else deal with it.

  22. • Why not just write the samples straight to the NoSQL DB?

    • By building & flushing chunks, Cortex acts as a “write deamplifier”, massively reducing cost.

    • The NoSQL DBs also don’t necessarily support the right indexes for executing PromQL queries. Cortex adds these.

    Figures shown: 30k samples/s, 450k series, ~10 IOPs.
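
A back-of-the-envelope reading of the three figures on this slide, as a small Python calculation. It assumes that the ~10 IOPs is the write rate to the NoSQL store for that ingest load, and that each write operation persists one chunk for one series. Both are simplifying assumptions; the slide does not say how index writes or batching are counted.

```python
# Figures taken from the slide.
SAMPLES_PER_SECOND = 30_000
ACTIVE_SERIES = 450_000
WRITE_IOPS = 10

# Average gap between samples of the same series.
sample_interval_s = ACTIVE_SERIES / SAMPLES_PER_SECOND

# Assumption: one write operation flushes one chunk for one series,
# so each series is flushed once every ACTIVE_SERIES / WRITE_IOPS seconds.
seconds_between_flushes = ACTIVE_SERIES / WRITE_IOPS
samples_per_chunk = seconds_between_flushes / sample_interval_s

# Writes avoided compared with one write per sample.
deamplification = SAMPLES_PER_SECOND / WRITE_IOPS

if __name__ == "__main__":
    print(f"sample interval per series: {sample_interval_s:.0f} s")
    print(f"time between flushes:       {seconds_between_flushes / 3600:.1f} h")
    print(f"samples per chunk:          {samples_per_chunk:.0f}")
    print(f"write deamplification:      {deamplification:.0f}x")
```

Under these assumptions each series is sampled every 15 seconds and flushed roughly every 12.5 hours, so about 3,000 samples are folded into each write.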

  23. Multitenant

  24. Pod-per-tenant
    [Diagram: an Auth / Frontend in front of per-tenant pods, with Automated Provisioning.]

    Multitenant
    [Diagram: an Auth / Frontend in front of shared services.]
    Natively multi-tenant services handle different users within the same process
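
As a rough illustration of what "different users within the same process" can look like, the Python sketch below keys every read and write by a tenant ID taken from a request header. The `MultiTenantStore` class is hypothetical. Cortex carries the tenant ID in the `X-Scope-OrgID` header; everything else here (the in-memory dictionaries, the method names, the error type) is an assumption for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Mapping, Tuple

Sample = Tuple[int, float]  # (timestamp in ms, value)

TENANT_HEADER = "X-Scope-OrgID"  # header carrying the tenant ID


class MultiTenantStore:
    """Hypothetical single-process store that isolates series by tenant ID."""

    def __init__(self) -> None:
        # tenant ID -> series name -> samples
        self._by_tenant: Dict[str, Dict[str, List[Sample]]] = defaultdict(dict)

    @staticmethod
    def _tenant(headers: Mapping[str, str]) -> str:
        """Extract the tenant ID, rejecting requests that do not carry one."""
        tenant = headers.get(TENANT_HEADER)
        if not tenant:
            raise PermissionError("missing tenant ID")
        return tenant

    def push(self, headers: Mapping[str, str], series: str, sample: Sample) -> None:
        """Append a sample to the caller's own copy of the series."""
        tenant = self._tenant(headers)
        self._by_tenant[tenant].setdefault(series, []).append(sample)

    def query(self, headers: Mapping[str, str], series: str) -> List[Sample]:
        """Return only the caller's samples, even if other tenants use the same series name."""
        tenant = self._tenant(headers)
        return list(self._by_tenant.get(tenant, {}).get(series, []))
```

Two tenants can push a series with the same name and never see each other's samples, and adding a tenant costs no new process or pod.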

  25. Pod-per-tenant
    Pros
    • No application modifications necessary.

    • Effectively zero chance of “leakage” between tenants.

    Cons
    • Cattle-not-pets

    • Provisioning automation hides a lot of complexity…

    Multitenant
    Pros
    • Per-tenant marginal costs can be close to zero

    • Can take advantage of statistical multiplexing.

    • Reduced provisioning complexity can be traded for more “interesting” architecture.

    Cons
    • Takes work…

  26. Horizontally Scalable

    Highly Available

    Long Term Storage

    Multitenant

  27. More Reading
    • PromCon 2016 talk

    • KubeCon 2016 talk

    • PromCon 2017 talk

    • Original design doc

    • CNCF TOC Presentation

    • Amazon’s Dynamo Paper

  28. Get Involved!
    github.com/cortexproject/cortex

    #cortex on slack.cncf.io

    @tom_wilkie, [email protected]

  29. Grafana Cloud: Your complete, fully managed, hosted metrics platform.
    Store, visualize and alert without the headache of scaling or managing your own monitoring stack.

    What is Grafana Cloud?
    Grafana Cloud is a hosted and fully managed SaaS metrics platform that helps Ops and Dev teams using Grafana to understand the behavior of their applications and infrastructure.

    Grafana Cloud allows users to provision and manage the best open source observability tools - Grafana and Prometheus - all through a simple UI and single API.