Slide 14
© 2019 Volterra Inc. All Rights Reserved.
Key Findings at Scale - Monitoring
● New Prometheus federation filters to drop unused metrics and labels
○ Initially we had around 50k time series per CE, with an average of 15 labels each.
○ We optimized it to 2k per CE with average
○ Simple whitelists for metric names and blacklists for label names
● Move from global Prometheus federation to Cortex cluster
○ A centralized Prometheus scraped the Prometheus instances of all REs and CEs.
○ At 1k CEs, this became unsustainable.
○ Currently: one Prometheus per RE (federating the Prometheus instances of its connected CEs) with remote write (RW) to Cortex
● Elasticsearch clusters and logs
○ Decentralized logging architecture
○ Fluent Bit as the collector on each node forwards logs to Fluentd (the aggregator) in the RE
○ Elasticsearch deployed in every RE, using remote cluster search to query logs from a single Kibana instance
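
The slide does not show the filters themselves, so what follows is only a minimal sketch of what a metric-name whitelist combined with a label-name blacklist does to a set of time series. The metric and label names in both lists are hypothetical, and so are the two helper functions. In Prometheus, this kind of filtering is typically expressed as `match[]` selectors on the `/federate` endpoint plus `labeldrop` relabel rules; the second helper only builds such a selector string.

```python
from typing import Dict, Iterable, List

# Hypothetical lists, chosen only for illustration (not from the slide).
METRIC_WHITELIST = {"node_cpu_seconds_total", "node_memory_MemAvailable_bytes"}
LABEL_BLACKLIST = {"pod_template_hash", "controller_revision_hash"}

# A series is modelled as label name -> value; "__name__" holds the metric name.
Series = Dict[str, str]


def filter_series(series: Iterable[Series]) -> List[Series]:
    """Keep only whitelisted metrics and strip blacklisted labels from them."""
    kept = []
    for s in series:
        if s.get("__name__") not in METRIC_WHITELIST:
            continue  # metric is not on the whitelist: drop the whole series
        kept.append({k: v for k, v in s.items() if k not in LABEL_BLACKLIST})
    return kept


def federate_match_param(names: Iterable[str]) -> str:
    """Build a selector usable as a match[] value on the /federate endpoint."""
    return '{__name__=~"' + "|".join(sorted(names)) + '"}'
```

Note that dropping a label can make two previously distinct series identical, so a label blacklist should only contain labels that are not needed to tell series apart.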