Monitoring SLI - Service Level Indicator
Depends on the team:
• ops: CPU, memory, disk I/O, networking, nodes available, pods running, messages on queue
• devs: response time, requests per second
• data engineers: time to run an ETL job, how much data is being processed, the freshness of the data
Monitoring SLO - Service Level Objectives
• Involve the customer in defining them
• Breaches must alert the team
• Use realistic objectives
• Reevaluate the values periodically
Monitoring SLA - Service Level Agreement
• The SLO should be stricter than the SLA, so that an SLO breach alerts the team before the SLA is breached
• Pay attention to the agreement and honor it
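
As a rough illustration of how the two thresholds relate, the sketch below classifies a measured availability against an SLO and an SLA. The 99.9 and 99.5 targets and the function name classifyAvailability are assumptions made up for this example, not values from any real agreement.

```php
<?php
// Hypothetical targets, for illustration only.
const SLO_AVAILABILITY = 99.9; // internal objective: stricter, breached first
const SLA_AVAILABILITY = 99.5; // customer agreement: looser, breached last

// Classify a measured availability (in percent) against both thresholds.
function classifyAvailability(float $measuredPercent): string
{
    if ($measuredPercent < SLA_AVAILABILITY) {
        return 'sla_breach'; // the agreement itself is violated
    }
    if ($measuredPercent < SLO_AVAILABILITY) {
        return 'slo_breach'; // alert the team while the SLA still holds
    }
    return 'ok';
}

echo classifyAvailability(99.7), PHP_EOL; // prints "slo_breach"
```

The gap between the two targets is the margin the team has to react after the first alert.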
Prometheus
From the Greek Promēthéus, "forethought". He is a second-generation Titan, son of Iapetus (himself born of the incestuous union of Uranus and Gaia) and brother of Atlas, Epimetheus and Menoetius. He was a defender of humanity, responsible for stealing Hestia's fire and giving it to mortals.
Prometheus
• Metrics platform
• Started in 2012 at SoundCloud
• Open-sourced and published in 2015
• Second project under the CNCF (Cloud Native Computing Foundation)
• Can also fire and manage alerts
• Stores metrics in a time series database (TSDB)
Prometheus
• Pull-based model (scale the exporter)
• Good for telemetry metrics and statistical metrics
• Known alternatives: Graphite / collectd / Carbon, Zabbix (mostly push based)
Prometheus
Advantages:
• Written in Go
• HTTP-based communication
• Service discovery integration (Kubernetes, Swarm, Consul, AWS, GCP, etc.)
• Dashboard for alert management
• Dashboard for query debugging
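Tips
To set up a counter
    // Fetch the shared default registry.
    $registry = \Prometheus\CollectorRegistry::getDefault();
    // Arguments: namespace, name, help text, label names.
    $counter = $registry->getOrRegisterCounter('demo', 'visitor_counter', 'it increases', ['type']);
    // Increase the counter by 3 for the series labeled type="blue".
    $counter->incBy(3, ['blue']);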
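Tips
To set up a gauge
    // Fetch the shared default registry.
    $registry = \Prometheus\CollectorRegistry::getDefault();
    // Arguments: namespace, name, help text, label names.
    $gauge = $registry->getOrRegisterGauge('demo', 'score', 'it sets', ['type']);
    // Set the gauge to 2.5 for the series labeled type="blue".
    $gauge->set(2.5, ['blue']);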
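Tips
To show the metrics to be scraped
    $registry = \Prometheus\CollectorRegistry::getDefault();
    // Render every registered metric in the Prometheus text format.
    $renderer = new \Prometheus\RenderTextFormat();
    $result = $renderer->render($registry->getMetricFamilySamples());
    // Serve the result with the content type Prometheus expects.
    header('Content-Type: ' . \Prometheus\RenderTextFormat::MIME_TYPE);
    echo $result;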
Tips
Start with the RED method (Rate, Errors, Duration)
Set up the following query:
    (ud:itentity:rate_10m < bool 1000) * 100
    + (ud:error:percent_10m > bool 1.5) * 10
    + (ud:read:duration_p99_10m < bool 25) * 1
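
To see how the `bool` comparisons combine, here is a sketch of the same arithmetic in plain PHP. Each comparison yields 0 or 1 and is weighted by 100, 10, or 1, so the three flags land in separate decimal digits of one number. The thresholds and comparison directions are copied from the query as written; the function name redStatusCode and the sample values are assumptions for illustration.

```php
<?php
// Mirror of the query's arithmetic: one decimal digit per RED signal.
function redStatusCode(float $rate, float $errorPercent, float $durationP99): int
{
    return (int) ($rate < 1000) * 100        // hundreds digit: Rate flag
         + (int) ($errorPercent > 1.5) * 10  // tens digit: Errors flag
         + (int) ($durationP99 < 25) * 1;    // units digit: Duration flag
}

// Made-up sample values: rate 800, error percent 2.0, p99 duration 30.
echo redStatusCode(800.0, 2.0, 30.0), PHP_EOL; // prints "110"
```

Because each flag owns its own digit, a single value is enough to tell which of the three signals tripped.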
Tips
Define a dashboard in Grafana that maps the following results:
    111 = x Rate, x Errors, x Duration
    110 = x Rate, x Errors
    101 = x Rate, x Duration
    100 = x Rate
    011 = x Errors, x Duration
    010 = x Errors
    001 = x Duration
    000 = Ok
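
The mapping can also be read digit by digit. The sketch below decodes a result value into the list of flagged signals; in Grafana itself the mapping would normally be configured on the panel, so the function name decodeRedStatus is a hypothetical helper used only to show the logic.

```php
<?php
// Decode a RED status value (0 to 111) into the names of the flagged signals.
function decodeRedStatus(int $code): array
{
    $flagged = [];
    if (intdiv($code, 100) % 10 === 1) {
        $flagged[] = 'Rate';     // hundreds digit
    }
    if (intdiv($code, 10) % 10 === 1) {
        $flagged[] = 'Errors';   // tens digit
    }
    if ($code % 10 === 1) {
        $flagged[] = 'Duration'; // units digit
    }
    return $flagged;             // an empty list corresponds to "Ok"
}

echo implode(', ', decodeRedStatus(101)), PHP_EOL; // prints "Rate, Duration"
```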