Improving Observability with Prometheus (Darkmira Tour PHP 2020)

Talk presented online on December 13th at Darkmira Tour PHP 2020 (https://php.darkmiratour.rocks/2020/schedule.html). Demo available at https://github.com/wsilva/darkmira-prometheus-php-demo

Wellington F. Silva

December 13, 2020

Transcript

  1. Wellington F. Silva contact: @_wsilva nicks: wsilva, boina, tom, fisi*

    Roles: father, husband, telecom technician, programmer, sysadmin, Docker community leader, instructor, writer, Zend Certified Engineer and Docker Certified Associate, Certified Kubernetes Administrator * in deprecation
  2. Observability 3 pillars: • Metrics • Logging • Tracing •

    Events (kind of new) - MELT (Metrics, Events, Logs, Traces), or 4 golden signals
  3. Observability • Better deployments • Improve time to market •

    Less toil • Avoid premature optimisation
  4. Observability • Better deployments • Improve time to market •

    Less toil • Avoid premature optimisation • Improve resource utilisation
  5. Observability • Better deployments • Improve time to market •

    Less toil • Avoid premature optimisation • Improve resource utilisation • Lower costs
  6. Observability • Demands effort in coding and configuring • Could

    extend time to delivery • Constantly neglected
  7. Monitoring • Subset of observability • Shows the points where we

    start to dig • Makes it easier and faster to find bottlenecks
  8. Monitoring SLI - Service Level Indicator Depends on the team:

    • ops: cpu, memory, disk io, networking, nodes available, pods running, messages on queue
  9. Monitoring SLI - Service Level Indicator Depends on the team:

    • ops: cpu, memory, disk io, networking, nodes available, pods running, messages on queue • devs: response time, requests per second
  10. Monitoring SLI - Service Level Indicator Depends on the team:

    • ops: cpu, memory, disk io, networking, nodes available, pods running, messages on queue • devs: response time, requests per second • data engineers: time to run an ETL job, how much data is being processed, the freshness of the data
  11. Monitoring SLO - Service Level Objectives • We should involve

    the customer to help define them • Breaches must alert the team
  12. Monitoring SLO - Service Level Objectives • We should involve

    the customer to help define them • Breaches must alert the team • Use realistic objectives
  13. Monitoring SLO - Service Level Objectives • We should involve

    the customer to help define them • Breaches must alert the team • Use realistic objectives • Reevaluate the values periodically
  14. Monitoring SLA - Service Level Agreement • Should be less strict

    than the SLO, so that an SLO breach alerts the team before the SLA is breached
  15. Monitoring SLA - Service Level Agreement • Should be less strict

    than the SLO, so that an SLO breach alerts the team before the SLA is breached • Pay attention to the agreement and honor it
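
A minimal sketch of the SLI / SLO / SLA relationship described in the slides above, assuming an availability SLI computed from request counts. The function name `evaluateAvailability` and the 99.9% and 99.5% thresholds are hypothetical, chosen only so that the SLO trips before the SLA does.

```php
<?php
declare(strict_types=1);

/**
 * Classify an availability SLI against an SLO and an SLA.
 * All names and thresholds here are illustrative assumptions.
 */
function evaluateAvailability(int $total, int $failed, float $sloPercent, float $slaPercent): string
{
    if ($total <= 0) {
        return 'no-data';
    }
    // SLI: percentage of requests served successfully.
    $sli = 100.0 * ($total - $failed) / $total;

    if ($sli < $slaPercent) {
        return 'sla-breached';   // the agreement itself is violated
    }
    if ($sli < $sloPercent) {
        return 'slo-breached';   // alert the team while the SLA still holds
    }
    return 'ok';
}

echo evaluateAvailability(100000, 50, 99.9, 99.5), PHP_EOL;   // SLI = 99.95
echo evaluateAvailability(100000, 300, 99.9, 99.5), PHP_EOL;  // SLI = 99.70
echo evaluateAvailability(100000, 900, 99.9, 99.5), PHP_EOL;  // SLI = 99.10
```

The middle case is the interesting one: the objective is missed, so the team is alerted, but the agreement with the customer is still being honored.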
  16. Prometheus From the Greek Promēthéus, "forethought". He is a Titan

    (second generation), son of Iapetus (son of Uranus, from the incestuous union of Uranus and Gaia) and brother of Atlas, Epimetheus and Menoetius. He was a defender of humanity, responsible for stealing Hestia's fire and giving it to mortals.
  17. Prometheus • Metrics platform • Started in 2012 at SoundCloud

    • Open-sourced and published in 2015 • Second project to join the CNCF (Cloud Native Computing Foundation)
  18. Prometheus • Metrics platform • Started in 2012 at SoundCloud

    • Open-sourced and published in 2015 • Second project to join the CNCF (Cloud Native Computing Foundation) • Can also fire and manage alerts
  19. Prometheus • Metrics platform • Started in 2012 at SoundCloud

    • Open-sourced and published in 2015 • Second project to join the CNCF (Cloud Native Computing Foundation) • Can also fire and manage alerts • Stores metrics in a time series database (TSDB)
  20. Prometheus • Pull based model (scale the exporter) • Good

    for telemetry metrics and statistical metrics
  21. Prometheus • Pull based model (scale the exporter) • Good

    for telemetry metrics and statistical metrics • Known alternatives: Graphite / collectd / Carbon, Zabbix (all push based)
  22. Prometheus Disadvantages: • Not easy to scale horizontally •

    No query cache • PromQL instead of regular SQL
  23. Prometheus Advantages: • Written in Go • HTTP-based

    communication • Service discovery integration (Kubernetes, Swarm, Consul, AWS, GCP, etc.)
  24. Prometheus Advantages: • Written in Go • HTTP-based

    communication • Service discovery integration (Kubernetes, Swarm, Consul, AWS, GCP, etc.) • Dashboard for alert management
  25. Prometheus Advantages: • Written in Go • HTTP-based

    communication • Service discovery integration (Kubernetes, Swarm, Consul, AWS, GCP, etc.) • Dashboard for alert management • Dashboard for query debugging
  26. Prometheus Advantages: • Multidimensional data model • Easy to set

    up with Grafana • PromQL (kind of functional style, powerful for calculations)
  27. Tips To set up a counter:

    $registry = \Prometheus\CollectorRegistry::getDefault();
    $counter = $registry->getOrRegisterCounter('demo', 'visitor_counter', 'it increases', ['type']);
    // Add 3 to the series whose "type" label is "blue".
    $counter->incBy(3, ['blue']);
  28. Tips To set up a gauge:

    $registry = \Prometheus\CollectorRegistry::getDefault();
    $gauge = $registry->getOrRegisterGauge('demo', 'score', 'it sets', ['type']);
    // Set the series whose "type" label is "blue" to 2.5.
    $gauge->set(2.5, ['blue']);
  29. Tips To set up a histogram:

    $registry = \Prometheus\CollectorRegistry::getDefault();
    // The last argument lists the bucket upper bounds.
    $histogram = $registry->getOrRegisterHistogram('demo', 'secs_bucket', 'it observes', ['type'], [0.1, 1, 2, 3.5, 4, 5, 6, 7, 8, 9]);
    $histogram->observe(3.5, ['blue']);
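
To illustrate what an observation does to the histogram above: Prometheus histograms use cumulative buckets, so a value is counted in every bucket whose upper bound is greater than or equal to it, plus an implicit `+Inf` bucket. The sketch below reimplements that bookkeeping in plain PHP. The function `observeAll` is hypothetical and is not part of the client library, and the extra observations (0.05 and 12) are made up for the example.

```php
<?php
declare(strict_types=1);

/**
 * Count observations into cumulative buckets, the way a Prometheus
 * histogram exposes them (each bucket holds values <= its upper bound).
 * Hypothetical helper for illustration only.
 *
 * @param array $bounds       upper bounds, ascending
 * @param array $observations observed values
 * @return array              bucket label => cumulative count
 */
function observeAll(array $bounds, array $observations): array
{
    $counts = [];
    foreach ($bounds as $bound) {
        $counts[(string) $bound] = 0;
    }
    $counts['+Inf'] = 0;

    foreach ($observations as $value) {
        foreach ($bounds as $bound) {
            if ($value <= $bound) {
                $counts[(string) $bound]++;
            }
        }
        $counts['+Inf']++; // every observation lands in +Inf
    }
    return $counts;
}

// Same bucket bounds as the slide; 3.5 is the slide's observation.
$bounds = [0.1, 1, 2, 3.5, 4, 5, 6, 7, 8, 9];
$counts = observeAll($bounds, [3.5, 0.05, 12]);

foreach ($counts as $le => $count) {
    echo "le=\"$le\" $count", PHP_EOL;
}
```

Note that 3.5 falls into the `3.5` bucket itself because the comparison is "less than or equal", while 12 exceeds every bound and only shows up in `+Inf`.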
  30. Tips To show the metrics to be scraped:

    $registry = \Prometheus\CollectorRegistry::getDefault();
    $renderer = new \Prometheus\RenderTextFormat();
    $result = $renderer->render($registry->getMetricFamilySamples());
    // Serve the rendered samples with the MIME type Prometheus expects.
    header('Content-type: ' . \Prometheus\RenderTextFormat::MIME_TYPE);
    echo $result;
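
For orientation, what Prometheus scrapes is plain text in its exposition format: a `# HELP` line, a `# TYPE` line, then one sample per label combination. The sketch below hand-renders a single counter to show that shape. `renderCounter` is a hypothetical helper, not the library's renderer, and it assumes the exposed metric name is the namespace and name joined with an underscore (`demo_visitor_counter`).

```php
<?php
declare(strict_types=1);

/**
 * Render a single counter in the Prometheus text exposition format.
 * Hypothetical helper; the real output comes from RenderTextFormat.
 *
 * @param array $labels label name => label value
 */
function renderCounter(string $name, string $help, array $labels, float $value): string
{
    $pairs = [];
    foreach ($labels as $labelName => $labelValue) {
        // Escape backslash, double quote and newline inside label values.
        $escaped = str_replace(["\\", "\"", "\n"], ["\\\\", "\\\"", "\\n"], (string) $labelValue);
        $pairs[] = $labelName . '="' . $escaped . '"';
    }
    $labelPart = $pairs === [] ? '' : '{' . implode(',', $pairs) . '}';

    return "# HELP $name $help\n"
        . "# TYPE $name counter\n"
        . $name . $labelPart . ' ' . $value . "\n";
}

echo renderCounter('demo_visitor_counter', 'it increases', ['type' => 'blue'], 3);
```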
  31. Tips Start with the RED method (Rate, Errors, Duration). Set up the following query:

    (ud:itentity:rate_10m < bool 1000) * 100 + (ud:error:percent_10m > bool 1.5) * 10 + (ud:read:duration_p99_10m < bool 25) * 1
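
The query works because PromQL's `bool` modifier turns each comparison into 0 or 1, so the three checks can be packed into the hundreds, tens and units digits of a single number. Below is a sketch of the same arithmetic in plain PHP, with the thresholds and comparison directions copied as written in the query. `redScore` and its parameter names are hypothetical, and the sample inputs are invented.

```php
<?php
declare(strict_types=1);

/**
 * Mirror the slide's query: each comparison yields 1 or 0 (like PromQL's
 * `bool` modifier) and is weighted into its own decimal digit.
 * Hypothetical helper; thresholds are copied from the query as written.
 */
function redScore(float $rate10m, float $errorPercent10m, float $durationP99): int
{
    $rateFlag     = (int) ($rate10m < 1000);
    $errorFlag    = (int) ($errorPercent10m > 1.5);
    $durationFlag = (int) ($durationP99 < 25);

    return $rateFlag * 100 + $errorFlag * 10 + $durationFlag * 1;
}

echo redScore(1500.0, 0.2, 40.0), PHP_EOL; // no check fires
echo redScore(800.0, 2.0, 40.0), PHP_EOL;  // rate and error checks fire
echo redScore(800.0, 2.0, 10.0), PHP_EOL;  // all three checks fire
```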
  32. Tips Define a dashboard in Grafana that maps the following results:

    111 = x Rate, x Errors, x Duration
    110 = x Rate, x Errors
    101 = x Rate, x Duration
    100 = x Rate
    011 = x Errors, x Duration
    010 = x Errors
    001 = x Duration
    000 = Ok
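
The mapping above can be read digit by digit: hundreds for Rate, tens for Errors, units for Duration. Here is a minimal sketch in plain PHP that decodes a result into the flagged signals. `decodeRed` is a hypothetical helper, not part of Grafana or Prometheus; in Grafana itself the equivalent would typically be configured as value mappings on the panel.

```php
<?php
declare(strict_types=1);

/**
 * Decode the combined result (0..111) into the list of flagged RED signals.
 * Hypothetical helper for illustration.
 *
 * @return array flagged signal names, empty when everything is Ok
 */
function decodeRed(int $score): array
{
    $flagged = [];
    if (intdiv($score, 100) % 10 === 1) {
        $flagged[] = 'Rate';
    }
    if (intdiv($score, 10) % 10 === 1) {
        $flagged[] = 'Errors';
    }
    if ($score % 10 === 1) {
        $flagged[] = 'Duration';
    }
    return $flagged;
}

// Plain decimal integers: the query returns 11, not "011".
foreach ([111, 110, 101, 100, 11, 10, 1, 0] as $score) {
    $flagged = decodeRed($score);
    // Zero-pad so 11 prints as 011, matching the table.
    printf("%03d = %s\n", $score, $flagged === [] ? 'Ok' : implode(', ', $flagged));
}
```

The values are written as 11, 10 and 1 rather than 011, 010 and 001 because a leading zero makes PHP parse an integer literal as octal.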