
Improving Observability with Prometheus (Darkmira Tour PHP 2020)

Talk presented online on December 13th at Darkmira Tour PHP 2020 (https://php.darkmiratour.rocks/2020/schedule.html). Demo available at https://github.com/wsilva/darkmira-prometheus-php-demo

Wellington F. Silva

December 13, 2020

Transcript

  1. Improving
    Observability with
    Prometheus
    Darkmira Tour PHP 2020


  2. Wellington F. Silva
    contact:
    @_wsilva
    nicks:
    wsilva, boina, tom, fisi*
    Roles:
    father, husband, telecom technician,
    programmer, sysadmin,
    Docker community leader,
    instructor, writer, Zend
    Certified Engineer, Docker
    Certified Associate, and Certified
    Kubernetes Administrator
    * in deprecation


  3. Agenda
    • Observability
    • Monitoring
    • Prometheus
    • Tips


  4. Observability


  5. Observability
    Definition
    • https://dictionary.cambridge.org/us/dictionary/english/observability
    • https://www.oxfordlearnersdictionaries.com/spellcheck/english/?q=observability


  6. Observability
    Definition Cambridge:


  7. Observability
    Definition Oxford:


  8. Observability
    Definition:
    ¯\_(ツ)_/¯


  9. Observability
    Definition:
    Observe + Ability


  10. Observability
    3 pillars:


  11. Observability
    3 pillars:
    • Metrics


  12. Observability
    3 pillars:
    • Metrics
    • Logging


  13. Observability
    3 pillars:
    • Metrics
    • Logging
    • Tracing


  14. Observability
    3 pillars:
    • Metrics
    • Logging
    • Tracing
    • Events


  15. Observability
    3 pillars:
    • Metrics
    • Logging
    • Tracing
    • Events (kind of new) - MELT, or 4 golden
    signals


  16. Observability
    advantages


  17. Observability
    • Better deployments


  18. Observability
    • Better deployments
    • Improve time to market


  19. Observability
    • Better deployments
    • Improve time to market
    • Less toil


  20. Observability
    • Better deployments
    • Improve time to market
    • Less toil
    • Avoid premature optimisation


  21. Observability
    • Better deployments
    • Improve time to market
    • Less toil
    • Avoid premature optimisation
    • Improve resource utilisation


  22. Observability
    • Better deployments
    • Improve time to market
    • Less toil
    • Avoid premature optimisation
    • Improve resource utilisation
    • Lower costs


  23. Observability
    disadvantages


  24. Observability
    • Demands effort in coding and configuring


  25. Observability
    • Demands effort in coding and configuring
    • Could extend time to delivery


  26. Observability
    • Demands effort in coding and configuring
    • Could extend time to delivery
    • Constantly neglected


  27. Monitoring


  28. Monitoring
    • Subset of observability


  29. Monitoring
    • Subset of observability
    • Shows the points where we start to dig


  30. Monitoring
    • Subset of observability
    • Shows the points where we start to dig
    • Makes it easier and faster to find bottlenecks


  31. Monitoring
    What metrics should we track?


  32. Monitoring
    What metrics should we track?
    ALL


  33. Monitoring
    https://www.aeroflap.com.br/uma-analise-evolucao-do-boeing-737/


  34. Monitoring
    Issue: the more metrics, the harder they are to analyse


  35. Monitoring
    Issue: the more metrics, the harder they are to analyse
    MELT becomes a mess


  36. Monitoring
    http://aeroplanewallpaper.blogspot.com


  37. Monitoring
    Focus more on the most important things.


  38. Monitoring
    RED Method:


  39. Monitoring
    RED Method:
    • Rate


  40. Monitoring
    RED Method:
    • Rate
    • Errors


  41. Monitoring
    RED Method:
    • Rate
    • Errors
    • Duration


  42. Monitoring
    RED Method:
    • Rate
    • Errors
    • Duration
    • Saturation


  43. Monitoring
    RED Method:
    • Rate
    • Errors
    • Duration
    • Saturation (again this joke?)
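
As a rough illustration of the three RED signals (Rate, Errors, Duration), the sketch below derives them from a hand-made list of request records in plain PHP. The function name `redSummary`, the record fields `status` and `seconds`, and the 60-second window are hypothetical and do not come from the talk or the demo; in a real setup Prometheus would compute these values from scraped metrics.

```php
<?php
declare(strict_types=1);

/**
 * Summarise a window of requests into the three RED signals.
 * Hypothetical helper: field names and window length are assumptions.
 *
 * @param array<int, array{status: int, seconds: float}> $requests
 * @return array{rate: float, errors: float, duration: float}
 */
function redSummary(array $requests, float $windowSeconds): array
{
    $total = count($requests);
    if ($total === 0) {
        return ['rate' => 0.0, 'errors' => 0.0, 'duration' => 0.0];
    }

    // Errors: number of requests answered with a 5xx status.
    $failed = count(array_filter($requests, fn (array $r): bool => $r['status'] >= 500));

    // Duration: slowest request in the window (a real setup would use a percentile).
    $slowest = max(array_column($requests, 'seconds'));

    return [
        'rate'     => $total / $windowSeconds, // requests per second
        'errors'   => $failed / $total,        // ratio between 0 and 1
        'duration' => (float) $slowest,        // seconds
    ];
}

$window = [
    ['status' => 200, 'seconds' => 0.12],
    ['status' => 200, 'seconds' => 0.30],
    ['status' => 500, 'seconds' => 1.50],
    ['status' => 200, 'seconds' => 0.08],
];

$red = redSummary($window, 60.0);
printf(
    "rate=%.3f req/s errors=%.0f%% duration=%.2fs\n",
    $red['rate'],
    $red['errors'] * 100,
    $red['duration']
);
```

Using the maximum as the duration keeps the sketch short; a p99 over many requests would be closer to what a dashboard normally shows.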


  44. Monitoring
    Google’s SRE Book Way:


  45. Monitoring
    Google’s SRE Book Way:
    • SLI
    • SLO
    • SLA


  46. Monitoring
    SLI - Service Level Indicators


  47. Monitoring
    SLI - Service Level Indicator
    Depends on the team:


  48. Monitoring
    SLI - Service Level Indicator
    Depends on the team:
    • ops: cpu, memory, disk io, networking, nodes
    available, pods running, messages on queue


  49. Monitoring
    SLI - Service Level Indicator
    Depends on the team:
    • ops: cpu, memory, disk io, networking, nodes
    available, pods running, messages on queue
    • devs: response time, requests per second


  50. Monitoring
    SLI - Service Level Indicator
    Depends on the team:
    • ops: cpu, memory, disk io, networking, nodes
    available, pods running, messages on queue
    • devs: response time, requests per second
    • data engineers: time to run an ETL job, how
    much data is being processed, the freshness
    of the data


  51. Monitoring
    SLO - Service Level Objectives


  52. Monitoring
    SLO - Service Level Objectives
    • We should involve the customer to help define it


  53. Monitoring
    SLO - Service Level Objectives
    • We should involve the customer to help define it
    • Breaches must alert the team


  54. Monitoring
    SLO - Service Level Objectives
    • We should involve the customer to help define it
    • Breaches must alert the team
    • Use realistic objectives


  55. Monitoring
    SLO - Service Level Objectives
    • We should involve the customer to help define it
    • Breaches must alert the team
    • Use realistic objectives
    • Reevaluate the values periodically


  56. Monitoring
    SLA - Service Level Agreement


  57. Monitoring
    SLA - Service Level Agreement
    • Should be looser than the SLO, so that an SLO
    breach alerts the team before the SLA is breached


  58. Monitoring
    SLA - Service Level Agreement
    • Should be looser than the SLO, so that an SLO
    breach alerts the team before the SLA is breached
    • Pay attention to the agreement and honor it
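
The relationship between SLO and SLA can be sketched with some downtime arithmetic. The example below assumes availability targets, where the internal SLO (99.9%) is stricter than the contractual SLA (99.5%), measured over a 30-day window. These numbers and the function names `allowedDowntimeMinutes` and `classify` are illustrative assumptions, not values from the talk.

```php
<?php
declare(strict_types=1);

/** Minutes of downtime that an availability target tolerates in a window. */
function allowedDowntimeMinutes(float $target, int $windowDays): float
{
    return (1.0 - $target) * $windowDays * 24 * 60;
}

/** Classify observed downtime against the internal SLO and the contractual SLA. */
function classify(float $downtimeMinutes, float $slo, float $sla, int $windowDays): string
{
    if ($downtimeMinutes > allowedDowntimeMinutes($sla, $windowDays)) {
        return 'SLA breached';
    }
    if ($downtimeMinutes > allowedDowntimeMinutes($slo, $windowDays)) {
        return 'SLO breached: alert the team';
    }
    return 'ok';
}

// Assumed targets: SLO 99.9%, SLA 99.5%, both over 30 days.
$slo = 0.999;
$sla = 0.995;

foreach ([20.0, 60.0, 300.0] as $downtime) {
    printf("%5.1f min down: %s\n", $downtime, classify($downtime, $slo, $sla, 30));
}
```

With these assumed targets the SLO tolerates roughly 43 minutes of downtime and the SLA roughly 216, so the gap between the two is the time the team has to react after the SLO alert fires.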


  59. Prometheus


  60. Prometheus
    From the Greek Promēthéus,
    "forethought". He is a titan
    (second generation), son of
    Iapetus (son of Uranus; an
    incest between Uranus and
    Gaia) and brother of Atlas,
    Epimetheus and Menoetius.
    He was a defender of
    humanity, responsible for
    stealing Hestia's fire and giving
    it to mortals.


  61. Prometheus
    • Metrics platform


  62. Prometheus
    • Metrics platform
    • Started in 2012 at SoundCloud


  63. Prometheus
    • Metrics platform
    • Started in 2012 at SoundCloud
    • Open-sourced and published in 2015


  64. Prometheus
    • Metrics platform
    • Started in 2012 at SoundCloud
    • Open-sourced and published in 2015
    • Second project under CNCF (Cloud Native
    Computing Foundation)


  65. Prometheus
    • Metrics platform
    • Started in 2012 at SoundCloud
    • Open-sourced and published in 2015
    • Second project under CNCF (Cloud Native
    Computing Foundation)
    • Can also fire and manage alerts


  66. Prometheus
    • Metrics platform
    • Started in 2012 at SoundCloud
    • Open-sourced and published in 2015
    • Second project under CNCF (Cloud Native
    Computing Foundation)
    • Can also fire and manage alerts
    • Stores metrics in a time series database (TSDB)


  67. Prometheus
    • Pull based model (scale the exporter)


  68. Prometheus
    • Pull based model (scale the exporter)
    • Good for telemetry metrics and statistical
    metrics


  69. Prometheus
    • Pull based model (scale the exporter)
    • Good for telemetry metrics and statistical
    metrics
    • Known alternatives: graphite / collectd /
    carbon, zabbix (all push based)


  70. Prometheus
    Disadvantages:
    • Not easy to scale horizontally


  71. Prometheus
    Disadvantages:
    • Not easy to scale horizontally
    • No query cache


  72. Prometheus
    Disadvantages:
    • Not easy to scale horizontally
    • No query cache
    • PromQL instead of regular SQL


  73. Prometheus
    Advantages:
    • Written in Go


  74. Prometheus
    Advantages:
    • Written in Go
    • HTTP-based communication


  75. Prometheus
    Advantages:
    • Written in Go
    • HTTP-based communication
    • Service discovery integration (Kubernetes,
    Swarm, Consul, AWS, GCP, etc.)


  76. Prometheus
    Advantages:
    • Written in Go
    • HTTP-based communication
    • Service discovery integration (Kubernetes,
    Swarm, Consul, AWS, GCP, etc.)
    • Dashboard for alert management


  77. Prometheus
    Advantages:
    • Written in Go
    • HTTP-based communication
    • Service discovery integration (Kubernetes,
    Swarm, Consul, AWS, GCP, etc.)
    • Dashboard for alert management
    • Dashboard for query debugging


  78. Prometheus
    Advantages:
    • Multidimensional data model


  79. Prometheus
    Advantages:
    • Multidimensional data model
    • Easy to set up with Grafana


  80. Prometheus
    Advantages:
    • Multidimensional data model
    • Easy to set up with Grafana
    • PromQL (kind of functional style, powerful for
    calculations)


  81. Prometheus


  82. Tips


  83. Tips
    Start with https://github.com/endclothing/prometheus_client_php
    Package jimdo/prometheus_client_php is abandoned
    $ composer require endclothing/prometheus_client_php


  84. Tips
    To set up a counter
    $registry = \Prometheus\CollectorRegistry::getDefault();
    $counter = $registry->getOrRegisterCounter('demo', 'visitor_counter', 'it increases', ['type']);
    $counter->incBy(3, ['blue']);


  85. Tips
    To set up a gauge
    $registry = \Prometheus\CollectorRegistry::getDefault();
    $gauge = $registry->getOrRegisterGauge('demo', 'score', 'it sets', ['type']);
    $gauge->set(2.5, ['blue']);


  86. Tips
    To set up a histogram
    $registry = \Prometheus\CollectorRegistry::getDefault();
    $histogram = $registry->getOrRegisterHistogram('demo', 'secs_bucket', 'it observes', ['type'], [0.1, 1, 2, 3.5, 4, 5, 6, 7, 8, 9]);
    $histogram->observe(3.5, ['blue']);
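
Prometheus histogram buckets are cumulative: an observation counts toward every bucket whose upper bound is greater than or equal to the observed value, plus the implicit `+Inf` bucket. The plain-PHP sketch below mimics that counting for the bucket list and the `observe(3.5, ...)` call above. It is not how the client library stores data internally, and the helper name `observeInto` is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Count one observation into cumulative buckets keyed by upper bound.
 * Illustrative only: this is not the client library's storage code.
 *
 * @param array<int|string, int> $counts bucket upper bound => cumulative count
 * @param array<int, int|float>  $bounds bucket upper bounds in ascending order
 * @return array<int|string, int>
 */
function observeInto(array $counts, array $bounds, float $value): array
{
    foreach ($bounds as $bound) {
        $key = (string) $bound;
        // A bucket counts the observation when the value fits under its bound.
        $counts[$key] = ($counts[$key] ?? 0) + ($value <= $bound ? 1 : 0);
    }
    // The implicit +Inf bucket counts every observation.
    $counts['+Inf'] = ($counts['+Inf'] ?? 0) + 1;

    return $counts;
}

$bounds = [0.1, 1, 2, 3.5, 4, 5, 6, 7, 8, 9];
$counts = observeInto([], $bounds, 3.5);

foreach ($counts as $le => $count) {
    echo "le=\"{$le}\" {$count}\n";
}
```

Because the buckets are cumulative, choosing bounds close to the thresholds you care about (for example an SLO latency target) makes later percentile estimates more useful.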


  87. Tips
    To show the metrics to be scraped
    use Prometheus\RenderTextFormat;
    $registry = \Prometheus\CollectorRegistry::getDefault();
    $renderer = new RenderTextFormat();
    $result = $renderer->render($registry->getMetricFamilySamples());
    header('Content-type: ' . RenderTextFormat::MIME_TYPE);
    echo $result;
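
To give an idea of what a scrape of that endpoint returns, the sketch below hand-renders a single counter in the Prometheus text exposition format using plain PHP. The helper `renderCounter` is hypothetical and much simpler than the library's renderer (it does not escape label values), and the metric name `demo_visitor_counter` assumes the namespace and name are joined with an underscore.

```php
<?php
declare(strict_types=1);

/**
 * Render one counter in the Prometheus text exposition format.
 * Hypothetical helper for illustration; label values are not escaped here.
 *
 * @param array<int, array{labels: array<string, string>, value: float}> $samples
 */
function renderCounter(string $name, string $help, array $samples): string
{
    // Metadata lines describing the metric family.
    $lines = ["# HELP {$name} {$help}", "# TYPE {$name} counter"];

    foreach ($samples as $sample) {
        $pairs = [];
        foreach ($sample['labels'] as $label => $value) {
            $pairs[] = sprintf('%s="%s"', $label, $value);
        }
        // Samples without labels are written without the braces.
        $labelSet = $pairs === [] ? '' : '{' . implode(',', $pairs) . '}';
        $lines[] = "{$name}{$labelSet} {$sample['value']}";
    }

    return implode("\n", $lines) . "\n";
}

echo renderCounter('demo_visitor_counter', 'it increases', [
    ['labels' => ['type' => 'blue'], 'value' => 3.0],
]);
```

Each label combination becomes its own sample line, which is why label values with many distinct values can quickly inflate the number of time series.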


  88. Tips
    Start with the RED method
    Set up the following query
    (ud:itentity:rate_10m < bool 1000) * 100 +
    (ud:error:percent_10m > bool 1.5) * 10 +
    (ud:read:duration_p99_10m < bool 25) * 1


  89. Tips
    Define a dashboard in Grafana that maps the following
    results:
    111 = x Rate, x Errors, x Duration
    110 = x Rate, x Errors
    101 = x Rate, x Duration
    100 = x Rate
    011 = x Errors, x Duration
    010 = x Errors
    001 = x Duration
    000 = Ok
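
The mapping above can be decoded mechanically, since each digit is a 0/1 flag for one signal. The following plain-PHP sketch does that; the function name `failingSignals` is hypothetical, and it assumes the hundreds digit stands for Rate, the tens digit for Errors, and the units digit for Duration, as in the table.

```php
<?php
declare(strict_types=1);

/**
 * Decode the three-digit query result into the list of failing signals.
 * Hypothetical helper; digit positions follow the mapping above
 * (hundreds = Rate, tens = Errors, units = Duration).
 *
 * @return string[] empty when everything is ok
 */
function failingSignals(int $code): array
{
    $weights = ['Rate' => 100, 'Errors' => 10, 'Duration' => 1];
    $failing = [];

    foreach ($weights as $signal => $weight) {
        // Each digit is either 0 or 1, so integer division isolates it.
        if (intdiv($code, $weight) % 10 === 1) {
            $failing[] = $signal;
        }
    }

    return $failing;
}

foreach ([0, 1, 10, 101, 111] as $code) {
    $failing = failingSignals($code);
    printf("%03d = %s\n", $code, $failing === [] ? 'Ok' : implode(', ', $failing));
}
```

In Grafana the same decoding is usually expressed as value mappings on a single-stat panel rather than in application code.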


  90. Demo


  91. Demo
    Available at:
    https://github.com/wsilva/darkmira-prometheus-php-demo


  92. Thank You !
    Slides: https://speakerdeck.com/wsilva
