
RubyKaigi 2019: Yabeda: monitoring monogatari

Developing a large, high-load application without monitoring is a hard and dangerous task, much like piloting an aircraft without gauges. This talk covers why it is so important to keep an eye on how the application's "flight" is going, which things you should care about most, and how graphs can help you resolve performance problems quickly. Using Sidekiq, Prometheus, and Grafana as an example, we will learn how to take metrics from a Ruby application and what we can see from their visualization, and I will tell a couple of sad tales about how this helps in everyday life (with a happy end!)

Andrey Novikov

April 19, 2019



Transcript

  1. Monitoring monogatari
    Andrey Novikov
    RubyKaigi 2019
    Fukuoka, Japan


  2. Hi, I’m Andrey and I’m from Korolyov, the Space capital of Russia.


The main ISS mission control room, Flight Control Center, Korolyov. Photo by NASA/Bill Ingalls


  4. Korolyov
    Houston, we have a problem.
    * Nobody has actually ever heard such a phrase there.
    slides.com/ai/space-fails-jp

  5.

  6.

  7. 邪悪な火星人 (Evil Martians)

  8.

  9.

  10.

  11.

  12. Meanwhile, somewhere in China
    Photo: http://fotorelax.ru/nastoyashhaya-probka-v-kitae/


  13. Korolyov, we have a problem.


  14.

  15. Questions we need to answer
    Are we executing more jobs than before?
    Does optimization or scaling help?
    Are we ever going to drain the queues, and if so, when?


  16. What else do we want to know?
    How quickly queues are growing and shrinking
    ⏳ How long it takes to execute different job types
    ⏰ When things started to go south (slowed down)
    What is the reason, after all?


  17. What else to track to understand causes?
    Servers: CPU load, RAM usage, IO, load average, storage usage, …
    Web: request count, processing time, queuing time, …
    Queued jobs: job counts, processing times, error rates, …
    Databases: query counts, processing time, returned row counts, …
    External APIs: request counts, processing times, error rates, …
    Network: latency, DNS lookup time, throughput, packet drop rates, …
    Custom things: different metrics for every integration, or whatever else you need
    Business metrics: number of users, products, orders, …


  18. Ultimately, what do we want?
    More graphs for the god of graphs!
    …and a second, third, or even fourth display, too…


  19.


  20. Tu-144 airplane cockpit
    The Tu-144 was the Soviet rival of the Concorde


  21. Now we need to learn some theory


  22. What is monitoring, anyway?
    The continuous process of observing and recording system parameters
    and comparing them against predetermined criteria.
    “You can't control what you can't measure”
    – Tom DeMarco


  23. Metric
    A system parameter with a numerical value.
    Examples for Sidekiq:
    - number of jobs executed (jobs)
    - size of the queue (jobs)
    - average job processing duration (seconds)


  24. Time series
    Data points (metric values) listed or graphed in time order.
    Common visualization of time series:


  25. Labels
    Labels let you segment metrics by attributes and aggregate them back together.
    Examples: Rails requests by controller and action, Sidekiq jobs executed by job
    class, Sidekiq queue sizes by queue name.
    Warning: a separate time series is generated for every unique set of labels!
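    To make labels concrete, here is a minimal sketch using the prometheus-client 0.8 API that
    the DIY slides below rely on; the metric and label names are made up for illustration:

    require "prometheus/client"

    registry = Prometheus::Client.registry
    # Hypothetical counter, registered once at application boot
    executed = registry.counter(:jobs_executed_total, "Total number of jobs executed.")

    # Each unique combination of label values becomes a separate time series...
    executed.increment(queue: "default", worker: "WaitingJob")
    executed.increment(queue: "utils",   worker: "EmptyJob")
    # ...which the monitoring system can aggregate back together later,
    # e.g. summing over all workers per queue.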


  26. Counter
    Tracks the number of events that occurred.
    Examples: number of web requests, network packets, Sidekiq jobs executed
    CRDT: G-Counter – a monotonically increasing counter. It makes it possible to detect
    application restarts and to reliably aggregate values from any number of machines
    or processes.
    The base type for more complex types like the histogram.


  27. Gauge
    Tracks the state of the external world. There are no restrictions on how it changes.
    Examples: temperature, CPU load, RAM usage, Sidekiq queue size


  28. Histogram and summary
    Complex types for getting rough statistics about the duration of events,
    such as average time, 95th percentile, maximums, and minimums.
    Sidekiq example: job execution duration (seconds)
    Histogram: a set of counters for the number of jobs falling into predefined “buckets”
    Summary: aggregations precalculated by the client (like the 95th percentile)
    They’re not precise!
    Many monitoring systems don’t let you choose between these types.
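    To make the bucket idea concrete, here is a sketch with the same prometheus-client 0.8 API;
    the bucket boundaries and metric name are illustrative only:

    require "prometheus/client"

    registry = Prometheus::Client.registry
    # Hypothetical histogram with explicit bucket upper bounds (in seconds)
    runtime = registry.histogram(:job_runtime_seconds,
                                 "A histogram of job execution time.",
                                 {},
                                 [0.5, 1, 5, 10, 60, 300])

    # Conceptually, every observation counts towards all buckets whose upper bound
    # is greater than or equal to the observed value (plus the implicit +Inf bucket),
    # so percentiles can later be estimated from the bucket counters.
    runtime.observe({ queue: "utils", worker: "EmptyJob" }, 0.42)
    runtime.observe({ queue: "utils", worker: "EmptyJob" }, 7.3)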


  29. That’s all for the theory! Now let’s get to the real work!


  30. Do It Yourself with Prometheus
    # Gemfile
    gem "prometheus-client", "~> 0.8"


  31. DIY: metric registration
    prometheus = Prometheus::Client.registry
    prometheus.counter :sidekiq_jobs_executed_total, "Total number of jobs sidekiq executed."
    prometheus.histogram :sidekiq_job_runtime_seconds, "A histogram of the job execution time."
    prometheus.gauge :sidekiq_jobs_waiting_count, "The number of jobs waiting to process in sidekiq."
    prometheus.gauge :sidekiq_active_workers_count, "The number of currently running machines with ..."
    prometheus.gauge :sidekiq_jobs_scheduled_count, "The number of jobs scheduled for later execution."
    prometheus.gauge :sidekiq_jobs_retry_count, "The number of failed jobs waiting to be retried"
    prometheus.gauge :sidekiq_jobs_dead_count, "The number of jobs exceeded their retry count."
    prometheus.gauge :sidekiq_active_processes_count, "The number of active Sidekiq worker processes."


  32. DIY: collect job execution metrics
    class SidekiqWorkerPrometheusMiddleware
      def call(worker, job, queue)
        prometheus = Prometheus::Client.registry
        labels = { queue: queue, worker: worker.class }
        start = Time.now
        begin
          yield
          prometheus.get(:sidekiq_jobs_executed_total).increment(labels.merge(success: true))
        rescue Exception # rubocop: disable Lint/RescueException
          prometheus.get(:sidekiq_jobs_executed_total).increment(labels.merge(success: false))
          raise
        ensure
          prometheus.get(:sidekiq_job_runtime_seconds).observe(labels, (Time.now - start).round(3))
        end
      end
    end


  33. DIY: collect queue stats
    class SidekiqExporter < ::Prometheus::Middleware::Exporter
      def call(env)
        refresh_sidekiq_stats if env["REQUEST_PATH"].start_with?(path)
        super
      end

      def refresh_sidekiq_stats
        r = ::Prometheus::Client.registry
        stats = ::Sidekiq::Stats.new
        stats.queues.each do |k, v|
          r.get(:sidekiq_jobs_waiting_count).set({ queue: k }, v)
        end
        r.get(:sidekiq_active_workers_count).set({}, stats.workers_size)
        r.get(:sidekiq_jobs_scheduled_count).set({}, stats.scheduled_size)
        r.get(:sidekiq_jobs_dead_count).set({}, stats.dead_size)
        r.get(:sidekiq_active_processes_count).set({}, stats.processes_size)
      end
    end


  34. DIY: Export metrics to Prometheus
    Sidekiq.configure_server do |config|
      config.redis = redis_credentials
      config.server_middleware do |chain|
        chain.add SidekiqWorkerPrometheusMiddleware
      end
      # Start a server to expose metrics to the Prometheus collector
      Thread.new do
        Rack::Handler::WEBrick.run(
          Rack::Builder.new do
            use Rack::CommonLogger, ::Sidekiq.logger
            use SidekiqExporter, registry: ::Prometheus::Client.registry # Exporter from the previous slide
            run ->(_env) { [404, { "Content-Type" => "text/plain" }, ["Not Found\n"]] }
          end,
          Host: ENV["PROMETHEUS_EXPORTER_BIND"] || "0.0.0.0",
          Port: ENV.fetch("PROMETHEUS_EXPORTER_PORT", 9310),
          AccessLog: []
        )
      end
    end
    This is needed because Prometheus works in a pull model (Prometheus collects metrics itself),
    while most other monitoring systems work in a push model (agents send metrics themselves).


  35. DIY: Success!
    curl http://localhost:9091/metrics
    # TYPE sidekiq_jobs_executed_total counter
    # HELP sidekiq_jobs_executed_total A counter of the total number of jobs sidekiq executed.
    sidekiq_jobs_executed_total{queue="utils",worker="EmptyJob",success="true"} 16
    sidekiq_jobs_executed_total{queue="default",worker="WaitingJob",success="true"} 160
    # TYPE sidekiq_jobs_waiting_count gauge
    # HELP sidekiq_jobs_waiting_count The number of jobs waiting to process in sidekiq.
    sidekiq_jobs_waiting_count{queue="utils"} 2190
    sidekiq_jobs_waiting_count{queue="default"} 1490
    # TYPE sidekiq_job_runtime_seconds histogram
    # HELP sidekiq_job_runtime_seconds A histogram of the job execution time.
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="0.5"} 9.0
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="1"} 16.0
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="5"} 16.0
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="10"} 16.0
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="60"} 16.0
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="300"} 16.0
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="+Inf"} 16.0
    sidekiq_job_runtime_seconds_sum{queue="utils",worker="EmptyJob"} 7.963


  36. DIY: Visualize in Grafana


  37. Easy, right?


  38.

  39.

  40. So what?
    Finally, it’s time for the monogatari!


  41.

  42.

  43. Korolyov, we have a problem.


  44.

  45.

  46. What about the cause?
    It was outside of Sidekiq and Ruby. It was in the database.


  47. Debugging time – just a few minutes!


  48. Alerts! Let robots analyze your metrics!
    But don’t create too many alerts. Evaluate each alert’s usefulness against the CASE methodology.


  49. Korolyov, we have a problem.



  50. Cool! But can it be done more easily?


  51. Yabeda
    A set of Ruby libraries for easily collecting and exporting application metrics.
    # Gemfile
    gem "yabeda-rails"
    # Yay! Standard RoR metrics already tracked!
    gem "yabeda-sidekiq"
    # Yay! Sidekiq metrics are registered and tracked!
    gem "yabeda-prometheus"
    # Now you can export metrics to Prometheus
    # But some configuration is still needed
    gem "yabeda-newrelic"
    # And to NewRelic at the same time
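    For Sidekiq processes, the “some configuration is still needed” part could look roughly like
    the sketch below; the exporter API (Yabeda::Prometheus::Exporter, start_metrics_server!) is
    assumed from the yabeda-prometheus README and should be checked against your version:

    # A sketch only: start a small HTTP endpoint inside the Sidekiq process
    # so Prometheus can scrape it (pull model), replacing the hand-rolled
    # WEBrick thread from slide 34.
    Sidekiq.configure_server do |config|
      config.on(:startup) do
        Yabeda::Prometheus::Exporter.start_metrics_server!
      end
    end

    # In a Rack/Rails web process, the exporter can instead be mounted as middleware:
    # use Yabeda::Prometheus::Exporter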


  52. Yabeda: registering metrics
    Yabeda.configure do
      group :sidekiq do
        counter(:jobs_total_count, comment: "Total number of jobs sidekiq executed.")
        counter(:jobs_success_count, comment: "Total number of jobs successfully processed by sidekiq.")
        counter(:jobs_failed_count, comment: "Total number of jobs failed in sidekiq.")
        histogram(:job_runtime, comment: "Job execution time.", unit: :seconds)
        gauge(:jobs_waiting_count, comment: "The number of jobs waiting to process in sidekiq.")
      end
    end


  53. Yabeda: collecting metrics
    # Explicitly in the application code
    Yabeda.things_duration.measure({ type: :foo }, 10)
    Yabeda.things_happened_total.increment({ type: :foo })

    # By subscribing to some events
    ActiveSupport::Notifications.subscribe 'occurence.custom_things' do |*args|
      Yabeda.things_duration.measure({ type: :foo }, 10)
      Yabeda.things_happened_total.increment({ type: :foo })
    end

    # Collect values only when it is time to send metrics to the monitoring system
    Yabeda.configure do
      collect do
        things_alive_total.set({}, Thing.alive.count)
      end
    end
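    As a concrete (hypothetical) example, the jobs_waiting_count gauge from slide 52 could be
    filled in such a collect block; yabeda-sidekiq ships similar logic, so normally you do not
    write this yourself:

    # A sketch only: fill the gauge registered on the previous slide from Sidekiq's
    # own stats each time the monitoring system pulls metrics. It assumes grouped
    # metrics are addressable by their prefixed name, like the ungrouped ones above.
    Yabeda.configure do
      collect do
        ::Sidekiq::Stats.new.queues.each do |queue_name, size|
          sidekiq_jobs_waiting_count.set({ queue: queue_name }, size)
        end
      end
    end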


  54. What’s available out of the box?
    Plugins
    ● Rails
    ● Sidekiq
    ● Puma (thanks to Dmitry Salahutdinov)
    (request queueing time metric included)
    Adapters
    ● Prometheus
    ● NewRelic
    ● DataDog (thanks to Dmitry Shvetsov)
    Want more?
    Search GitHub and RubyGems for Yabeda and find more gems from the community.
    And, as always, you can create your own and let me know!
    Come and tell me after this talk which plugin or adapter you would like to see made next.


  55. What does the name “Yabeda” mean?
    “Yabeda” is Russian for “tattletale”: a child who reports other children’s wrongdoings to adults.
    「ヤベダ」は他の子供のいたずらを大人に言い付ける子供です。 (“Yabeda” is a child who tells adults about other children’s mischief.)
    Picture source: www.psychologos.ru/articles/view/yabeda
    Sidekiq → Yabeda → Prometheus
    “Look, ma! Sidekiq is eating RAM again!”
    Of course, this isn’t Sidekiq’s fault, but a side effect of memory handling in Ruby.
    Read parts 5 and 6 of the “Sidekiq in Practice” series by Nate Berkopec to learn how to
    understand and fix memory issues.


  56. Yabeda
    A set of Ruby libraries to easily collect and export application metrics.
    github.com/yabeda-rb
    Check an example yourself:
    github.com/yabeda-rb/example-prometheus


  57. Conclusions
    With metrics collected and visualized, debugging becomes much easier
    Production will still break, but recovery will be much faster
    The effectiveness of changes and optimizations can be tracked
    Everyone can answer the question “So, how are we doing?” at a glance


  58. evilmartians.com
    twitter.com/evilmartians
    More in our blogs:
    evilmartians.com/chronicles
    dev.to/evilmartians
    Please grab Evil Martians stickers on the second floor. (Evil Martiansのステッカーを2階で取ってください)
    Questions? Ask them! Japanese is fine, too. (日本語でもいいです。)
    twitter.com/Envek
    github.com/Envek
    WE HAVE STICKERS! Please come and grab’em!
    Yabeda: github.com/yabeda-rb
    Read more about metric-driven development in these articles:
