
RubyKaigi 2019: Yabeda: monitoring monogatari

Developing a large, high-load application without monitoring is a hard and dangerous task, just like piloting an aircraft without gauges. I will show why it is so important to keep an eye on how the application's “flight” goes, which things you should care about most, and how graphs can help resolve performance problems quickly. Using Sidekiq, Prometheus, and Grafana as an example, we will learn how to collect metrics from a Ruby application and what we can see from their visualization, and I will tell a couple of sad tales about how this helps in everyday life (with a happy end!)

Andrey Novikov

April 19, 2019

Transcript

  1. Korolyov (not Houston), we have a problem.*
    * Have never actually heard such an expression. slides.com/ai/space-fails-jp
  2. Questions we need to answer
    Are we executing more jobs than before?
    Does optimization or scaling help?
    Are we ever going to drain the queues, and when?
  3. What else do we want to know?
    How quickly queues are growing and shrinking ⏳
    How long it takes to execute different job types ⏰
    When things started to go south (slowed down)
    And what is the reason, after all?
  4. What else to track to understand the causes?
    Servers: CPU load, RAM usage, IO, load average, storage usage, …
    Web: request count, processing time, queuing time, …
    Queued jobs: job counts, processing times, error rates, …
    Databases: query count, processing time, returned row count, …
    External APIs: request count, processing times, error rates, …
    Network: latency, DNS lookup time, throughput, packet drop rates, …
    Custom things: different metrics for every integration or whatever
    Business metrics: number of users, products, orders, …
  5. Ultimately, what do we want?
    More graphs for the god of graphs! …and a second, third, or even fourth display, too…
  6. 19

  7. What is monitoring, anyway?
    A continuous process of observing and recording system parameters and comparing them to predetermined criteria.
    “You can't control what you can't measure” – Tom DeMarco
  8. Metric
    A system parameter with a numerical value. Examples for Sidekiq:
    - number of jobs executed (jobs)
    - size of the queue (jobs)
    - average job processing duration (seconds)
  9. Time series
    Data points (metric values) listed or graphed in time order. (The slide shows a common visualization of a time series.)
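
    In code terms, a time series is just an ordered list of (timestamp, value) samples. A purely illustrative sketch (the timestamps and values below are made up):

    # Purely illustrative: a time series as ordered (timestamp, value) samples.
    sidekiq_jobs_waiting_count = [
      [Time.utc(2019, 4, 19, 10, 0), 1490],
      [Time.utc(2019, 4, 19, 10, 1), 1523],
      [Time.utc(2019, 4, 19, 10, 2), 1488],
    ]
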
  10. Labels
    Allow you to segment metrics by some attributes and then aggregate them back again.
    Examples: Rails requests by controller and action, Sidekiq jobs executed by job class, Sidekiq queue sizes by queue name.
    Warning: a separate time series is generated for every unique set of labels!
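
    To make the warning concrete, here is a minimal sketch assuming the pre-1.0 prometheus-client API used in the DIY slides below (the label values are illustrative): every distinct label combination becomes its own time series.

    require "prometheus/client"

    registry = Prometheus::Client.registry
    jobs = registry.counter(:sidekiq_jobs_executed_total, "Total number of jobs sidekiq executed.")

    jobs.increment({ queue: "default", worker: "WaitingJob" }) # time series no. 1
    jobs.increment({ queue: "utils",   worker: "EmptyJob" })   # time series no. 2
    jobs.increment({ queue: "default", worker: "WaitingJob" }) # still series no. 1, its value is now 2

    A label with many possible values (user IDs, full URLs, and so on) therefore multiplies the number of series very quickly.
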
  11. Counter
    Tracks the number of events that have occurred.
    Examples: number of web requests, network packets, Sidekiq jobs executed.
    CRDT: G-Counter – a monotonically increasing counter. It allows detecting application restarts and reliably aggregating values from any number of machines or processes.
    It is also the base type for more complex types like histogram.
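
    A rough, hypothetical sketch of why monotonic growth matters (the helper function and numbers are invented for illustration): a monitoring system can treat any drop in a counter's value as a restart, because a counter is never supposed to decrease.

    # Hypothetical helper: how much did a counter grow between two scrapes?
    def increase_between(previous_value, current_value)
      if current_value < previous_value
        current_value               # counter dropped => process restarted, it grew again from zero
      else
        current_value - previous_value
      end
    end

    increase_between(160, 175) # => 15  (normal growth)
    increase_between(160, 3)   # => 3   (restart detected, growth counted from zero)
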
  12. Gauges
    Track the state of the external world; there are no restrictions on how a gauge changes.
    Examples: temperature, CPU load, RAM usage, Sidekiq queue size.
  13. Histogram and summary
    Complex types for getting rough statistics about the duration of some events, such as the average time, the 95th percentile, maximums and minimums.
    Sidekiq example: job execution duration (seconds).
    Histogram: a set of counters for the number of jobs matching predefined “buckets”.
    Summary: aggregations precalculated by the client (like the 95th percentile).
    They're not precise! And many monitoring systems don't let you choose between these two types.
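
    A tiny plain-Ruby sketch of the histogram idea (the bucket boundaries mirror the ones visible in the curl output on slide 18; everything else is illustrative): every observation increments each cumulative bucket counter whose upper bound (“le”, less or equal) it fits under.

    BUCKETS = [0.5, 1, 5, 10, 60, 300, Float::INFINITY]
    bucket_counts = Hash.new(0)

    # Record one observed duration (in seconds) into every matching bucket.
    observe = ->(seconds) { BUCKETS.each { |le| bucket_counts[le] += 1 if seconds <= le } }

    [0.3, 0.7, 0.9, 42.0].each(&observe)

    bucket_counts # => {0.5=>1, 1=>3, 5=>3, 10=>3, 60=>4, 300=>4, Infinity=>4}

    Percentiles are then estimated from these bucket counters, which is why the results are approximate.
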
  14. DIY: metric registration
    prometheus = Prometheus::Client.registry
    prometheus.counter   :sidekiq_jobs_executed_total,    "Total number of jobs sidekiq executed."
    prometheus.histogram :sidekiq_job_runtime_seconds,    "A histogram of the job execution time."
    prometheus.gauge     :sidekiq_jobs_waiting_count,     "The number of jobs waiting to process in sidekiq."
    prometheus.gauge     :sidekiq_active_workers_count,   "The number of currently running machines with ..."
    prometheus.gauge     :sidekiq_jobs_scheduled_count,   "The number of jobs scheduled for later execution."
    prometheus.gauge     :sidekiq_jobs_retry_count,       "The number of failed jobs waiting to be retried"
    prometheus.gauge     :sidekiq_jobs_dead_count,        "The number of jobs exceeded their retry count."
    prometheus.gauge     :sidekiq_active_processes_count, "The number of active Sidekiq worker processes."
  15. DIY: collect job execution metrics
    class SidekiqWorkerPrometheusMiddleware
      def call(worker, job, queue)
        prometheus = Prometheus::Client.registry
        labels = { queue: queue, worker: worker.class }
        start = Time.now
        begin
          yield
          prometheus.get(:sidekiq_jobs_executed_total).increment(labels.merge(success: true))
        rescue Exception # rubocop: disable Lint/RescueException
          prometheus.get(:sidekiq_jobs_executed_total).increment(labels.merge(success: false))
          raise
        ensure
          prometheus.get(:sidekiq_job_runtime_seconds).observe(labels, (Time.now - start).round(3))
        end
      end
    end
  16. DIY: collect queue stats
    class SidekiqExporter < ::Prometheus::Middleware::Exporter
      def call(env)
        refresh_sidekiq_stats if env["REQUEST_PATH"].start_with?(path)
        super
      end

      def refresh_sidekiq_stats
        r = Prometheus.registry
        stats = ::Sidekiq::Stats.new
        stats.queues.each do |k, v|
          r.get(:sidekiq_jobs_waiting_count).set({ queue: k }, v)
        end
        r.get(:sidekiq_active_workers_count).set({}, stats.workers_size)
        r.get(:sidekiq_jobs_scheduled_count).set({}, stats.scheduled_size)
        r.get(:sidekiq_jobs_dead_count).set({}, stats.dead_size)
        # Metric name matches the gauge registered on slide 14
        r.get(:sidekiq_active_processes_count).set({}, stats.processes_size)
      end
    end
  17. DIY: export metrics to Prometheus
    Sidekiq.configure_server do |config|
      config.redis = redis_credentials
      config.server_middleware do |chain|
        chain.add SidekiqWorkerPrometheusMiddleware
      end
      # Start server to expose metrics for Prometheus collector
      Thread.new do
        Rack::Handler::WEBrick.run(
          Rack::Builder.new do
            use Rack::CommonLogger, ::Sidekiq.logger
            use SidekiqExporter, registry: Prometheus.registry # Exporter from the previous slide
            run ->(_env) { [404, { "Content-Type" => "text/plain" }, ["Not Found\n"]] }
          end,
          Host: ENV["PROMETHEUS_EXPORTER_BIND"] || "0.0.0.0",
          Port: ENV.fetch("PROMETHEUS_EXPORTER_PORT", 9310),
          AccessLog: []
        )
      end
    end
    This is needed because Prometheus works in a pull model (Prometheus collects metrics itself), while most other monitoring systems work in a push model (agents send metrics themselves).
  18. DIY: Success!
    curl http://localhost:9091/metrics
    # TYPE sidekiq_jobs_executed_total counter
    # HELP sidekiq_jobs_executed_total A counter of the total number of jobs sidekiq executed.
    sidekiq_jobs_executed_total{queue="utils",worker="EmptyJob",success="true"} 16
    sidekiq_jobs_executed_total{queue="default",worker="WaitingJob",success="true"} 160
    # TYPE sidekiq_jobs_waiting_count gauge
    # HELP sidekiq_jobs_waiting_count The number of jobs waiting to process in sidekiq.
    sidekiq_jobs_waiting_count{queue="utils"} 2190
    sidekiq_jobs_waiting_count{queue="default"} 1490
    # TYPE sidekiq_job_runtime_seconds histogram
    # HELP sidekiq_job_runtime_seconds A histogram of the job execution time.
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="0.5"} 9.0
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="1"} 16.0
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="5"} 16.0
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="10"} 16.0
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="60"} 16.0
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="300"} 16.0
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="+Inf"} 16.0
    sidekiq_job_runtime_seconds_sum{queue="utils",worker="EmptyJob"} 7.963
  19. Alerts! Let robots analyze your metrics!
    But don't create too many alerts: evaluate each alert's usefulness against the CASE methodology.
  20. Yabeda
    A set of Ruby libraries for easily collecting and exporting application metrics.
    # Gemfile
    gem "yabeda-rails"      # Yay! Standard RoR metrics already tracked!
    gem "yabeda-sidekiq"    # Yay! Sidekiq metrics are registered and tracked!
    gem "yabeda-prometheus" # Now you can export metrics to Prometheus
                            # But some configuration is still needed
    gem "yabeda-newrelic"   # And to NewRelic at the same time
  21. Yabeda: registering metrics
    Yabeda.configure do
      group :sidekiq do
        counter(:jobs_total_count,   comment: "Total number of jobs sidekiq executed.")
        counter(:jobs_success_count, comment: "Total number of jobs successfully processed by sidekiq.")
        counter(:jobs_failed_count,  comment: "Total number of jobs failed in sidekiq.")
        histogram(:job_runtime,      comment: "Job execution time.", unit: :seconds)
        gauge(:jobs_waiting_count,   comment: "The number of jobs waiting to process in sidekiq.")
      end
    end
  22. Yabeda: collecting metrics
    # Explicitly in the application code
    Yabeda.things_duration.measure({type: :foo}, 10)
    Yabeda.things_happened_total.increment({type: :foo})

    # By subscribing to some events
    ActiveSupport::Notifications.subscribe 'occurence.custom_things' do |*args|
      Yabeda.things_duration.measure({type: :foo}, 10)
      Yabeda.things_happened_total.increment({type: :foo})
    end

    # Get values only when it is time to send metrics to the monitoring system
    Yabeda.configure do
      collect do
        things_alive_total.set({}, Thing.alive.count)
      end
    end
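
    The Gemfile slide above mentions that some configuration is still needed to actually expose these metrics to Prometheus. A minimal sketch of one way to do it for a Rack/Rails app, assuming yabeda-prometheus ships a Rack exporter middleware named Yabeda::Prometheus::Exporter (check the gem's README for the exact current API); a standalone Sidekiq process would instead need its own metrics endpoint, as in the DIY slide 17.

    # config.ru — sketch, assuming the Yabeda::Prometheus::Exporter Rack middleware exists
    require_relative "config/environment"
    require "yabeda/prometheus"

    use Yabeda::Prometheus::Exporter   # serves the collected metrics at /metrics
    run Rails.application
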
  23. What's available out of the box?
    Plugins: Rails • Sidekiq • Puma (thanks to Dmitry Salahutdinov; a request queueing time metric is included)
    Adapters: Prometheus • NewRelic • DataDog (thanks to Dmitry Shvetsov)
    Want more? Search GitHub and RubyGems for Yabeda to find more gems from the community. And, as always, you can create your own and let me know! Come and tell me after this talk which plugin or adapter you would like to see made next.
  24. What does the name “Yabeda” mean?
    “Yabeda” means “tattletale” in Russian – a child who reports other children's wrongdoings to adults. Picture source: www.psychologos.ru/articles/view/yabeda
    (On the picture: Sidekiq, Yabeda, Prometheus – “Look, ma! Sidekiq is eating RAM again!”)
    Of course, this isn't Sidekiq's fault, but a side effect of memory handling in Ruby. Read part 5 and part 6 of Nate Berkopec's “Sidekiq in Practice” series on how to understand and fix memory issues.
  25. Yabeda
    A set of Ruby libraries to easily collect and export application metrics. github.com/yabeda-rb
    Check out an example yourself: github.com/yabeda-rb/example-prometheus
  26. Conclusions
    With metrics collected and visualized, debugging becomes much easier.
    Production will still break, but recovery will be much faster.
    The effectiveness of changes and optimizations can be tracked.
    Everyone can answer the question “So, how are we doing?” at a glance.
  27. evilmartians.com twitter.com/evilmartians
    More in our blogs: evilmartians.com/chronicles dev.to/evilmartians
    Please pick up Evil Martians stickers on the 2nd floor.
    Questions? Ask them! (Asking in Japanese is fine, too.) twitter.com/Envek github.com/Envek
    WE HAVE STICKERS! Please come and grab them!
    Yabeda: github.com/yabeda-rb
    Read more about metric-driven development in the articles listed on the slide.