
RubyKaigi 2019: Yabeda: monitoring monogatari

Developing a large, high-load application without monitoring is a hard and dangerous task, much like piloting an aircraft without gauges. This talk covers why it is so important to keep an eye on how the application's "flight" is going, which things you should care about most, and how graphs can help you resolve performance problems quickly. Using Sidekiq, Prometheus, and Grafana as an example, we will learn how to take metrics from a Ruby application and what we can see from their visualization, and I will tell a couple of sad tales about how this helps in everyday life (with a happy end!)

Andrey Novikov

April 19, 2019



Transcript

  1. Monitoring monogatari
    Andrey Novikov
    RubyKaigi 2019
    Fukuoka, Japan


  2. Hi, I’m Andrey and I’m from Korolyov, the Space capital of Russia.


The main ISS mission control room, Flight Control Center, Korolyov. Photo by NASA/Bill Ingalls


  4. Korolyov
    Houston, we have a problem.
    * Nobody has actually ever heard such a phrase there.
    slides.com/ai/space-fails-jp

  5.

  6.

  7. 邪悪な火星人 (Evil Martians)

  8.

  9.

  10.

  11.

  12. Meanwhile, somewhere in China
    Photo: http://fotorelax.ru/nastoyashhaya-probka-v-kitae/


  13. Korolyov, we have a problem.


  14.

  15. Questions we need to answer
    Are we executing more jobs than before?
    Does optimization or scaling help?
    Are we ever going to drain the queues, and if so, when?


  16. What else do we want to know?
    How quickly queues are growing and shrinking
    ⏳ How long it takes to execute different job types
    ⏰ When things started to go south (slowed down)
    What is the reason, after all?


  17. What else to track to understand causes?
    Servers: CPU load, RAM usage, IO, load average, storage usage, …
    Web: request count, processing time, queuing time, …
    Queued jobs: job counts, processing times, error rates, …
    Databases: query counts, processing time, returned row counts, …
    External APIs: request counts, processing times, error rates, …
    Network: latency, DNS lookup time, throughput, packet drop rates, …
    Custom things: different metrics for every integration, or whatever else you need
    Business metrics: number of users, products, orders, …


  18. Ultimately, what do we want?
    More graphs for the god of graphs!
    …and a second, third, or even fourth display, too…


  19.


  20. Tu-144 airplane cockpit
    The Tu-144 was the Soviet rival of the Concorde


  21. Now we need to learn some theory


  22. What is monitoring, anyway?
    The continuous process of observing and recording system parameters
    and comparing them against predetermined criteria.
    “You can't control what you can't measure”
    – Tom DeMarco


  23. Metric
    A system parameter with a numerical value.
    Examples for Sidekiq:
    - number of jobs executed (jobs)
    - size of the queue (jobs)
    - average job processing duration (seconds)


  24. Time series
    Data points (metric values) listed or graphed in time order.
    Common visualization of time series:


  25. Labels
    Labels let you segment metrics by attributes and aggregate them back together.
    Examples: Rails requests by controller and action, Sidekiq jobs executed by job
    class, Sidekiq queue sizes by queue name.
    Warning: a separate time series is generated for every unique set of labels!
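    To make labels concrete, here is a minimal sketch using the prometheus-client 0.8 API that
    the DIY slides below rely on; the metric and label names are made up for illustration:

    require "prometheus/client"

    registry = Prometheus::Client.registry
    # Hypothetical counter, registered once at application boot
    executed = registry.counter(:jobs_executed_total, "Total number of jobs executed.")

    # Each unique combination of label values becomes a separate time series...
    executed.increment(queue: "default", worker: "WaitingJob")
    executed.increment(queue: "utils",   worker: "EmptyJob")
    # ...which the monitoring system can aggregate back together later,
    # e.g. summing over all workers per queue.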


  26. Counter
    Tracks the number of events that occurred.
    Examples: number of web requests, network packets, Sidekiq jobs executed
    CRDT: G-Counter – a monotonically increasing counter. It makes it possible to detect
    application restarts and to reliably aggregate values from any number of machines
    or processes.
    The base type for more complex types like the histogram.


  27. Gauge
    Tracks the state of the external world. There are no restrictions on how it changes.
    Examples: temperature, CPU load, RAM usage, Sidekiq queue size


  28. Histogram and summary
    Complex types for getting rough statistics about the duration of events,
    such as average time, 95th percentile, maximums, and minimums.
    Sidekiq example: job execution duration (seconds)
    Histogram: a set of counters for the number of jobs falling into predefined “buckets”
    Summary: aggregations precalculated by the client (like the 95th percentile)
    They’re not precise!
    Many monitoring systems don’t let you choose between these types.
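    To make the bucket idea concrete, here is a sketch with the same prometheus-client 0.8 API;
    the bucket boundaries and metric name are illustrative only:

    require "prometheus/client"

    registry = Prometheus::Client.registry
    # Hypothetical histogram with explicit bucket upper bounds (in seconds)
    runtime = registry.histogram(:job_runtime_seconds,
                                 "A histogram of job execution time.",
                                 {},
                                 [0.5, 1, 5, 10, 60, 300])

    # Conceptually, every observation counts towards all buckets whose upper bound
    # is greater than or equal to the observed value (plus the implicit +Inf bucket),
    # so percentiles can later be estimated from the bucket counters.
    runtime.observe({ queue: "utils", worker: "EmptyJob" }, 0.42)
    runtime.observe({ queue: "utils", worker: "EmptyJob" }, 7.3)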


  29. That’s all for the theory! Now let’s get to the real work!


  30. Do It Yourself with Prometheus
    # Gemfile
    gem "prometheus-client", "~> 0.8"


  31. DIY: metric registration
    prometheus = Prometheus::Client.registry
    prometheus.counter :sidekiq_jobs_executed_total, "Total number of jobs sidekiq executed."
    prometheus.histogram :sidekiq_job_runtime_seconds, "A histogram of the job execution time."
    prometheus.gauge :sidekiq_jobs_waiting_count, "The number of jobs waiting to process in sidekiq."
    prometheus.gauge :sidekiq_active_workers_count, "The number of currently running machines with ..."
    prometheus.gauge :sidekiq_jobs_scheduled_count, "The number of jobs scheduled for later execution."
    prometheus.gauge :sidekiq_jobs_retry_count, "The number of failed jobs waiting to be retried"
    prometheus.gauge :sidekiq_jobs_dead_count, "The number of jobs exceeded their retry count."
    prometheus.gauge :sidekiq_active_processes_count, "The number of active Sidekiq worker processes."


  32. DIY: collect job execution metrics
    class SidekiqWorkerPrometheusMiddleware
      def call(worker, job, queue)
        prometheus = Prometheus::Client.registry
        labels = { queue: queue, worker: worker.class }
        start = Time.now
        begin
          yield
          prometheus.get(:sidekiq_jobs_executed_total).increment(labels.merge(success: true))
        rescue Exception # rubocop: disable Lint/RescueException
          prometheus.get(:sidekiq_jobs_executed_total).increment(labels.merge(success: false))
          raise
        ensure
          prometheus.get(:sidekiq_job_runtime_seconds).observe(labels, (Time.now - start).round(3))
        end
      end
    end


  33. DIY: collect queue stats
    class SidekiqExporter < ::Prometheus::Middleware::Exporter
      def call(env)
        refresh_sidekiq_stats if env["REQUEST_PATH"].start_with?(path)
        super
      end

      def refresh_sidekiq_stats
        r = ::Prometheus::Client.registry
        stats = ::Sidekiq::Stats.new
        stats.queues.each do |k, v|
          r.get(:sidekiq_jobs_waiting_count).set({ queue: k }, v)
        end
        r.get(:sidekiq_active_workers_count).set({}, stats.workers_size)
        r.get(:sidekiq_jobs_scheduled_count).set({}, stats.scheduled_size)
        r.get(:sidekiq_jobs_dead_count).set({}, stats.dead_size)
        r.get(:sidekiq_active_processes_count).set({}, stats.processes_size)
      end
    end


  34. DIY: Export metrics to Prometheus
    Sidekiq.configure_server do |config|
      config.redis = redis_credentials
      config.server_middleware do |chain|
        chain.add SidekiqWorkerPrometheusMiddleware
      end
      # Start a server to expose metrics to the Prometheus collector
      Thread.new do
        Rack::Handler::WEBrick.run(
          Rack::Builder.new do
            use Rack::CommonLogger, ::Sidekiq.logger
            use SidekiqExporter, registry: ::Prometheus::Client.registry # Exporter from the previous slide
            run ->(_env) { [404, { "Content-Type" => "text/plain" }, ["Not Found\n"]] }
          end,
          Host: ENV["PROMETHEUS_EXPORTER_BIND"] || "0.0.0.0",
          Port: ENV.fetch("PROMETHEUS_EXPORTER_PORT", 9310),
          AccessLog: []
        )
      end
    end
    This is needed because Prometheus works in a pull model (Prometheus collects metrics itself),
    while most other monitoring systems work in a push model (agents send metrics themselves).


  35. DIY: Success!
    curl http://localhost:9091/metrics
    # TYPE sidekiq_jobs_executed_total counter
    # HELP sidekiq_jobs_executed_total A counter of the total number of jobs sidekiq executed.
    sidekiq_jobs_executed_total{queue="utils",worker="EmptyJob",success="true"} 16
    sidekiq_jobs_executed_total{queue="default",worker="WaitingJob",success="true"} 160
    # TYPE sidekiq_jobs_waiting_count gauge
    # HELP sidekiq_jobs_waiting_count The number of jobs waiting to process in sidekiq.
    sidekiq_jobs_waiting_count{queue="utils"} 2190
    sidekiq_jobs_waiting_count{queue="default"} 1490
    # TYPE sidekiq_job_runtime_seconds histogram
    # HELP sidekiq_job_runtime_seconds A histogram of the job execution time.
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="0.5"} 9.0
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="1"} 16.0
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="5"} 16.0
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="10"} 16.0
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="60"} 16.0
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="300"} 16.0
    sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="+Inf"} 16.0
    sidekiq_job_runtime_seconds_sum{queue="utils",worker="EmptyJob"} 7.963


  36. DIY: Visualize in Grafana


  37. Easy, right?


  38.

  39.

  40. So what?
    Finally, it’s time for the monogatari!


  41.

  42.

  43. Korolyov, we have a problem.


  44.

  45.

  46. What about the cause?
    It was outside of Sidekiq and Ruby. It was in the database.


  47. Debugging time – just a few minutes!


  48. Alerts! Let robots analyze your metrics!
    But don’t create too many alerts. Evaluate each alert’s usefulness against the CASE methodology.


  49. Korolyov, we have a problem.



  50. Cool! But can it be done more easily?


  51. Yabeda
    A set of Ruby libraries for easily collecting and exporting application metrics.
    # Gemfile
    gem "yabeda-rails"
    # Yay! Standard RoR metrics already tracked!
    gem "yabeda-sidekiq"
    # Yay! Sidekiq metrics are registered and tracked!
    gem "yabeda-prometheus"
    # Now you can export metrics to Prometheus
    # But some configuration is still needed
    gem "yabeda-newrelic"
    # And to NewRelic at the same time
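    For Sidekiq processes, the “some configuration is still needed” part could look roughly like
    the sketch below; the exporter API (Yabeda::Prometheus::Exporter, start_metrics_server!) is
    assumed from the yabeda-prometheus README and should be checked against your version:

    # A sketch only: start a small HTTP endpoint inside the Sidekiq process
    # so Prometheus can scrape it (pull model), replacing the hand-rolled
    # WEBrick thread from slide 34.
    Sidekiq.configure_server do |config|
      config.on(:startup) do
        Yabeda::Prometheus::Exporter.start_metrics_server!
      end
    end

    # In a Rack/Rails web process, the exporter can instead be mounted as middleware:
    # use Yabeda::Prometheus::Exporter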


  52. Yabeda: registering metrics
    Yabeda.configure do
      group :sidekiq do
        counter(:jobs_total_count, comment: "Total number of jobs sidekiq executed.")
        counter(:jobs_success_count, comment: "Total number of jobs successfully processed by sidekiq.")
        counter(:jobs_failed_count, comment: "Total number of jobs failed in sidekiq.")
        histogram(:job_runtime, comment: "Job execution time.", unit: :seconds)
        gauge(:jobs_waiting_count, comment: "The number of jobs waiting to process in sidekiq.")
      end
    end


  53. Yabeda: collecting metrics
    # Explicitly in the application code
    Yabeda.things_duration.measure({ type: :foo }, 10)
    Yabeda.things_happened_total.increment({ type: :foo })

    # By subscribing to some events
    ActiveSupport::Notifications.subscribe 'occurence.custom_things' do |*args|
      Yabeda.things_duration.measure({ type: :foo }, 10)
      Yabeda.things_happened_total.increment({ type: :foo })
    end

    # Collect values only when it is time to send metrics to the monitoring system
    Yabeda.configure do
      collect do
        things_alive_total.set({}, Thing.alive.count)
      end
    end
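    As a concrete (hypothetical) example, the jobs_waiting_count gauge from slide 52 could be
    filled in such a collect block; yabeda-sidekiq ships similar logic, so normally you do not
    write this yourself:

    # A sketch only: fill the gauge registered on the previous slide from Sidekiq's
    # own stats each time the monitoring system pulls metrics. It assumes grouped
    # metrics are addressable by their prefixed name, like the ungrouped ones above.
    Yabeda.configure do
      collect do
        ::Sidekiq::Stats.new.queues.each do |queue_name, size|
          sidekiq_jobs_waiting_count.set({ queue: queue_name }, size)
        end
      end
    end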


  54. What’s available out of the box?
    Plugins
    ● Rails
    ● Sidekiq
    ● Puma (thanks to Dmitry Salahutdinov)
    (request queueing time metric included)
    Adapters
    ● Prometheus
    ● NewRelic
    ● DataDog (thanks to Dmitry Shvetsov)
    Want more?
    Search GitHub and RubyGems for Yabeda and find more gems from the community.
    And, as always, you can create your own and let me know!
    Come and tell me after this talk which plugin or adapter you would like to see made next.


  55. What does the name “Yabeda” mean?
    “Yabeda” is Russian for “tattletale”: a child who reports other children’s wrongdoings to adults.
    「ヤベダ」は他の子供のいたずらを大人に言い付ける子供です。 (“Yabeda” is a child who tells adults about other children’s mischief.)
    Picture source: www.psychologos.ru/articles/view/yabeda
    Sidekiq → Yabeda → Prometheus
    “Look, ma! Sidekiq is eating RAM again!”
    Of course, this isn’t Sidekiq’s fault, but a side effect of memory handling in Ruby.
    Read parts 5 and 6 of the “Sidekiq in Practice” series by Nate Berkopec to learn how to
    understand and fix memory issues.


  56. Yabeda
    A set of Ruby libraries to easily collect and export application metrics.
    github.com/yabeda-rb
    Check an example yourself:
    github.com/yabeda-rb/example-prometheus


  57. Conclusions
    With metrics collected and visualized, debugging becomes much easier
    Production will still break, but recovery will be much faster
    The effectiveness of changes and optimizations can be tracked
    Everyone can answer the question “So, how are we doing?” at a glance


  58. evilmartians.com
    twitter.com/evilmartians
    More in our blogs:
    evilmartians.com/chronicles
    dev.to/evilmartians
    Please grab Evil Martians stickers on the second floor. (Evil Martiansのステッカーを2階で取ってください)
    Questions? Ask them! Japanese is fine, too. (日本語でもいいです。)
    twitter.com/Envek
    github.com/Envek
    WE HAVE STICKERS! Please come and grab’em!
    Yabeda: github.com/yabeda-rb
    Read more about metric-driven development in these articles:
