Slide 1

Slide 1 text

Monitoring monogatari
Andrey Novikov
RubyKaigi 2019, Fukuoka, Japan

Slide 2

Slide 2 text

Hi, I’m Andrey and I’m from Korolyov, the Space capital of Russia.

Slide 3

Slide 3 text

The main ISS mission control room, Flight Control Center, Korolyov. Photo by NASA/Bill Ingalls.

Slide 4

Slide 4 text

Korolyov, we have a problem.* (* I have never actually heard such an expression.)

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text


Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Meanwhile, somewhere in China…

Slide 13

Slide 13 text

Korolyov, we have a problem.

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

Questions we need to answer
● Are we executing more jobs than before?
● Does optimization or scaling help?
● Are we going to drain the queues at all, and when?

Slide 16

Slide 16 text

What else do we want to know?
● How quickly queues are growing and shrinking ⏳
● How long it takes to execute different job types ⏰
● When things started to go south (slowed down)
● What is the reason, after all?

Slide 17

Slide 17 text

What else to track to understand causes?
● Servers: CPU load, RAM usage, IO, load average, storage usage, …
● Web: request count, processing time, queuing time, …
● Queued jobs: job counts, processing times, error rates, …
● Databases: query counts, processing time, returned row counts, …
● External APIs: request counts, processing times, error rates, …
● Network: latency, DNS lookup time, throughput, packet drop rates, …
● Custom things: different metrics for every integration, or whatever
● Business metrics: numbers of users, products, orders, …

Slide 18

Slide 18 text

Ultimately, what do we want? More graphs for the god of graphs! …and a second, third, or even fourth display, too…

Slide 19

Slide 19 text


Slide 20

Slide 20 text

Tu-144 airplane cockpit. The Tu-144 was the Soviet rival of the Concorde.

Slide 21

Slide 21 text

Now we need to learn some theory

Slide 22

Slide 22 text

What is monitoring, anyway? A continuous process of observing and recording system parameters and comparing them against predetermined criteria. “You can't control what you can't measure” – Tom DeMarco

Slide 23

Slide 23 text

Metric
A system parameter with a numerical value.
Examples for Sidekiq:
- number of jobs executed (jobs)
- size of the queue (jobs)
- average job processing duration (seconds)

Slide 24

Slide 24 text

Time series
Data points (metric values) listed or graphed in time order. A common visualization of a time series:
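
The definition above can be sketched in plain Ruby. This is purely illustrative (the variable names and sample values are made up, not taken from any monitoring library):

```ruby
# A time series is just metric values in time order.
# Here: the size of a Sidekiq queue sampled every 15 seconds,
# stored as [unix_timestamp, value] pairs.
t0 = Time.utc(2019, 4, 18, 12, 0, 0).to_i
queue_size_series = (0...5).map { |i| [t0 + i * 15, 100 + i * 10] }

queue_size_series.last.last # => 140 (the most recent value: the queue is growing)
```

A monitoring system stores exactly this kind of sequence per metric (per label set) and draws it as a line chart.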

Slide 25

Slide 25 text

Labels
Labels let you segment metrics by attributes and aggregate them back together. Examples: Rails requests by controller and action, Sidekiq jobs executed by job class, Sidekiq queue sizes by queue name. Warning: every unique set of labels generates a separate time series!
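
The cardinality warning can be made concrete with a back-of-the-envelope calculation (the label values below are invented for illustration):

```ruby
# Every unique combination of label values produces its own time series,
# so the series count grows multiplicatively with label cardinality.
queues   = %w[default utils mailers]
workers  = %w[EmptyJob WaitingJob ReportJob ImportJob]
statuses = %w[true false]

# Worst-case number of distinct time series for ONE labeled metric:
series_count = queues.size * workers.size * statuses.size
series_count # => 24
```

This is why high-cardinality labels (user IDs, request IDs) should never be used as metric labels.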

Slide 26

Slide 26 text

Counter
Tracks the number of events that occurred. Examples: number of web requests, network packets, Sidekiq jobs executed. CRDT: G-Counter, a monotonically increasing counter. It allows detecting application restarts and reliably aggregating values from any number of machines or processes. It is the base type for more complex types like histograms.
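
To make the counter semantics concrete, here is a gem-free sketch. `MiniCounter` is an invented name illustrating the idea, not the prometheus-client API:

```ruby
# Minimal labeled counter: it only ever goes up. A monitoring system can
# detect a process restart when a reported value suddenly decreases.
class MiniCounter
  def initialize
    @values = Hash.new(0) # one running total per unique label set
  end

  def increment(labels = {}, by = 1)
    raise ArgumentError, "counters only increase" if by < 0
    @values[labels] += by
  end

  def get(labels = {})
    @values[labels]
  end
end

jobs_executed = MiniCounter.new
jobs_executed.increment({queue: "default", success: true})
jobs_executed.increment({queue: "default", success: true})
jobs_executed.increment({queue: "default", success: false})
jobs_executed.get({queue: "default", success: true}) # => 2
```

Because each per-process total only grows, totals from any number of processes can be summed safely, which is exactly the G-Counter property mentioned above.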

Slide 27

Slide 27 text

Gauge
Tracks the state of the external world; no restrictions on how it changes. Examples: temperature, CPU load, RAM usage, Sidekiq queue size.
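
A matching gem-free sketch for the gauge (again, `MiniGauge` is illustrative, not a real library class):

```ruby
# Minimal labeled gauge: a snapshot of the outside world that may move
# in either direction, so it is set to a value rather than incremented.
class MiniGauge
  def initialize
    @values = {}
  end

  def set(labels, value)
    @values[labels] = value
  end

  def get(labels = {})
    @values[labels]
  end
end

queue_size = MiniGauge.new
queue_size.set({queue: "default"}, 1490)
queue_size.set({queue: "default"}, 1200) # shrinking is fine for a gauge
queue_size.get({queue: "default"}) # => 1200
```

Unlike counters, gauges from several processes cannot simply be summed over time; each sample replaces the previous one.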

Slide 28

Slide 28 text

Histogram and summary
Complex types that give rough statistics about the duration of some events, like the average time, 95th percentile, maximums, and minimums. Sidekiq example: job execution duration (seconds). Histogram: a set of counters for the number of jobs falling into predefined “buckets”. Summary: aggregations (like the 95th percentile) precalculated on the client. They’re not precise! Many monitoring systems don’t let you choose between these types.
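
A gem-free sketch of the histogram idea (`MiniHistogram` is invented for illustration; the bucket layout follows the general shape of the Prometheus exposition format):

```ruby
# Minimal histogram: cumulative counters for predefined "buckets",
# plus a running sum and count of all observations.
class MiniHistogram
  attr_reader :sum, :count

  def initialize(buckets)
    @buckets = buckets.sort
    @bucket_counts = Hash.new(0)
    @sum = 0.0
    @count = 0
  end

  def observe(value)
    @sum += value
    @count += 1
    # Buckets are cumulative: a value is counted in every bucket
    # whose upper bound ("le") it does not exceed.
    @buckets.each { |le| @bucket_counts[le] += 1 if value <= le }
    @bucket_counts[:inf] += 1 # the implicit +Inf bucket counts everything
  end

  def bucket(le)
    @bucket_counts[le]
  end
end

runtime = MiniHistogram.new([0.5, 1, 5])
[0.2, 0.7, 3.0].each { |seconds| runtime.observe(seconds) }
runtime.bucket(1)    # => 2 (two jobs finished within 1 second)
runtime.bucket(:inf) # => 3
```

Percentiles are then estimated from the bucket counts, which is why the result is only as precise as the bucket boundaries you chose in advance.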

Slide 29

Slide 29 text

That’s all for the theory! Let’s get to the real job!

Slide 30

Slide 30 text

Do It Yourself with Prometheus

# Gemfile
gem "prometheus-client", "~> 0.8"

Slide 31

Slide 31 text

DIY: metric registration

prometheus = Prometheus::Client.registry
prometheus.counter :sidekiq_jobs_executed_total, "Total number of jobs sidekiq executed."
prometheus.histogram :sidekiq_job_runtime_seconds, "A histogram of the job execution time."
prometheus.gauge :sidekiq_jobs_waiting_count, "The number of jobs waiting to process in sidekiq."
prometheus.gauge :sidekiq_active_workers_count, "The number of currently running machines with ..."
prometheus.gauge :sidekiq_jobs_scheduled_count, "The number of jobs scheduled for later execution."
prometheus.gauge :sidekiq_jobs_retry_count, "The number of failed jobs waiting to be retried."
prometheus.gauge :sidekiq_jobs_dead_count, "The number of jobs that exceeded their retry count."
prometheus.gauge :sidekiq_active_processes_count, "The number of active Sidekiq worker processes."

Slide 32

Slide 32 text

DIY: collect job execution metrics

class SidekiqWorkerPrometheusMiddleware
  def call(worker, job, queue)
    prometheus = Prometheus::Client.registry
    labels = { queue: queue, worker: worker.class }
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC) # clock call lost in extraction; a monotonic clock is assumed here
    begin
      yield
      prometheus.get(:sidekiq_jobs_executed_total).increment(labels.merge(success: true))
    rescue Exception # rubocop: disable Lint/RescueException
      prometheus.get(:sidekiq_jobs_executed_total).increment(labels.merge(success: false))
      raise
    ensure
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
      prometheus.get(:sidekiq_job_runtime_seconds).observe(labels, elapsed.round(3))
    end
  end
end

Slide 33

Slide 33 text

DIY: collect queue stats

class SidekiqExporter < ::Prometheus::Middleware::Exporter
  def call(env)
    refresh_sidekiq_stats if env["REQUEST_PATH"].start_with?(path)
    super
  end

  def refresh_sidekiq_stats
    r = Prometheus::Client.registry
    stats = Sidekiq::Stats.new # receiver lost in extraction; Sidekiq::Stats is the standard stats source
    stats.queues.each do |k, v|
      r.get(:sidekiq_jobs_waiting_count).set({ queue: k }, v)
    end
    r.get(:sidekiq_active_workers_count).set({}, stats.workers_size)
    r.get(:sidekiq_jobs_scheduled_count).set({}, stats.scheduled_size)
    r.get(:sidekiq_jobs_dead_count).set({}, stats.dead_size)
    r.get(:sidekiq_active_processes_count).set({}, stats.processes_size)
  end
end

Slide 34

Slide 34 text

DIY: export metrics to Prometheus

Sidekiq.configure_server do |config|
  config.redis = redis_credentials
  config.server_middleware do |chain|
    chain.add SidekiqWorkerPrometheusMiddleware
  end
  # Start a server to expose metrics for the Prometheus collector.
  # (The server-launch call was lost in extraction; a background
  # WEBrick thread is the usual pattern here.)
  Thread.new do
    Rack::Handler::WEBrick.run(
      Rack::Builder.new do
        use Rack::CommonLogger, ::Sidekiq.logger
        use SidekiqExporter, registry: Prometheus::Client.registry # exporter from the previous slide
        run ->(_env) { [404, {"Content-Type" => "text/plain"}, ["Not Found\n"]] }
      end,
      Host: ENV["PROMETHEUS_EXPORTER_BIND"] || "",
      Port: ENV.fetch("PROMETHEUS_EXPORTER_PORT", 9310),
      AccessLog: []
    )
  end
end

This is needed because Prometheus works in a pull model (the Prometheus server collects metrics itself), while most other monitoring systems work in a push model (agents send metrics themselves).

Slide 35

Slide 35 text

DIY: Success!

$ curl http://localhost:9091/metrics
# TYPE sidekiq_jobs_executed_total counter
# HELP sidekiq_jobs_executed_total A counter of the total number of jobs sidekiq executed.
sidekiq_jobs_executed_total{queue="utils",worker="EmptyJob",success="true"} 16
sidekiq_jobs_executed_total{queue="default",worker="WaitingJob",success="true"} 160
# TYPE sidekiq_jobs_waiting_count gauge
# HELP sidekiq_jobs_waiting_count The number of jobs waiting to process in sidekiq.
sidekiq_jobs_waiting_count{queue="utils"} 2190
sidekiq_jobs_waiting_count{queue="default"} 1490
# TYPE sidekiq_job_runtime_seconds histogram
# HELP sidekiq_job_runtime_seconds A histogram of the job execution time.
sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="0.5"} 9.0
sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="1"} 16.0
sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="5"} 16.0
sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="10"} 16.0
sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="60"} 16.0
sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="300"} 16.0
sidekiq_job_runtime_seconds_bucket{queue="utils",worker="EmptyJob",le="+Inf"} 16.0
sidekiq_job_runtime_seconds_sum{queue="utils",worker="EmptyJob"} 7.963

Slide 36

Slide 36 text

DIY: Visualize in Grafana

Slide 37

Slide 37 text

Easy, right?

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

So what? Finally, it’s time for the monogatari!

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

Korolyov, we have a problem.

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

What about the reason? It was outside of Sidekiq and Ruby. It was in the database.

Slide 47

Slide 47 text

Debugging time – just a few minutes! ⏲

Slide 48

Slide 48 text

Alerts! Let robots analyze your metrics! But don’t create too many alerts: evaluate each alert’s usefulness against the CASE methodology.

Slide 49

Slide 49 text

Korolyov, we have a problem.

Slide 50

Slide 50 text

Cool! But can’t it be done easier?

Slide 51

Slide 51 text

Yabeda
A set of Ruby libraries for easily collecting and exporting application metrics.

# Gemfile
gem "yabeda-rails"      # Yay! Standard Rails metrics are already tracked!
gem "yabeda-sidekiq"    # Yay! Sidekiq metrics are registered and tracked!
gem "yabeda-prometheus" # Now you can export metrics to Prometheus
                        # (but some configuration is still needed)
gem "yabeda-newrelic"   # ...and to New Relic at the same time

Slide 52

Slide 52 text

Yabeda: registering metrics

Yabeda.configure do
  group :sidekiq do
    counter(:jobs_total_count, comment: "Total number of jobs sidekiq executed.")
    counter(:jobs_success_count, comment: "Total number of jobs successfully processed by sidekiq.")
    counter(:jobs_failed_count, comment: "Total number of jobs failed in sidekiq.")
    histogram(:job_runtime, comment: "Job execution time.", unit: :seconds)
    gauge(:jobs_waiting_count, comment: "The number of jobs waiting to process in sidekiq.")
  end
end

Slide 53

Slide 53 text

Yabeda: collecting metrics

# Explicitly in the application code
Yabeda.things_duration.measure({type: :foo}, 10)
Yabeda.things_happened_total.increment({type: :foo})

# By subscribing to some events
ActiveSupport::Notifications.subscribe "occurence.custom_things" do |*args|
  Yabeda.things_duration.measure({type: :foo}, 10)
  Yabeda.things_happened_total.increment({type: :foo})
end

# Get values only when it is time to send metrics to the monitoring system
Yabeda.configure do
  collect do
    things_alive_total.set({}, Thing.alive.count)
  end
end

Slide 54

Slide 54 text

What’s available out of the box?

Plugins:
● Rails
● Sidekiq
● Puma (thanks to Dmitry Salahutdinov; includes a request queueing time metric)

Adapters:
● Prometheus
● NewRelic
● DataDog (thanks to Dmitry Shvetsov)

Want more? Search GitHub and RubyGems for “yabeda” to find more gems from the community. And, as always, you can create your own and let me know! Come and tell me after this talk which plugin or adapter you want made next.

Slide 55

Slide 55 text

What does the name “Yabeda” mean?
“Yabeda” means “tattletale” in Russian: a child who reports other children’s wrongdoings to adults.
Picture source:
Sidekiq, Yabeda, Prometheus: “Look, ma! Sidekiq is eating RAM again!”
Of course, this isn’t Sidekiq’s fault, but a side effect of memory handling in Ruby. Read parts 5 and 6 of Nate Berkopec’s “Sidekiq in Practice” series to learn how to understand and fix memory issues.

Slide 56

Slide 56 text

Yabeda
A set of Ruby libraries for easily collecting and exporting application metrics. Check out an example yourself:

Slide 57

Slide 57 text

Conclusions
● With metrics collected and visualized, debugging becomes much easier
● Production will still break, but recovery will be much faster
● The effectiveness of changes and optimizations can be tracked
● Everyone can answer the question “So, how are we doing?” at a glance

Slide 58

Slide 58 text

More in our blogs:
Questions? Ask them! (Japanese is fine too.)
WE HAVE STICKERS! Please come and grab ’em! (Pick up Evil Martians stickers on the 2nd floor.)
Yabeda:
Read more about metric-driven development in these articles: