
ACIDic Jobs: A Layman's Guide to Job Bliss (Speaker Notes)

Stephen
November 07, 2021


Background jobs have become an essential component of any Ruby infrastructure, and, as the Sidekiq Best Practices remind us, it is essential that jobs be "idempotent and transactional." But how do we make our jobs idempotent and transactional? In this talk, we will explore various techniques to make our jobs robust and ACIDic.


Transcript

  1. Stephen Margheim — RubyConf 2021 ACIDic Jobs A Layman's Guide

    to Job Bliss It is an honor to be here today and to be able to talk with you all—those of you here in person, those on the live-stream, and those who might be watching in the future—about how we might be able to achieve job bliss
  2. Stephen Margheim @fractaledmind My name is Stephen. You can find

    me on Twitter @fractaledmind, tho fair warning I am more of a voyeur than an author there. I have somehow found myself these days working across 3 different projects • I am the head of engineering at test IO, which offers crowd-driven QA testing as a service to other companies • I am also a consulting engineer for RCRDSHP, a web3 company in the music NFT space • I am also building Smokestack QA on the side, which is an application that helps to integrate manual QA into teams' GitHub Pull Requests But, enough about my jobs, let's get to the topic at-hand — jobs
  3. Jobs are essential in every company, with every app, jobs

    are essential. Because jobs are what your app *does*, expressed as a distinct unit. And they are powerful: jobs can be called from anywhere, run sync or async, and have retry mechanisms built in.
  4. Job == Verb Job != ActiveJob object First, when I

    say "job" I do not mean an instance of ActiveJob. jobs are verbs, in the same way models are nouns
  5. Job == State Mutation Job != inspection or retrieval of

    information More specifically, jobs represent those verbs that change the state of your system
  6. Job => Side Effects Job !=> return value and jobs

    are state mutations that produce side effects. We typically do not care about their return value: since jobs are most often run async, the return value is never returned to the caller. And since jobs are most often state mutations, and state is most often stored in some persistent data store, jobs must produce the side-effect of storing new state
  7. A job is: • a Ruby class object, • representing a state mutation

    action, • that takes as input a representation of initial state, • and produces side-effects representing a next state. Thus, we can say that a job is: ... This definition of "job" encompasses a wide variety of patterns that have emerged in the Ruby ecosystem
  8. ActionDoer.call *arguments # service object ActionJob.perform_now *arguments # active job

    ActionWorker.new.perform *arguments # sidekiq worker ActionOperation.run *arguments # operation class For the rest of the talk, I am going to use the language of "jobs" and the interface of ActiveJob, but the principles we will be exploring apply to all of these various ways of expressing a state mutation as a Ruby object. So, how do we build jobs *well*?
  9. Jobs must be idempotent & transactional Mike Perham, in the

    Sidekiq docs, reminds us of this key truth: ... These are the two key characteristics of a well-built job.
  10. Transaction (diagram: Operation 1, Operation 2, Operation 3 grouped inside a single transaction) A transaction is

    a collection of operations, typically with ACIDic guarantees
  11. • Atomicity: everything succeeds or everything fails • Consistency: the

    data always ends up in a valid state, as defined by your schema • Isolation: concurrent transactions won't conflict with each other • Durability: once committed always committed, even with system failures The ACIDic Guarantees The ACIDic guarantees are the foundational characteristics needed for correct and precise state mutations ... > In 1983, Andreas Reuter and Theo Härder coined the acronym ACID as shorthand for atomicity, consistency, isolation, and durability. They were building on earlier work by Jim Gray who’d proposed atomicity, consistency, and durability, but had initially left out the I. ACID is one of those inventions of the 80s that’s not only just still in use in the form of major database systems like Postgres, Oracle, and MSSQL, but which has never been displaced by a better idea.
  12. Jobs · Databases Because SQL databases give us ACIDic transactions

    **for free** We can lean on the power and resilience of SQL databases to help make our jobs ACIDic. > "I want to convince you that ACID databases are one of the most important tools in existence for ensuring maintainability and data correctness in big production systems"
  13. Idempotency f(f(x)) == f(x) Job.perform & Job.perform == Job.perform As

    for idempotency, for something to be "idempotent" it needs to be safely repeatable
  14. • Functional Idempotency: the function always returns the same result,

    even if called multiple times f(f(x)) == f(x) • Practical Idempotency: the side-effect(s) will happen once and only once, no matter how many times the job is performed Job.perform == Job.perform & Job.perform The Idempotent Guarantee idempotency is often defined in terms of pure functions, which while mathematically interesting, is not particularly helpful to us, since return values aren't typically meaningful to jobs. For jobs, we can use a more practical definition focused on side-effects. ...
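
    As a tiny illustration of the practical definition (the `order` and `account` records here are hypothetical, not from the talk):

    # Practically idempotent: no matter how many times this runs,
    # the order ends up in exactly the same state.
    order.update!(status: "processed")

    # Not idempotent: every run shifts the balance again,
    # so a retry would apply the change a second time.
    account.update!(balance: account.balance - 10_00)
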
  15. Jobs · Retries Most job backends give us automatic retries

    **for free** This means that if we make our jobs idempotent, we can lean on the power of the retry mechanism to ensure eventual correctness.
  16. class JobRun < ActiveRecord::Base # ... end Now, the core

    idea I want to explore with you all today is how this class can be used and leveraged to provide flexible ways to create jobs of various degrees of complexity with these characteristics.
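
    Before going further, it may help to picture what such a `JobRun` table could look like. This is only a sketch; the columns, types, and indexes are assumptions inferred from the examples on the following slides:

    class CreateJobRuns < ActiveRecord::Migration[6.1]
      def change
        create_table :job_runs do |t|
          t.string   :job_class, null: false
          t.string   :job_id                 # ActiveJob's job/provider ID
          t.jsonb    :job_args, default: []  # serialized job arguments
          t.string   :recovery_point         # used later for step-wise workflows
          t.datetime :completed_at
          t.timestamps
        end

        # Pick the uniqueness that matches your strategy (see Level 1 below):
        add_index :job_runs, [:job_class, :job_id], unique: true
        # add_index :job_runs, [:job_class, :job_args], unique: true
      end
    end
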
  17. Nathan Griffith provides an excellent overview of how to write

    transactional and idempotent jobs in his RailsConf talk from earlier this year; I highly recommend you go and check it out, tho I will briefly summarize his key points. His examples are save methods for synthetic models, but the principles are the same, and based on our definition, his save methods are jobs in-the-making
  18. • Use a transaction to guarantee atomic execution • Use

    locks to prevent concurrent data access • Use idempotency and retries to ensure eventual completion • Ensure enqueuing other jobs is co-transactional • Split complex operations into steps ACIDic Job Principles — Nathan Griffith Nathan homes in on these 5 core characteristics of resilient jobs: ... I wholeheartedly agree with this assessment. What I want to explore today is how we can build more generalized tools and patterns to help us more easily ensure our jobs conform to these principles. So, let's consider Nathan's list and work our way through it, exploring how to build a toolset for writing truly ACIDic jobs.
  19. ACIDic Jobs Level 0 — Transactional Jobs So, let's start

    with the first two. How can we build transactional jobs generically and flexibly?
  20. def perform(from_account, to_account, amount) run = JobRun.find_or_create_by!( job_class: self.class, job_id:

    job_id, job_args: [from_account, to_account, amount]) run.with_lock do from_account.lock! to_account.lock! from_account.update!(balance: from_account.balance - amount) to_account.update!(balance: to_account.balance + amount) end end By using a database record representing this particular job run, we have a mechanism for a database transaction that can be used in any job. Plus, we can also mitigate concurrency issues by locking the database row for this particular job run.
  21. — Mike Perham “Just remember that Sidekiq will execute your

    job at least once, not exactly once.” This is important because even Sidekiq, the titan in the Ruby job backend world, makes only an at-least-once guarantee for doing work.
  22. • Use a database record to make job runs transactional

    • Use a database lock to mitigate concurrency issues ACIDic Jobs Level 0 Recap The first step to making resilient and robust jobs, jobs that are transactional, is to be sure to always wrap the operation in a database transaction and to use database locks to mitigate concurrency issues. A generic `JobRun` class allows us to wrap any and all jobs in such locked database transactions, and thus provides a solid foundation for our jobs.
  23. ACIDic Jobs Level 1 — Idempotent Jobs Next, let's consider

    how we can make our transactional jobs idempotent.
  24. Idempotency & Uniqueness To guarantee idempotency, we must be able

    to define and identify the job uniquely For a job to be idempotent, it requires being able to uniquely identify the unit of work being done
  25. john_account.balance # initial state # => 100_00 TransferBalanceJob.perform_later(john_account, jane_account, 10_00)

    TransferBalanceJob.perform_later(john_account, jane_account, 10_00) john_account.balance # resulting state # => 80_00 or 90_00 ? Let's consider as an example running a balance transfer job twice with the same arguments. The core question for building idempotency into our job is what is the correct resulting state? How do we tell when the system is executing a job multiple times invalidly versus when the system is executing a job multiple times validly?
  26. • each job run uses a generic unique entity representing

    this job run • each job run uses a generic unique entity representing this job execution (based on args) Forms of Job Uniqueness There are basically 2 different ways to make your job have a sense of uniqueness. Firstly, you can treat each run of the job as a unique entity, or you could treat each execution of the job as a unique entity. What do I mean? Well, ...
  27. def perform(from_account, to_account, amount) run = JobRun.find_or_create_by!(job_class: self.class, job_id: job_id)

    run.with_lock do return if run.completed? from_account.update!(balance: from_account.balance - amount) to_account.update!(balance: to_account.balance + amount) run.update!(completed_at: Time.current) end end Unique Job by Job Run For our first example, let's imagine our `JobRun` model has a uniqueness constraint on the union of `job_class` and `job_id` and keeps track of when job runs are completed. With just a bit of boilerplate, we have a job that once enqueued will only ever execute the operation once, no matter if Sidekiq picks that job off the queue multiple times.
  28. john_account.balance # initial state # => 100_00 TransferBalanceJob.perform_later(john_account, jane_account, 10_00)

    TransferBalanceJob.perform_later(john_account, jane_account, 10_00) john_account.balance # resulting state # => 80_00 ? This approach treats each separate enqueuing as a valid enqueuing, even with duplicate arguments. When using this strategy for job uniqueness we trust enqueuing but are wary of dequeuing. We could imagine the opposite tho, and this is the second strategy.
  29. def perform(from_account, to_account, amount) run = JobRun.find_or_create_by!(job_class: self.class, job_args: [from_account,

    to_account, amount]) run.with_lock do return if run.completed? from_account.update!(balance: from_account.balance - amount) to_account.update!(balance: to_account.balance + amount) run.update!(completed_at: Time.current) end end Unique Job by Execution Args We could imagine a slightly different `JobRun` model that cares about the unique union of job_class and job_args. In this situation, it wouldn't matter how many times a job is enqueued or dequeued, it will only execute the operation once. This strategy is essentially the idempotent job version of memoization.
  30. john_account.balance # initial state # => 100_00 TransferBalanceJob.perform_later(john_account, jane_account, 10_00)

    TransferBalanceJob.perform_later(john_account, jane_account, 10_00) john_account.balance # resulting state # => 90_00 ? So, with this "memoization" strategy, we would treat enqueuing with duplicate arguments as an invalid enqueuing. In other words, we don't fully trust the enqueuing code. This strategy is more cautious than the first, but that doesn't necessarily make it better. There are certainly units of work that rightfully should do the same thing multiple times. In our balance transfer example, it is perfectly reasonable to think that John might give Jane $10 once and then rightfully give her another $10 later.
  31. Unique Job flexibly and generically class TransferBalanceJob < ApplicationJob prepend

    UniqueByJobRun uniquely_identified_by_job_id # uniquely_identified_by_job_args def perform(from_account, to_account, amount) from_account.lock! to_account.lock! from_account.update!(balance: from_account.balance - amount) to_account.update!(balance: to_account.balance + amount) end end But, with a bit of work, we could build a job concern that allows each job to declare how it should be uniquely identified, all while still relying simply on our `JobRun` model. Such an approach would allow us to make our `TransferBalanceJob` behave in whichever way was correct for our system ...
  32. john_account.balance # initial state # => 100_00 TransferBalanceJob.perform_later(john_account, jane_account, 10_00)

    TransferBalanceJob.perform_later(john_account, jane_account, 10_00) john_account.balance # resulting state # => 80_00 ... whether that is allowing the system to enqueue multiple runs of the same job with the same arguments
  33. Unique Job flexibly and generically class TransferBalanceJob < ApplicationJob prepend

    UniqueByJobRun # uniquely_identified_by_job_id uniquely_identified_by_job_args def perform(from_account, to_account, amount) from_account.lock! to_account.lock! from_account.update!(balance: from_account.balance - amount) to_account.update!(balance: to_account.balance + amount) end end Or, if the job was uniquely identified by its job args, ...
  34. john_account.balance # initial state # => 100_00 TransferBalanceJob.perform_later(john_account, jane_account, 10_00)

    TransferBalanceJob.perform_later(john_account, jane_account, 10_00) john_account.balance # resulting state # => 90_00 ... constraining the system to only execute the operation once, even if enqueued multiple times.
  35. • Use a database record to make job runs idempotent

    • custom or generic • by job ID or by job args ACIDic Jobs Level 1 Recap Thus, we could use our `JobRun` class as the foundation for building both transactionality and idempotency into our jobs. But, thus far, we have only considered jobs whose operations are only database writes. Often, our jobs need to do more.
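
    To make that recap concrete, here is one way the `UniqueByJobRun` concern prepended in the examples above might be sketched. The module name and the two class macros come from the slides; the implementation details (and the `completed?` helper on `JobRun`) are my assumptions:

    module UniqueByJobRun
      def self.prepended(base)
        base.extend(ClassMethods)
      end

      module ClassMethods
        attr_accessor :unique_by

        def uniquely_identified_by_job_id
          self.unique_by = :job_id
        end

        def uniquely_identified_by_job_args
          self.unique_by = :job_args
        end
      end

      # Wraps the job's own perform in a locked, once-only JobRun record.
      def perform(*args)
        identity =
          if self.class.unique_by == :job_args
            { job_args: serialize["arguments"] }
          else
            { job_id: job_id }
          end
        run = JobRun.find_or_create_by!(identity.merge(job_class: self.class.name))

        run.with_lock do
          return if run.completed?

          super
          run.update!(completed_at: Time.current)
        end
      end
    end
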
  36. ACIDic Jobs Level 2 — Enqueuing other Jobs One of

    the most common additional tasks required of jobs is to enqueue other jobs
  37. uniquely_identified_by_job_id def perform(from_account, to_account, amount) from_account.lock! to_account.lock! from_account.update!(balance: from_account.balance -

    amount) to_account.update!(balance: to_account.balance + amount) TransferMailer.with(account: from_account).outbound.deliver_later TransferMailer.with(account: to_account).inbound.deliver_later end We can imagine that after transferring balance, we need to send out confirmation emails to both parties. As it stands, this code is susceptible to problems.
  38. Failure Condition 1 (timeline diagram: the web process runs from.save!, to.save!, and

    TransferMailer.deliver_later inside a transaction; the background worker dequeues and starts the mailer job before the transaction commits, and the job fails) The first possible problem is that Sidekiq is simply too fast and enqueues, dequeues, and executes the job before the original job's database transaction commits. Because our outer job is ACIDic, this second job won't be able to see the state of the database created by the transaction until that transaction commits. While annoying, this problem at least naturally resolves, as the retry mechanism will try this second job again, and at some point that transaction will commit and the job will eventually succeed.
  39. Failure Condition 2 (timeline diagram: as before, the web process runs from.save!, to.save!, and

    TransferMailer.deliver_later, but this time the transaction rolls back while the already-dequeued mailer job starts and fails) The second problem is related, in that it is due to enqueuing a job within a transaction, but is also more pernicious. Imagine that instead of simply taking a while for the transaction to commit, the transaction actually rolls back. In this case the state mutations will be discarded completely, and the job inserted into the queue will never succeed no matter how many times it's retried.
  40. Solution Option 1 • A database-backed job queue • delayed_job

    • que • good_job But... no Sidekiq and increased db load One solution to these problems is to simply use a database-backed queue for all of your jobs, which makes all job enqueuing co-transactional. This is Nathan's suggestion. However, this means no Sidekiq, and for me that is a non-starter, as Sidekiq offers more than just a job backend.
  41. The second option is a pattern I first learned of

    from Brandur Leach in a blog post that is truly excellent and well worth your time. In fact, his entire blog is excellent and worth your time.
  42. Solution Option 2 • A transactionally-staged job queue So... more

    Sidekiq and minimal increased db load Brandur lays out the core value proposition clearly: make job enqueuing co-transactional by "staging" jobs in the database before enqueuing them in your background queue.
  43. uniquely_identified_by_job_id def perform(from_account, to_account, amount) from_account.lock! to_account.lock! from_account.update!(balance: from_account.balance -

    amount) to_account.update!(balance: to_account.balance + amount) TransferMailer.with(account: from_account).outbound.deliver_acidic TransferMailer.with(account: to_account).inbound.deliver_acidic end What if we could make it as easy to stage a job as it is to enqueue it? I was surprised at how little code we need to achieve this. For our example, we can extend `ActionMailer::MessageDelivery` to add a method to stage the job in a database record
  44. def deliver_acidic(options = {}) job = delivery_job_class attributes = {

    adapter: "activejob", job_name: job.name } job_args = if job <= ActionMailer::Parameterized::MailDeliveryJob [@mailer_class.name, @action, "deliver_now", {params: @params, args: @args}] else [@mailer_class.name, @action, "deliver_now", @params, *@args] end attributes[:job_args] = job.new(job_args).serialize StagedJob.create!(attributes) end Our custom delivery method does a bit of work to handle the different kinds of ActionMailer deliveries, but at its heart all it essentially does is create a database record, which will respect the transactional boundary.
  45. class StagedJob < ActiveRecord::Base after_create_commit :enqueue_job def enqueue_job case adapter

    when "activejob" ActiveJob::Base.deserialize(job_args).enqueue when "sidekiq" Sidekiq::Client.push("class" => job_name, "args" => job_args) end end end And that database record can then enqueue the job via an ActiveRecord callback Nathan and Brandur both actually imagine this pattern as requiring an independent process to de-stage staged jobs and enqueue them. But, a little bit of ActiveRecord magic allows us to have our cake and eat it too.
  46. • Use transactionally-staged jobs to keep job enqueuing co-transactional with

    standard database operations • while keeping Sidekiq, and • not requiring an independent de-staging process ACIDic Jobs Level 2 Recap So, by using transactionally-staged jobs, we can keep the ACIDic guarantees provided by our database transaction, keep using Sidekiq, and not need an independent de-staging process.
  47. def perform(order) order.lock! order.process_and_fulfill! ShopifyAPI::Fulfillment.create!({ amount: order.amount, customer: order.purchaser, })

    OrderMailer.with(order: order).fulfilled.deliver_acidic end A standard example would be fulfilling an order in a Shopify store. You receive the webhook for the order, ...
  48. def perform(order) order.lock! order.process_and_fulfill! ShopifyAPI::Fulfillment.create!({ amount: order.amount, customer: order.purchaser, })

    OrderMailer.with(order: order).fulfilled.deliver_acidic end you process that order and do whatever all you need to do in your database, ...
  49. def perform(order) order.lock! order.process_and_fulfill! ShopifyAPI::Fulfillment.create!({ amount: order.amount, customer: order.purchaser, })

    OrderMailer.with(order: order).fulfilled.deliver_acidic end And only after that transaction successfully commits do we start with step 2, telling Shopify that the order has been fulfilled ...
  50. def perform(order) order.lock! order.process_and_fulfill! ShopifyAPI::Fulfillment.create!({ amount: order.amount, customer: order.purchaser, })

    OrderMailer.with(order: order).fulfilled.deliver_acidic end And only after we successfully create the Shopify fulfillment do we want to send the email notification
  51. def perform(order) order.lock! order.process_and_fulfill! ShopifyAPI::Fulfillment.create!({ amount: order.amount, customer: order.purchaser, })

    OrderMailer.with(order: order).fulfilled.deliver_acidic end But what happens if we can't, for some reason (like our email service provider is experiencing downtime), send out the notification email?
  52. def perform(order) order.lock! order.process_and_fulfill! ShopifyAPI::Fulfillment.create!({ amount: order.amount, customer: order.purchaser, })

    OrderMailer.with(order: order).fulfilled.deliver_acidic end ? When the job retries, will we fulfill this order a second time, as if the user had purchased the same product twice?
  53. Another Brandur Leach blog post can help us navigate out

    of this tricky situation, showing us how to break our complex job workflow into transactional steps
  54. Workflow step 1 step 2 step 3 In our example

    workflow, we have 3 steps that are serially-dependent
  55. Workflow — Run 1 step 1 step 2 step 3

    If, on the first run, the first step succeeds but the second step fails
  56. Workflow — Run 2 step 1 step 2 step 3

    On the second run, the first step will be skipped altogether, and the retry will jump straight to the second step.
  57. Job-wise vs Step-wise Idempotency This is what it means to

    move from job-wise to step-wise idempotency ...
  58. Job-wise vs Step-wise vs Penny-wise Idempotency ... and I won't even

    get into what all it would take to make our jobs "pennywise idempotent" ...
  59. uniquely_identified_by_job_args def perform(order) @job.with_lock do order.process_and_fulfill! @job.update!(recovery_point: :fulfill_order) end if

    @job.recovery_point == :start # ... end But, we can make the workflow at least step-wise idempotent by again leveraging the power of a database record representing the job execution. If we add a `recovery_point` column to the record, we can track which steps in the workflow have successfully been completed. Presuming the job record is created with the value initially set to `:start`, we then simply update the column at the end of the step. We then guard the execution of that step with the value of this column.
  60. uniquely_identified_by_job_args def perform(order) # ... @job.with_lock do ShopifyAPI::Fulfillment.create!({ ... })

    @job.update!(recovery_point: :send_email) end if @job.recovery_point == :fulfill_order # ... end Then, we guard the second step operation with the recovery point name of the second step, and again update the job record with the name of the next recovery point at the end of the step.
  61. uniquely_identified_by_job_args def perform(order) # ... @job.with_lock do OrderMailer.with(order: order).fulfilled.deliver_acidic @job.update!(recovery_point:

    :finished) end if @job.recovery_point == :send_email end In the final step, we keep the same guarding logic but update the recovery point to `:finished` upon step completion.
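
    Putting those three fragments together, the whole hand-rolled step-wise perform might read something like this (a sketch; it assumes the concern has already found-or-created `@job` with its recovery_point initially set to :start, and that the stored value compares cleanly against these symbols):

    uniquely_identified_by_job_args

    def perform(order)
      # Step 1: only runs if we have never gotten past :start
      @job.with_lock do
        order.process_and_fulfill!
        @job.update!(recovery_point: :fulfill_order)
      end if @job.recovery_point == :start

      # Step 2: only runs once step 1 has committed
      @job.with_lock do
        ShopifyAPI::Fulfillment.create!(amount: order.amount, customer: order.purchaser)
        @job.update!(recovery_point: :send_email)
      end if @job.recovery_point == :fulfill_order

      # Step 3: only runs once steps 1 and 2 have committed
      @job.with_lock do
        OrderMailer.with(order: order).fulfilled.deliver_acidic
        @job.update!(recovery_point: :finished)
      end if @job.recovery_point == :send_email
    end
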
  62. include WithAcidity def perform(order) with_acidity do step :process_order step :fulfill_order

    step :send_emails end end def process_order; # ... end def fulfill_order; # ... end def send_emails; # ... end Imagine we could have 1 Workflow Job that provides a clear overview of all of the steps and how they flow, but is also step-wise idempotent. Moreover, we wouldn't have to repeat the boilerplate of the job lock transaction and the recovery-key updates.
  63. module WithAcidity def perform_step(current_step_method, next_step_method) return unless @job.recovery_point == current_step_method

    @job.with_lock do method(current_step_method).call @job.update!(recovery_point: next_step_method) end end end We could imagine a relatively simple heart to another concern that provides this DSL, which simply calls the step method and updates the job record recovery point within a transaction. Once again, our `JobRun` class and our locked database transactions form the foundation of an increasingly powerful and flexible set of tools for building resilient and robust jobs.
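
    And here is a sketch of how the `with_acidity` / `step` DSL might drive that `perform_step` method. Only `perform_step` appears on the slide (reproduced here with a to_s comparison, since recovery_point is a string column in this sketch); the rest is my assumption about how the pieces could fit together:

    module WithAcidity
      def with_acidity
        @__steps = []
        yield                      # collect the declared steps
        @__steps << :finished      # sentinel recovery point

        @job = JobRun.find_or_create_by!(
          job_class: self.class.name,
          job_args: serialize["arguments"]
        )
        @job.update!(recovery_point: @__steps.first) if @job.recovery_point.blank?

        # Walk the steps in order; perform_step skips any step that was
        # already completed on a previous run.
        @__steps.each_cons(2) do |current_step, next_step|
          perform_step(current_step, next_step)
        end
      end

      def step(method_name)
        @__steps << method_name
      end

      def perform_step(current_step_method, next_step_method)
        return unless @job.recovery_point.to_s == current_step_method.to_s

        @job.with_lock do
          method(current_step_method).call
          @job.update!(recovery_point: next_step_method)
        end
      end
    end
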
  64. • Use a recovery key to keep track of which

    steps in a workflow job have been successfully completed • make each step ACIDic, and • keep the entire workflow job ACIDic ACIDic Jobs Level 3 Recap So, by using this recovery key field on our `JobRun` records, we can move the ACIDic guarantees provided by our database transaction down into each individual step of the workflow, and thus allow the entire workflow job to remain sufficiently transactional and step-wise idempotent
  65. ACIDic Jobs Level 4 — Step Batches The problem with

    the step-wise idempotent workflow is that all work is done sequentially, in the same worker, in the same queue. What if we have work that can be done in parallel, or work that is better done on a different queue?
  66. def perform(order) with_acidity do step :process_order step :fulfill_order, awaits: [ShopifyFulfillJob]

    step :send_emails end end For example, instead of blocking our workflow queue with the external API call to Shopify, what if we could simply call a job that runs on a separate queue, but still not move on to step 3 until that job succeeds?
  67. — Mike Perham “Batches are Sidekiq Pro's [tool to] create

    a set of jobs to execute in parallel and then execute a callback when all the jobs are finished.” Sidekiq Pro offers the amazing Batches feature, which provides—among other things—a callback for when a specified collection of jobs are all successfully finished. This provides a mechanism for adding a wonderful new layer of power to our jobs
  68. Parallel Executing + Workflow Blocking It allows us to define

    steps that have parallel-executing jobs, but the step still blocks the workflow from moving on to the next step
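
    To give a flavor of how a step declared with `awaits:` might be wired up, here is a rough sketch using Sidekiq Pro's Batch API. None of this comes from the slides: the method and the StepCallback class are hypothetical names, the awaited jobs are assumed to be plain Sidekiq workers, and acidic_job's real implementation may well differ:

    def perform_awaited_step(current_step_method, next_step_method, awaits:)
      batch = Sidekiq::Batch.new
      batch.description = "#{self.class.name}##{current_step_method}"
      # When every awaited job succeeds, advance the recovery point and
      # re-enqueue this workflow job so it resumes at the next step.
      batch.on(:success, StepCallback, "job_run_id" => @job.id, "next_step" => next_step_method)
      batch.jobs do
        # What arguments the awaited workers need is app-specific;
        # the JobRun id is just a placeholder here.
        awaits.each { |worker_class| worker_class.perform_async(@job.id) }
      end
    end

    class StepCallback
      def on_success(_status, options)
        run = JobRun.find(options["job_run_id"])
        run.update!(recovery_point: options["next_step"])
        # Re-enqueue the workflow job; it will skip straight to the new step.
        # (Properly deserializing the stored arguments is glossed over here.)
        run.job_class.constantize.perform_later(*run.job_args)
      end
    end
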
  69. (diagram: Workflow Job A hands step 2 off to Shopify Fulfill Job A on a separate

    queue; Workflow Job B starts in parallel; Workflow Job A resumes only once the fulfill job succeeds) Thus, we could have a 3-step workflow that actually executes the second step on a separate queue, allowing another workflow job to start in parallel, while still ensuring that step 3 doesn't start until that separate job succeeds.
  70. • Use Sidekiq Batches to allow parallel, separately queued jobs

    to be used within a multi-step workflow • keep steps serially dependent • while allowing for parallelization ACIDic Jobs Level 4 Recap So, by leveraging the power of Sidekiq Batches, we can take our ACIDic jobs to the next level and enable steps to compose parallel-executing jobs that can run on separate queues, while the workflow as a whole still retains its step-wise idempotency and serial dependence.
  71. I am working to provide all of these various techniques

    and tools for building out increasingly complex jobs, all while maintaining transactionality and idempotency, in a new gem that I call `acidic_job`. It is still in a pre-1.0 state, but it is already being used successfully in production across my various work projects. I have no doubt that the community, that you all, could help me bring this to 1.0 and provide the Ruby ecosystem with a powerful and flexible toolset for building resilient, robust, ACIDic jobs. END