Sidekiq Queueing Patterns

1 Intro

It seems like almost every app or service needs background processing of some kind. Over the years I've used quite a few different background job processing systems, ranging from home-made systems built on specialized message queueing middleware, to delayed_job, resque, and sidekiq.

At this point I'm whole-heartedly in favor of sidekiq and tend to assume that each new service will need sidekiq workers in addition to whatever else it does.

Of course, once you have sidekiq workers you need to figure out how to organize your jobs. Solutions can range from incredibly simple to incredibly complicated. The right configuration will very much depend on the specific application.

Unfortunately, the shortcomings of a given setup usually show up under heavy load, so you probably won't realize you've got a problem until you've got some very full queues.

Let's run down some configurations.

2 All Jobs in One Queue

This is the default. You probably use queue: default.

:concurrency: 5
:pidfile: tmp/pids/sidekiq.pid
staging:
  :concurrency: 10
production:
  :concurrency: 20
:queues:
  - default
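
For reference, a worker that runs under this default setup looks something like the sketch below (the class, mailer, and arguments are made up for illustration):

class WelcomeEmailJob
  include Sidekiq::Worker
  # no queue option, so jobs land in the "default" queue

  def perform(user_id)
    UserMailer.welcome(user_id).deliver_now
  end
end

# Enqueue it from anywhere in the app:
WelcomeEmailJob.perform_async(user.id)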

2.1 You Get

  • Very simple. You put things into a queue, and they run.
  • Jobs run roughly in the order that they enter the queue. I say roughly, because you can't count on strict ordering when you have multiple workers.
  • Just make sure you always have enough workers to keep up.
  • Most of the time things work out just fine after a while.

2.2 But

What if the queue starts growing? If one of your jobs starts taking too long (perhaps Bookface is having an API outage, so requests are timing out - taking your job from 3 seconds to 60 seconds), those jobs will keep anything else from running on your worker threads during that whole time. Eventually one of these "greedy" jobs will occupy each of your workers and create a logjam that keeps any of your other jobs (which DON'T have a problem) from running.

At this point you're stuck.

  • You need to figure out what job is having trouble, and limit its impact on all the other jobs.
  • Adding more sidekiq workers won't help, because they too will quickly be occupied by your slow jobs. Autoscaling in this case will just waste resources.
  • Sidekiq UI and the sidekiq API let you find queue sizes, but don't tell you what jobs are in those queues.
  • Sidekiq gives you great tools for managing queues, but not types of jobs.
  • Even if you have high and low priority queues, you can't easily move jobs between them.

3 One Queue Per Job Type, All The Same Priority

You can give each worker class a queue named after it. I'd recommend doing this from the start.
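
In the worker itself that just means setting the queue option, something like this (the class name is invented, but the queue name matches the config below):

class BookFaceFrobJob
  include Sidekiq::Worker
  # queue named after the class, matching the sidekiq.yml entry below
  sidekiq_options queue: "book_face_frob_job"

  def perform(user_id)
    # talk to the Bookface API for this user...
  end
end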

:concurrency: 5
:pidfile: tmp/pids/sidekiq.pid
staging:
  :concurrency: 10
production:
  :concurrency: 20
:queues:
  - ["book_face_frob_job", 1]
  - ["welcome_email_job", 1]

Listing queues with those numbers after them makes sidekiq treat them as all about the same priority. You can weight them a bit by using higher numbers, but basically you're saying that the book_face_frob_job doesn't have to get processed before the welcome_email_job.

If you install the sidekiq-limit_fetch gem, you can tell sidekiq to run in "dynamic" mode, which will process any job it sees, with no need to pre-configure a list of queues.

:concurrency: 5
:pidfile: tmp/pids/sidekiq.pid
staging:
  :concurrency: 10
production:
  :concurrency: 20
:dynamic: true

This way you won't have to scratch your head after adding a new job, trying to figure out why you can't get it to run.

3.1 You Get

  • Pretty simple
  • No gotcha case as you add new kinds of worker (if you go dynamic)
  • Visibility via sidekiq-ui into how many of each type of job are waiting to run
  • The numbers associated with queue names are called weights, and you can use them to tell sidekiq to run specific queues a little more often than others by giving them higher numbers.

3.2 But

What if you need to backfill a bunch of data? You write a worker to do it, and you queue up 100,000 of those jobs.

# enqueue one backfill job per existing report
Report.where(type: "UpdatedReport").find_each do |report|
  UpdateReport.perform_async(report.id)
end

This is actually a fairly common use case. You already have a pool of workers that are (probably) not 100% utilized, so let's just use them to backfill this data. Common reasons are things like

  • Changing NoSQL document formats
  • Regenerating thumbnails as part of a site redesign

The main thing is that you want to do all of this in the background without making all your other jobs wait for these 100,000 backfill jobs to complete. You need to make sure they run at lower priority than everything else.

4 Explicitly Prioritized Queues

When you run into this, you need to take control of the relative priorities of your jobs. Some people will create high and low priority queues, but you'll remember that approach is very limiting when you want to figure out which kind of job is bloating your queue.

Remember that dynamic mode runs queues not listed in sidekiq.yml at the lowest priority, so you'll need to list all of your queues explicitly in order to make them higher priority than your data backfill jobs.

:concurrency: 5
:pidfile: tmp/pids/sidekiq.pid
staging:
  :concurrency: 10
production:
  :concurrency: 20
:queues:
  - book_face_frob_job
  - welcome_email_job
  - data_backfill_job

4.1 You Get

  • If you list queues, those will run before unlisted queues. List your "known" important jobs, and those backfill jobs won't interfere with them.
  • The sidekiq UI and API will let you see which job type is filling up your queue, and, in the case of slow jobs, which jobs are actually occupying your workers.
  • If you explicitly list your jobs and turn off dynamic mode, you can stop processing a specific job type when it causes problems by commenting out its line in the config file and redeploying.
  • Keeping dynamic mode turned off also forces you to keep your sidekiq.yml file up to date.

4.2 But

This is a lot more complicated. Trying to reason about queue priorities is surprisingly hard.

  • If you only have a couple jobs, this is probably overkill.
  • sidekiq docs say you should avoid having dozens of queues. As a point of reference, we have one app with 36 queues which works very nicely.
  • If a higher priority job starts slowing down even a little, it can starve out the lower priority jobs. With equal priority jobs, a minor slowdown of one job type may not even affect the other jobs if there aren't too many of the slow jobs, or they don't slow down too much. With strict priorities it's a lot easier for one job to block all of the others.

4.3 Reasoning About Queue Priorities

Imagine a process that depends on running job A, which fans out 50 of job B, then job C finishes up and combines all the results.
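
As a concrete sketch (using the worker names from the config further down; ParallelWorkerJob and ResultsCompilerJob are assumed to be defined the same way):

# "job A": kicks off the whole process
class ProcessStartJob
  include Sidekiq::Worker
  sidekiq_options queue: "process_start_job"

  def perform(batch_id)
    # fan out 50 of "job B"; whatever tracks their completion then
    # enqueues ResultsCompilerJob ("job C") to combine the results
    50.times { |i| ParallelWorkerJob.perform_async(batch_id, i) }
  end
end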

If job B is higher priority than C, the Bs can choke off all of the Cs under load so that none of these processes finish.

It's probably best to make things that go later in a process higher priority (C, B, A), but there may be times when it is best to start as early as possible (C, A, B, or even A, C, B).

Remember that you might well be running multiple of these ABC processes at the same time. If you queue up 1000 of the A jobs, do you want to run all of them first and have 50,000 B jobs in the queue? Do you want to do the first A, then all 50 Bs, then finish a C before starting the next A? Give this a lot of thought.

If, in addition to ABC, you have a second DEF process, then… life is complicated.

:concurrency: 5
:pidfile: tmp/pids/sidekiq.pid
staging:
  :concurrency: 10
production:
  :concurrency: 20
:queues:
  - parallel_worker_job
  - results_compiler_job
  - process_start_job

The good news is that if you've got all these jobs in separate queues it's a simple matter to reorder lines in the sidekiq.yml if you realize you got it wrong.

5 How To Not Overload Other Services

One thing that comes up fairly often is that you can run sidekiq jobs faster than some external service can handle them. That external service might be a rate limited API, another microservice within your own company, or even your own database. Just the other day I increased sidekiq worker concurrency on one app so that we went from 21 threads to 35 threads in the worker pool. One type of job that inserts data into a busy table suddenly slowed down from < 1 second to upwards of 100 seconds. They were blocking each other as postgresql tried to make sure all those inserts didn't violate unique constraints.

Luckily, with the sidekiq-limit_fetch gem you can have your cake (giant worker pool) and eat it too (not choke). It allows you to globally limit how many of a given job type are running at once, or to limit how many threads in each sidekiq process can run a certain job type.

The first will let you limit the impact on external services. I limited my misbehaving job to 20 simultaneous jobs and they went back to being just as fast as before, and the extra worker threads were able to crank through other job types while they did it.

The second, limiting jobs to a certain number of threads per process, can be good if a certain job type is especially memory or CPU heavy. Maybe you only have 1 gig of RAM on your workers, and running two MemoryHogJobs at once will make you run out of RAM.

:concurrency: 5
:pidfile: tmp/pids/sidekiq.pid
staging:
  :concurrency: 10
production:
  :concurrency: 20
:queues:
  - one_at_a_time_job
  - internal_lookup_job
  - memory_hog_job
  - welcome_email_job
  - data_backfill_job
:limits:
  one_at_a_time_job: 1
  internal_lookup_job: 5
:process_limits:
  memory_hog_job: 1
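
If I'm remembering the sidekiq-limit_fetch API correctly, the same limits can also be read and adjusted at runtime from a console, which is handy while you're tuning under load:

# with sidekiq-limit_fetch loaded, from a rails console or similar
Sidekiq::Queue["internal_lookup_job"].limit = 10    # loosen the global cap
Sidekiq::Queue["memory_hog_job"].process_limit = 1  # per-process cap
Sidekiq::Queue["one_at_a_time_job"].limit           # => 1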

5.1 You Get

  • You can safely write jobs which must never run two at a time. We do this when we want to avoid creating duplicate resources but don't want to try locking it down via database constraints.
  • You have a knob to turn to adjust how much load you send to other servers independent of the total amount of work you are doing. This means you can play nice with others in a way you couldn't before.

5.2 But

Your job system is even harder to reason about. Lower priority jobs can run even though there are higher priority jobs waiting in the queue. Thread limited jobs now become a resource that doesn't scale like the rest of your system.

6 Some Jobs Only Run On Certain Workers

This is the very complicated option. You deploy different sets of sidekiq workers with different sidekiq.yml files that each explicitly list different sets of jobs to process.

6.0.1 Bookface Worker Config

:concurrency: 5
:pidfile: tmp/pids/sidekiq.pid
staging:
  :concurrency: 10
production:
  :concurrency: 20
:queues:
  - bookface_wall_scribbler
  - bookface_follower_poker

6.0.2 Snapstagram Worker Config

:concurrency: 5
:pidfile: tmp/pids/sidekiq.pid
staging:
  :concurrency: 10
production:
  :concurrency: 20
:queues:
  - snapstagram_liker
  - snapstagram_feed_watcher
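
Each pool then gets started against its own file with sidekiq's -C option; the file names here are just an assumption about how you might organize them:

bundle exec sidekiq -C config/sidekiq_bookface.yml
bundle exec sidekiq -C config/sidekiq_snapstagram.yml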

6.1 Why Would You Do This?

What if you have jobs A-E that all depend on Bookface APIs, and jobs F-J that all depend on Snapstagram APIs? If you break them up into separate sets of workers, then a Bookface problem won't kill your Snapstagram support.

Further, when you're trying to reason about how different jobs impact each other your scope is much smaller. It's easier to figure out the relative priorities between 5 jobs than it is between 20 jobs.

You pay the price by having to manage two different worker pools, and you double the inefficiency of your setup. Every background worker pool is somewhat inefficient: either you constantly run a work backlog (which, in practice, will occasionally start growing and cause trouble) or you occasionally have idle worker threads - unused capacity. If you have two or more pools of workers, then each pool has its own unused worker thread inefficiency.

7 Suggestions For Keeping It Simple

As you can gather from the discussion above, systems with lots and lots of sidekiq jobs can get really complicated. These wind up being large concurrent distributed systems, and any one of those adjectives can make a system hard to reason about. Here are a few ways to make that easier.

7.1 SRP, But For Jobs

Avoid jobs that check their parameters and then do completely different things based on them. If your perform method looks like

# run an arbitrary method on an arbitrary class
def perform(klass, meth, args)
  klass.constantize.send(meth, *args)
end

or

def perform(job_id)
  job = Job.find(job_id)
  if job.network == "bookface"
    # use bookface API...
  elsif job.network == "snapstagram"
    # use snapstagram....
  end
end

then you're violating this principle, and it's even worse than writing a method like this in normal code.
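
A sketch of the direction I'd go instead: one small worker class per network, each with its own queue and its own failure behavior (the class names are invented for the example):

class BookfaceJob
  include Sidekiq::Worker
  sidekiq_options queue: "bookface_job"

  def perform(job_id)
    # use the Bookface API...
  end
end

class SnapstagramJob
  include Sidekiq::Worker
  sidekiq_options queue: "snapstagram_job"

  def perform(job_id)
    # use the Snapstagram API...
  end
end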

7.2 Avoid Jobs That Depend On Other Jobs

If you have jobs that depend on other jobs, suddenly the whole system gets harder to reason about. You need to start worrying about not just what priority the jobs are to the outside system, but what priority they are to each other. Unfortunately, being able to have jobs kick off other jobs and check back to see if that work is complete is a pretty powerful tool.

7.3 Polling Bad, Callbacks Good

If you do have those jobs that kick off other jobs and wait for them to finish, try to avoid having jobs requeue themselves and check for completion. You may need to do this when checking with an outside system, but internally it's much better to have job A kick off job B, which will then run job C when it's done. You can pass the identity and arguments for job C as arguments to job B if job B can be kicked off by lots of different kinds of jobs. This has the added benefit of separating out the "get started" and "finish up" portions of the work so they're easier to test.
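
Here's a minimal sketch of that callback style, assuming job B takes the follow-up job's class name and arguments as parameters (all of the names are invented, and constantize comes from ActiveSupport, as in the earlier example):

class JobB
  include Sidekiq::Worker

  def perform(record_id, callback_class, callback_args)
    do_the_work(record_id) # do_the_work stands in for whatever job B actually does

    # when B's work is done, kick off whatever "job C" the caller asked for
    callback_class.constantize.perform_async(*callback_args)
  end
end

# Job A enqueues B and names C as the finish-up step:
JobB.perform_async(record.id, "JobC", [record.id])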

Published on 07/08/2016 at 13h35
