
Dequeuing should be first-in first-out #651

Merged · 3 commits merged into bensheldon:main · Jul 11, 2022

Conversation

@jrochkind (Contributor) commented Jun 30, 2022:

Connects to #624.

For the sake of having something concrete to discuss!

Secondary to priority: within a given priority (including null priority), dequeue in first-in first-out order by sorting on created_at.

FIFO is how all other ActiveJob backends behave (so far as I know), and it is kind of implied by the very word "queue". It will be less surprising to those coming from other ActiveJob backends, and the fact that every other queue behaves this way is probably a signal that there are good reasons for it. (I can certainly think of reasons it's desirable in my own use cases.)
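
Concretely, the ordering I'm describing is roughly this (the model name and the DESC direction for priority are illustrative assumptions, not necessarily the exact code in the PR):

```ruby
# Sort by priority first, then break ties (including NULL priority)
# by created_at, oldest first.
GoodJob::Execution
  .order(Arel.sql("priority DESC NULLS LAST"))
  .order(created_at: :asc)
```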

  • This PR does not yet include any tests. Any advice about the best place/way to test the ordering of dequeue? I looked for an existing test on dequeue ordering by priority to use as a model, but didn't find one. If one doesn't exist yet, it would be good to add that too. I think we do want tests on dequeue ordering. (A rough sketch of what such a test might look like follows this list.)

  • I tried to enhance the existing benchmark script for this case. It does seem to suggest that including this sort is a bit slower, which surprised me. I'm not sure to what extent dequeue speed is really a bottleneck in most real use cases, though; it probably depends on how many jobs are in the queue. If we do want to speed it up, perhaps there is a way to do so with indexing. There's already a created_at index, but maybe it can't be used for this particular sort; I'm not really a pg index expert.
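
For the testing question in the first bullet, here's roughly the kind of spec I have in mind -- a minimal sketch, assuming the model is `GoodJob::Execution` and the ordering is exposed as a scope I'm calling `priority_ordered` (both names are placeholders, not the real API):

```ruby
RSpec.describe GoodJob::Execution do
  it "dequeues jobs of equal priority in first-in, first-out order" do
    # `priority_ordered` is a placeholder name for whatever scope/method
    # actually applies the dequeue ordering.
    older = described_class.create!(priority: 10, created_at: 2.minutes.ago)
    newer = described_class.create!(priority: 10, created_at: 1.minute.ago)

    expect(described_class.priority_ordered.to_a).to eq([older, newer])
  end
end
```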

Ideally I think FIFO would just be standard always, as here, because I think of it as table stakes for an ActiveJob queue. But if we do need to make it configurable because of perf reasons... perhaps make paying attention to priority configurable too? Not sure.

@jrochkind (Contributor, Author) commented Jun 30, 2022:

Warning: I'm actually going on vacation at the end of the day today, and won't be back until Monday, July 11th, so I won't respond to any feedback/comments until then, sorry!

Also, benchmarking: locally, I tried adding a fourth benchmarking case for JUST created_at, without priority, and in that run the benchmarker said all four cases were same-ish: the differences fall within error. I think we have more work to do on the benchmarker; I'm not sure what's going on here. It may be that even 10_000 records in the db isn't enough to see any real difference; you might see it at millions instead.
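
For reference, the comparison I'm describing is roughly this shape with benchmark-ips (the relation being queried is an illustrative stand-in for whatever the benchmark script actually runs):

```ruby
require "benchmark/ips"

Benchmark.ips do |x|
  # Illustrative dequeue-ordering cases; the model and columns are stand-ins.
  x.report("no sort")           { GoodJob::Execution.limit(1).load }
  x.report("priority only")     { GoodJob::Execution.order(priority: :desc).limit(1).load }
  x.report("FIFO only")         { GoodJob::Execution.order(created_at: :asc).limit(1).load }
  x.report("priority and FIFO") { GoodJob::Execution.order(priority: :desc, created_at: :asc).limit(1).load }
  x.compare!
end
```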

@jrochkind (Contributor, Author) commented Jun 30, 2022:

OK, I tried actually adding a million records to the db (haven't committed this to this PR), and saw a difference, like so...

Calculating -------------------------------------
       priority only      6.612  (± 0.0%) i/s -     33.000  in   5.009988s
             no sort     10.065  (± 9.9%) i/s -     51.000  in   5.081519s
   priority and FIFO      3.237  (± 0.0%) i/s -     17.000  in   5.260931s
           FIFO only      6.643  (±15.1%) i/s -     33.000  in   5.014835s

Comparison:
             no sort:       10.1 i/s
           FIFO only:        6.6 i/s - 1.52x  (± 0.00) slower
       priority only:        6.6 i/s - 1.52x  (± 0.00) slower
   priority and FIFO:        3.2 i/s - 3.11x  (± 0.00) slower

Maybe it's just the result of having two sort conditions instead of one, since both of the single-sort orderings were identical? Maybe having more than one sort ordering (SQL ORDER BY with a comma) is inherently slightly slower like this, and there's nothing to be done?

I suspect this isn't going to be an issue with fewer than hundreds of thousands of enqueued records. I also wonder how much "speed to dequeue" matters in total execution time -- good_job already isn't trying to compete on "absolute fastest ActiveJob queue at dequeuing records from the queue".

Do those results really mean it can only dequeue 6 per second with the priority sort, the current behavior? That seems surprisingly slow! I guess that's with a million records in the queue, though... Going back to only 10K records in the queue, it's more like 60 dequeues per second (still not winning any contests with sidekiq, but I don't think it matters), and the cases aren't very different from each other; adding the second sort order makes slight-to-no difference. It's possible we could add indexes to improve the million-record performance, not sure.
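
If we did want to try indexes, the obvious candidate seems like a composite index matching the two-column sort -- something like this hypothetical migration (untested; the column directions are a guess at what the planner would want, not a measured recommendation):

```ruby
class AddDequeueOrderingIndexToGoodJobs < ActiveRecord::Migration[6.1]
  disable_ddl_transaction!

  def change
    # Composite index matching `ORDER BY priority DESC NULLS LAST, created_at ASC`,
    # so Postgres could potentially read rows already in dequeue order
    # instead of sorting the whole candidate set.
    add_index :good_jobs, [:priority, :created_at],
              order: { priority: "DESC NULLS LAST", created_at: :asc },
              algorithm: :concurrently,
              name: :index_good_jobs_on_priority_and_created_at
  end
end
```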

@bensheldon (Owner) commented:

@jrochkind thank you thank you! This looks great. I'll do a little more poking and get this merged. I'm happy to finish it out if I discover anything that needs tweaking.

@jrochkind (Contributor, Author) commented:

Thanks!

I really do think we should have some tests -- having tests for order of dequeue will help me with my next desired PR, to do queue-priority à la resque and sidekiq, as we were talking about previously.

I just wasn't able to figure out how to do those tests yet in the time I have today!

@jrochkind (Contributor, Author) commented:

Hi @bensheldon , I'm back from a week off. Thanks for adding the test!

I'm curious what your current thoughts on this one are? And is there anything else I can do on it?

@bensheldon (Owner) commented:

@jrochkind welcome back! Hope you had a restful week off 🌴

Let's do it. I think FIFO is a good improvement. I'm not too worried about performance because I don't think it makes performance meaningfully worse than it already is at huge queue sizes. We should maybe open a separate Discussion for that.

@bensheldon bensheldon merged commit 248bb98 into bensheldon:main Jul 11, 2022
@jrochkind (Contributor, Author) commented:

Thanks!

If I have time, I'm going to follow up with a more complex PR for "ordered queues" too, which I think is pretty doable and will only have a performance implication if someone chooses to use it -- it'll be something you opt into for a given worker, similar to how it works in resque and sidekiq.

@jrochkind (Contributor, Author) commented Jul 11, 2022:

Btw, my intuition says there MIGHT be a way to use FIFO to actually make performance BETTER at huge queue sizes than it was previously, since there should be a way to make pg use the created_at index -- but it MIGHT require turning off the priority feature. Like maybe you turn off priority via configuration -- workers are then only FIFO (while still working specific queues) and no longer pay attention to priority, and in return you get improved performance for large queues. (Not every other ActiveJob back-end supports priorities anyway; some do, some don't.)
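
To make that idea concrete, the shape of it would be something like this (a sketch only -- `fifo_only` is a made-up setting, not a real GoodJob option):

```ruby
# With only created_at in the ORDER BY and a LIMIT, Postgres can walk the
# existing created_at index and stop early, instead of sorting all candidates.
order_clause =
  if fifo_only # hypothetical configuration flag
    { created_at: :asc }
  else
    Arel.sql("priority DESC NULLS LAST, created_at ASC")
  end

GoodJob::Execution.order(order_clause).limit(1)
```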

@bensheldon (Owner) commented:

@jrochkind I'm colder on the ordered-queues option. Having had more discussion in #624, I think that's trying to recreate behavior that isn't necessary with GoodJob because GoodJob has job-level priority, whereas Sidekiq and Resque only have queue-level priority.

@jrochkind (Contributor, Author) commented:

I'll go discuss more there.

@jrochkind (Contributor, Author) commented:

I will give you my use case for this here, because it is related to "well, can't you just add more capacity" -- your original argument against FIFO, I think.

I have very spiky job patterns: there can be a while with few jobs, then a period with a lot of jobs at once. I have my workers on Heroku with an autoscaler from HireFire. So there might be only one worker dyno running normally, but if the autoscaler sees a lot of work building up in the queue, it scales up more worker dynos.

But it can take a few minutes for the dynos to scale up. This might be fine, because the load might be expected to tolerate something like "ten minutes" of latency. Or in pathological cases it might take even longer -- it might scale up a few dynos, see that the queue has gotten even BIGGER, then scale up a few more (wait a few more minutes), wash, rinse, repeat. It could be 20+ minutes until all the workers are fully scaled up to handle the load.

That's okay in exceptional situations -- but during that time I want FIFO. I would not want something that came in 30 minutes ago to still, by chance, be waiting in the queue while something that came in 1 second ago gets handled right away. The latency of the auto-scaler is acceptable because the queue is FIFO: even if it takes 20 minutes to get the resources fully allocated, that doesn't necessarily mean any job was waiting 20 minutes -- the oldest jobs are being handled first, and the jobs still waiting are the ones that came in later. It significantly reduces median and mean job latency in circumstances where the auto-scaler is catching up with capacity.

@bensheldon added the "enhancement" label on Jul 26, 2022