
Dequeuing should be first-in first-out #651

Merged · 3 commits merged into bensheldon:main · Jul 11, 2022

Conversation

@jrochkind (Contributor) commented Jun 30, 2022:

Connects to #624.

For the sake of having something concrete to discuss!

Secondary to priority: within a given priority (including null priority), dequeue in first-in first-out order by sorting on created_at.

FIFO is how all other ActiveJob backends behave (so far as I know), and it is kind of implied by the very word "queue". It will be less surprising to those coming from other ActiveJob backends, and the fact that every other queue behaves this way is probably a signal that there are good reasons for it. (I can certainly think of reasons it's desirable in my own use cases.)
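
Concretely, the ordering I'm describing is roughly this (the model name and the DESC direction for priority are illustrative assumptions, not necessarily the exact code in the PR):

```ruby
# Sort by priority first, then break ties (including NULL priority)
# by created_at, oldest first.
GoodJob::Execution
  .order(Arel.sql("priority DESC NULLS LAST"))
  .order(created_at: :asc)
```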

  • This PR does not yet include any tests. Any advice about the best place/way to test the ordering of dequeue? I looked for an existing test on dequeue ordering by priority to use as a model, but didn't find one. If one doesn't exist yet, it would be good to add that too. I think we do want tests on dequeue ordering. (A rough sketch of what such a test might look like follows this list.)

  • I tried to enhance the existing benchmark script for this case. It does seem to suggest that including this sort is a bit slower, which surprised me. I'm not sure to what extent dequeue speed is really a bottleneck in most real use cases, though; it probably depends on how many jobs are in the queue. If we do want to speed it up, perhaps there is a way to do so with indexing. There's already a created_at index, but maybe it can't be used for this particular sort; I'm not really a pg index expert.
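
For the testing question in the first bullet, here's roughly the kind of spec I have in mind -- a minimal sketch, assuming the model is `GoodJob::Execution` and the ordering is exposed as a scope I'm calling `priority_ordered` (both names are placeholders, not the real API):

```ruby
RSpec.describe GoodJob::Execution do
  it "dequeues jobs of equal priority in first-in, first-out order" do
    # `priority_ordered` is a placeholder name for whatever scope/method
    # actually applies the dequeue ordering.
    older = described_class.create!(priority: 10, created_at: 2.minutes.ago)
    newer = described_class.create!(priority: 10, created_at: 1.minute.ago)

    expect(described_class.priority_ordered.to_a).to eq([older, newer])
  end
end
```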

Ideally I think FIFO would just be standard always, as here, because I think of it as table stakes for an ActiveJob queue. But if we do need to make it configurable because of perf reasons... perhaps make paying attention to priority configurable too? Not sure.

@jrochkind (Contributor, Author) commented Jun 30, 2022:

Warning: I'm actually going on vacation at the end of the day today, and won't be back until Monday, July 11th, so I won't respond to any feedback/comments until then, sorry!

Also, benchmarking: locally, I tried adding a fourth benchmarking case for JUST created_at, without priority, and in that run the benchmarker said all four cases were same-ish: the differences fall within error. I think we have more work to do on the benchmarker; I'm not sure what's going on here. It may be that even 10_000 records in the db isn't enough to see any real difference; you might see it at millions instead.
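
For reference, the comparison I'm describing is roughly this shape with benchmark-ips (the relation being queried is an illustrative stand-in for whatever the benchmark script actually runs):

```ruby
require "benchmark/ips"

Benchmark.ips do |x|
  # Illustrative dequeue-ordering cases; the model and columns are stand-ins.
  x.report("no sort")           { GoodJob::Execution.limit(1).load }
  x.report("priority only")     { GoodJob::Execution.order(priority: :desc).limit(1).load }
  x.report("FIFO only")         { GoodJob::Execution.order(created_at: :asc).limit(1).load }
  x.report("priority and FIFO") { GoodJob::Execution.order(priority: :desc, created_at: :asc).limit(1).load }
  x.compare!
end
```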

@jrochkind (Contributor, Author) commented Jun 30, 2022:

OK, I tried actually adding a million records to the db (haven't committed this to this PR), and saw a difference, like so...

Calculating -------------------------------------
       priority only      6.612  (± 0.0%) i/s -     33.000  in   5.009988s
             no sort     10.065  (± 9.9%) i/s -     51.000  in   5.081519s
   priority and FIFO      3.237  (± 0.0%) i/s -     17.000  in   5.260931s
           FIFO only      6.643  (±15.1%) i/s -     33.000  in   5.014835s

Comparison:
             no sort:       10.1 i/s
           FIFO only:        6.6 i/s - 1.52x  (± 0.00) slower
       priority only:        6.6 i/s - 1.52x  (± 0.00) slower
   priority and FIFO:        3.2 i/s - 3.11x  (± 0.00) slower

Maybe it's just the result of having two sort conditions instead of one, since both of the single-sort orderings were identical? Maybe having more than one sort ordering (SQL ORDER BY with a comma) is inherently slightly slower like this, and there's nothing to be done?

I suspect this isn't going to be an issue with fewer than hundreds of thousands of enqueued records. I also wonder how much "speed to dequeue" matters in total execution time -- good_job already isn't trying to compete on "absolute fastest ActiveJob queue at dequeuing records from the queue".

Do those results really mean it can only dequeue 6 per second with the priority sort, the current behavior? That seems surprisingly slow! I guess that's with a million records in the queue, though... Going back to only 10K records in the queue, it's more like 60 dequeues per second (still not winning any contests with sidekiq, but I don't think it matters), and the cases aren't very different from each other; adding the second sort order makes slight-to-no difference. It's possible we could add indexes to improve the million-record performance, not sure.
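
If we did want to try indexes, the obvious candidate seems like a composite index matching the two-column sort -- something like this hypothetical migration (untested; the column directions are a guess at what the planner would want, not a measured recommendation):

```ruby
class AddDequeueOrderingIndexToGoodJobs < ActiveRecord::Migration[6.1]
  disable_ddl_transaction!

  def change
    # Composite index matching `ORDER BY priority DESC NULLS LAST, created_at ASC`,
    # so Postgres could potentially read rows already in dequeue order
    # instead of sorting the whole candidate set.
    add_index :good_jobs, [:priority, :created_at],
              order: { priority: "DESC NULLS LAST", created_at: :asc },
              algorithm: :concurrently,
              name: :index_good_jobs_on_priority_and_created_at
  end
end
```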

@bensheldon (Owner) commented:

@jrochkind thank you thank you! This looks great. I'll do a little more poking and get this merged. I'm happy to finish it out if I discover anything that needs tweaking.

@jrochkind (Contributor, Author) commented:

Thanks!

I really do think we should have some tests -- having tests for order of dequeue will help me with my next desired PR, to do queue-priority à la resque and sidekiq, as we were talking about previously.

I just wasn't able to figure out how to do those tests yet in the time I have today!

@jrochkind (Contributor, Author) commented:

Hi @bensheldon , I'm back from a week off. Thanks for adding the test!

I'm curious what your current thoughts on this one are? And is there anything else I can do on it?

@bensheldon (Owner) commented:

@jrochkind welcome back! Hope you had a restful week off 🌴

Let's do it. I think FIFO is a good improvement. I'm not too worried about performance because I don't think it makes performance meaningfully worse than it already is at huge queue sizes. We should maybe open a separate Discussion for that.

@bensheldon bensheldon merged commit 248bb98 into bensheldon:main Jul 11, 2022
@jrochkind (Contributor, Author) commented:

Thanks!

If I have time, I'm going to follow up with a more complex PR for "ordered queues" too, which I think is pretty doable and will only have a performance implication if someone chooses to use it -- it'll be something you opt into for a given worker, similar to how it works in resque and sidekiq.

@jrochkind (Contributor, Author) commented Jul 11, 2022:

Btw, my intuition says there MIGHT be a way to use FIFO to actually make performance BETTER at huge queue sizes than it was previously, since there should be a way to make pg use the created_at index -- but it MIGHT require turning off the priority feature. Like maybe you turn off priority via configuration -- workers are then only FIFO (while still working specific queues) and no longer pay attention to priority, and in return you get improved performance for large queues. (Not every other ActiveJob back-end supports priorities anyway; some do, some don't.)
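
To make that idea concrete, the shape of it would be something like this (a sketch only -- `fifo_only` is a made-up setting, not a real GoodJob option):

```ruby
# With only created_at in the ORDER BY and a LIMIT, Postgres can walk the
# existing created_at index and stop early, instead of sorting all candidates.
order_clause =
  if fifo_only # hypothetical configuration flag
    { created_at: :asc }
  else
    Arel.sql("priority DESC NULLS LAST, created_at ASC")
  end

GoodJob::Execution.order(order_clause).limit(1)
```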

@bensheldon (Owner) commented:

@jrochkind I'm colder on the ordered-queues option. Having had more discussion in #624, I think that's trying to recreate behavior that isn't necessary with GoodJob because GoodJob has job-level priority, whereas Sidekiq and Resque only have queue-level priority.

@jrochkind (Contributor, Author) commented:

I'll go discuss more there.

@jrochkind (Contributor, Author) commented:

I will give you my use case for this here, because it is related to "well, can't you just add more capacity" -- your original argument against FIFO, I think.

I have very spiky job patterns: there can be a while with few jobs, then a period with a lot of jobs at once. I have my workers on Heroku with an autoscaler from HireFire. So there might be only one worker dyno running normally, but if the autoscaler sees a lot of work building up in the queue, it scales up more worker dynos.

But it can take a few minutes for the dynos to scale up. This might be fine, because the load might be expected to tolerate something like "ten minutes" of latency. Or in pathological cases it might take even longer -- it might scale up a few dynos, see that the queue has gotten even BIGGER, then scale up a few more (wait a few more minutes), wash, rinse, repeat. It could be 20+ minutes until all the workers are fully scaled up to handle the load.

That's okay in exceptional situations -- but during that time I want FIFO. I would not want something that came in 30 minutes ago to still, by chance, be waiting in the queue while something that came in 1 second ago gets handled right away. The latency of the auto-scaler is acceptable because the queue is FIFO: even if it takes 20 minutes to get the resources fully allocated, that doesn't necessarily mean any job was waiting 20 minutes -- the oldest jobs are being handled first, and the jobs still waiting are the ones that came in later. It significantly reduces median and mean job latency in circumstances where the auto-scaler is catching up with capacity.

@bensheldon added the "enhancement" label on Jul 26, 2022