adding a new job to the queue when a busy worker is near the max job limit can spawn a second competing worker #219

Closed
ssube opened this issue Mar 6, 2023 · 1 comment
Labels: scope/api, status/fixed, type/bug

ssube commented Mar 6, 2023

When a device's worker is one job below the max_jobs_per_worker limit (for example, it has completed 9 jobs with a limit of 10) and is already running a job, adding a new job to that device's queue can spawn a second, competing worker that tries to use the same device.

It looks like the worker.join() call times out, and there is no fallback to .terminate() because that was breaking the queues on Windows. The pool.recycle() method otherwise handles everything well and spawns a new worker, which then shares the device with the old worker until they both run out of VRAM.
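To illustrate the failure mode, here is a minimal sketch, assuming the workers are `multiprocessing.Process` objects (or anything else with `join()` and `is_alive()`). `Process.join(timeout)` returns `None` whether or not the process stopped, so the caller has to check `is_alive()` afterwards. The names `stopped_within`, `recycle`, and `start_worker` are hypothetical and are not the project's actual `pool.recycle()`.

```python
def stopped_within(worker, timeout=5.0):
    """Wait up to `timeout` seconds and report whether the worker really exited.

    join() gives no indication of a timeout, so is_alive() is checked afterwards.
    """
    worker.join(timeout)
    return not worker.is_alive()


def recycle(workers, device, start_worker, timeout=5.0):
    """Replace the worker for `device`, mirroring the behavior described above.

    Returns True when the old worker was still alive after the join timed out,
    which means two workers now share the same device.
    """
    old = workers.get(device)
    overlap = old is not None and not stopped_within(old, timeout)
    # No terminate() fallback here, matching the issue: the new worker is
    # started even if the old one has not exited yet.
    workers[device] = start_worker(device)
    return overlap
```

A caller could log or count the `True` results to see how often the overlap actually happens.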

ssube added the status/new, type/bug, and scope/api labels Mar 6, 2023
ssube added this to the v0.9 milestone Mar 6, 2023
ssube modified the milestones: v0.9, v0.8 Mar 6, 2023
ssube added the status/progress label and removed the status/new label Mar 6, 2023

ssube commented Mar 6, 2023

I've tried a half-dozen different methods for having the server process kill the workers, but none of them seem to be effective short of using .terminate().

What does work is having the worker exit() itself when it is no longer the primary worker for that device. That's easy enough to track in the server process with a Value containing the PID of the primary/only worker for that device. Before the worker starts each job, it checks to make sure that it is the current worker, and exits if not.

Memory use can be elevated briefly while the two workers co-exist, but the older worker reliably exits, which frees its memory and leaves it for the newer one.
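The check described above could be sketched roughly as follows. This is an illustration under assumptions, not the code from the fix: `is_current_worker`, `worker_loop`, `run_job`, and `poll_timeout` are hypothetical names, and `current_pid` is assumed to be a `multiprocessing.Value` (for example `Value("L", 0)`) that the server sets to the new worker's PID whenever it recycles a worker.

```python
import os
from queue import Empty


def is_current_worker(current_pid, my_pid=None):
    """Return True if this process is still the primary worker for its device."""
    if my_pid is None:
        my_pid = os.getpid()
    return current_pid.value == my_pid


def worker_loop(jobs, current_pid, run_job, my_pid=None, poll_timeout=1.0):
    """Run jobs until the queue stays empty or this worker has been replaced.

    Returns (jobs_completed, was_replaced). A real worker process would
    call sys.exit() instead of returning when it finds it was replaced.
    """
    completed = 0
    while True:
        # Check before starting each job, so a job already in progress
        # is allowed to finish before the old worker gives up the device.
        if not is_current_worker(current_pid, my_pid):
            return completed, True
        try:
            job = jobs.get(timeout=poll_timeout)
        except Empty:
            return completed, False
        run_job(job)
        completed += 1
```

Because the check only runs between jobs, the old worker keeps its memory until its current job ends, which matches the short overlap mentioned above.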

ssube added the status/fixed label and removed the status/progress label Mar 6, 2023
ssube closed this as completed Mar 7, 2023