Prevent worker from running same task repeatedly #2207
Merged
Description
When get_work returns a task_id that is already in the _running_tasks list, don't start a new task process for it.
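A minimal sketch of the idea, not Luigi's actual Worker class (the class, method names, and the `retry_sleep` knob here are illustrative): if the scheduler hands back a task_id that is already in `_running_tasks`, skip it and back off briefly instead of spawning a duplicate process.

```python
import time


class Worker:
    """Toy stand-in for a Luigi-style worker loop (names are hypothetical)."""

    def __init__(self, retry_sleep=1.0):
        self._running_tasks = {}         # task_id -> process handle
        self._retry_sleep = retry_sleep  # back-off to let the scheduler recover

    def _start_process(self, task_id):
        # Placeholder for spawning the real task process.
        self._running_tasks[task_id] = object()

    def handle_get_work(self, task_id):
        """Handle one get_work response; return True if a new process started."""
        if task_id is None:
            return False
        if task_id in self._running_tasks:
            # The scheduler re-issued a task we are already running:
            # don't overwrite the _running_tasks entry with a fresh
            # process, and sleep so the scheduler can catch up.
            time.sleep(self._retry_sleep)
            return False
        self._start_process(task_id)
        return True
```

Without the `task_id in self._running_tasks` check, each duplicate response would overwrite the same dict entry, so the process count in `_running_tasks` never grows and the worker's concurrency limit is never hit, which matches the runaway-process symptom described below.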
Motivation and Context
If the scheduler keeps handing the worker the same task_id from get_work, the worker just overwrites the same entry in _running_tasks and never realizes it is running too many processes.
I've had to reboot one of my worker machines almost daily because hundreds of worker processes were spawned to run the same task. This seems to happen mostly with batch tasks, probably because they require more rounds of communication before the scheduler understands that the task is running. Since it tends to happen only in short bursts, I've also added a sleep to give the scheduler a chance to recover.
This is still a bit worrisome because the scheduler could potentially hand the task to other workers, but it at least patches over the bug enough that my pipelines run smoothly again. My hope is that the issue is limited to the scheduler re-issuing task_ids it already assigned to this worker but that the worker isn't including in the current_tasks list it sends with get_work; in that case the scheduler will not send the task to other workers.
Have you tested this? If so, how?
I simulated a buggy scheduler in two ways: by tweaking the scheduler's RPC functions, and by tweaking the worker's RPC calls (not registering the batch runner) to trigger this bug reliably for batch tasks. In both cases, this fix does the trick.
I've also been running this in production for a couple of days and have included unit tests.