
Prevent worker from running same task repeatedly #2207

Merged
Tarrasch merged 1 commit into spotify:master from daveFNbuck:no_process_leak on Aug 12, 2017

Conversation

daveFNbuck (Contributor) commented

Description

When get_work returns a task_id already in the _running_tasks list, don't start running a new task.

Motivation and Context

If the scheduler keeps giving the worker the same task_id from get_work, the worker will just overwrite the same entry in _running_tasks and never realize that it is running too many processes.

I've had to reboot one of my worker machines almost daily due to hundreds of worker processes being spawned to run the same task. This seems to happen mostly with batch tasks, probably because they require more rounds of communication before the scheduler understands that the task is running. Since this tends to happen only in short bursts, I've also added a sleep to give the scheduler a chance to recover.

This is still a bit worrisome because the scheduler could potentially give the task to other workers, but it at least patches over the bug enough that my pipelines can run smoothly again. My hope is that the issue is limited to the scheduler re-issuing task_ids that it has already assigned to this worker but that the worker isn't including in its current_tasks list in get_work; in that case, the scheduler will not send the task to other workers.
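
To make that concrete, here is a minimal sketch of the guard described above. It is written against a simplified stand-in for luigi.worker.Worker, not the merged diff; the _start_task_process helper and the 1–10 second back-off are illustrative assumptions.

```python
import logging
import random
import time

logger = logging.getLogger('luigi-interface')


class Worker(object):
    """Simplified stand-in for luigi.worker.Worker, showing only the new guard."""

    def __init__(self):
        self._running_tasks = {}    # task_id -> process handle for tasks we started
        self._scheduled_tasks = {}  # task_id -> task object known to this worker

    def _start_task_process(self, task):
        # Placeholder for forking the real task-runner process.
        logger.info('starting process for %s', task)
        return object()

    def _run_task(self, task_id):
        # Guard: if get_work hands back a task_id we are already running,
        # don't spawn another process for it. Sleep briefly so a confused
        # scheduler gets a chance to recover before the next get_work call.
        if task_id in self._running_tasks:
            logger.debug('got already-running task id %s from scheduler, taking a break', task_id)
            time.sleep(random.uniform(1, 10))  # illustrative back-off, not the real config value
            return

        task = self._scheduled_tasks[task_id]
        self._running_tasks[task_id] = self._start_task_process(task)
```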

Have you tested this? If so, how?

I simulated a buggy scheduler in two ways: by tweaking the scheduler RPC functions, and by tweaking the worker's RPC calls so that the batch runner is never registered, which triggers this bug reliably for batch tasks. In both cases, this fix does the trick.

I've also been running this in production for a couple of days and have included unit tests.
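
As a rough illustration of the kind of unit test this enables (hypothetical, exercising the simplified Worker sketch above rather than the real luigi.worker.Worker or the tests actually included in the PR):

```python
import unittest

# Assumes the simplified Worker sketch above is importable; the module name is hypothetical.
from simplified_worker import Worker


class RepeatedGetWorkTest(unittest.TestCase):
    def test_same_task_id_twice_starts_one_process(self):
        worker = Worker()
        worker._scheduled_tasks['A'] = 'task A'

        # Simulate a buggy scheduler that keeps returning the same task_id.
        worker._run_task('A')
        worker._run_task('A')  # second call should hit the guard (and sleep briefly)

        self.assertEqual(len(worker._running_tasks), 1)


if __name__ == '__main__':
    unittest.main()
```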

@Tarrasch Tarrasch (Contributor) left a comment

Not sure why this happens, but whenever it does, this seems like a good defense mechanism. :)

@daveFNbuck (Contributor Author)

Yeah, I'm still investigating the cause.

@daveFNbuck (Contributor Author)

I've found a somewhat satisfying explanation for this. It seems to occur when exceptions are thrown. I had a few common uncaught exceptions happening in the scheduler (PRs upcoming). If exceptions are being thrown in the scheduler during get_work, it makes sense that the scheduler might end up with a weird internal state.

Since fixing the uncaught exceptions, I've only seen this error happen twice. Both times, it occurred at about the same time that the keep-alive thread raised a KeyError trying to look up response["rpc_messages"]. Not sure how this can happen; maybe some wires are getting crossed and responses are being read by the wrong thread? I'll try to catch the error and log the response in case it happens again.
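
A sketch of what catching and logging that response could look like; the helper name and call site are assumptions, since the real keep-alive code path isn't shown here:

```python
import logging

logger = logging.getLogger('luigi-interface')


def rpc_messages_from(response):
    """Hypothetical helper: read rpc_messages from a scheduler response,
    logging the full response instead of crashing the keep-alive thread
    when the key is unexpectedly missing."""
    try:
        return response["rpc_messages"]
    except KeyError:
        logger.exception("scheduler response is missing rpc_messages: %r", response)
        return []
```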

@Tarrasch Tarrasch merged commit a69dffb into spotify:master Aug 12, 2017
@Tarrasch (Contributor)

Thanks @daveFNbuck

@daveFNbuck daveFNbuck deleted the no_process_leak branch August 12, 2017 17:47