Prevent worker from running same task repeatedly #2207
Merged
Description
When get_work returns a task_id that is already in the _running_tasks list, don't start a new task process for it.
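A minimal sketch of the idea, not Luigi's actual Worker class (the class, method names, and the `retry_sleep` knob here are illustrative): if the scheduler hands back a task_id that is already in `_running_tasks`, skip it and back off briefly instead of spawning a duplicate process.

```python
import time


class Worker:
    """Toy stand-in for a Luigi-style worker loop (names are hypothetical)."""

    def __init__(self, retry_sleep=1.0):
        self._running_tasks = {}         # task_id -> process handle
        self._retry_sleep = retry_sleep  # back-off to let the scheduler recover

    def _start_process(self, task_id):
        # Placeholder for spawning the real task process.
        self._running_tasks[task_id] = object()

    def handle_get_work(self, task_id):
        """Handle one get_work response; return True if a new process started."""
        if task_id is None:
            return False
        if task_id in self._running_tasks:
            # The scheduler re-issued a task we are already running:
            # don't overwrite the _running_tasks entry with a fresh
            # process, and sleep so the scheduler can catch up.
            time.sleep(self._retry_sleep)
            return False
        self._start_process(task_id)
        return True
```

Without the `task_id in self._running_tasks` check, each duplicate response would overwrite the same dict entry, so the process count in `_running_tasks` never grows and the worker's concurrency limit is never hit, which matches the runaway-process symptom described below.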
Motivation and Context
If the scheduler keeps handing the worker the same task_id from get_work, the worker just overwrites the same entry in _running_tasks and never realizes it is running too many processes.
I've had to reboot one of my worker machines almost daily because hundreds of worker processes were spawned to run the same task. This seems to happen mostly with batch tasks, probably because they require more rounds of communication before the scheduler understands that the task is running. Since it tends to happen only in short bursts, I've also added a sleep to give the scheduler a chance to recover.
This is still a bit worrisome because the scheduler could potentially hand the task to other workers, but it at least patches over the bug enough that my pipelines run smoothly again. My hope is that the issue is limited to the scheduler re-issuing task_ids it already assigned to this worker but that the worker isn't including in the current_tasks list it sends with get_work; in that case the scheduler will not send the task to other workers.
Have you tested this? If so, how?
I simulated a buggy scheduler in two ways: by tweaking the scheduler's RPC functions, and by tweaking the worker's RPC calls (not registering the batch runner) to trigger this bug reliably for batch tasks. In both cases, this fix does the trick.
I've also been running this in production for a couple of days and have included unit tests.