
Update local locking logic #1842

Merged: 2 commits merged into spotify:master from locking-behavior on Sep 12, 2016

Conversation

@Tarrasch (Contributor) commented Sep 6, 2016

Motivation and Context

This patch addresses some issues with local locking, that is, the locking
that deals with the .pid files typically found in each machine's
/tmp/luigi folder.

There's one bugfix (it can massively improve startup performance):

  • Don't keep old entries in the .pid files

There's one change in logic:

  • When running with --take-lock, only allow one "additional" process to
    be run; the previous behavior was that --take-lock tasks could always
    keep running. With this new behavior, it could perhaps become possible
    to always cron your command lines with --take-lock.

Lastly, the code has been heavily refactored. Although it is now longer, I
find its readability has drastically improved.
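
To make the mechanism concrete, here is a rough, self-contained sketch of pid-file locking of this kind. The names (acquire_local_lock, _is_running) and the structure are illustrative only and do not mirror the actual luigi.lock code; the sketch just shows the two behaviours described above: stale entries are pruned on every acquisition, and the --take-lock path gets exactly one extra slot.

```python
# Illustrative sketch -- hypothetical names, not the actual luigi.lock module.
import os


def _is_running(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence/permissions
        return True
    except OSError:
        return False


def acquire_local_lock(pid_file, num_available=1, kill_signal=None):
    """Try to acquire the per-command lock backed by a shared .pid file.

    * Entries for dead processes are dropped when the file is rewritten,
      so it no longer accumulates stale PIDs (the bugfix).
    * If kill_signal is given (the --take-lock case), the signal is sent
      to the processes already holding the lock and exactly one extra
      slot is allowed on top of num_available (the behaviour change).
    """
    live_pids = set()
    if os.path.exists(pid_file):
        with open(pid_file) as f:
            for line in f:
                line = line.strip()
                if line.isdigit() and _is_running(int(line)):
                    live_pids.add(int(line))

    if kill_signal is not None:
        for pid in live_pids:
            os.kill(pid, kill_signal)  # e.g. signal.SIGUSR1: stop taking new work
        num_available += 1             # allow exactly one extra "taker"

    if len(live_pids) >= num_available:
        return False  # enough processes of this command are already running

    # Rewrite the file with only the live PIDs plus our own, pruning old entries.
    live_pids.add(os.getpid())
    with open(pid_file, 'w') as f:
        for pid in sorted(live_pids):
            f.write('%d\n' % pid)
    return True
```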

Have you tested this? If so, how?

Unit tests have been added for the two behavioural changes. I verified that they didn't pass before.

Furthermore, I'll let this run for me for a few days before merging, just in case.
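
To illustrate the kind of check involved, a test of the pruning behaviour could look roughly like the following (a hypothetical sketch against the acquire_local_lock helper sketched above, not the actual tests added in this PR):

```python
# Hypothetical test sketch; assumes the acquire_local_lock helper from above.
import os
import tempfile


def test_dead_pids_are_pruned():
    pid_file = os.path.join(tempfile.mkdtemp(), 'some_command.pid')
    with open(pid_file, 'w') as f:
        f.write('99999999\n')  # almost certainly not a live PID

    # The dead entry must not count towards the limit...
    assert acquire_local_lock(pid_file, num_available=1)

    # ...and must be gone once the file has been rewritten.
    with open(pid_file) as f:
        contents = f.read()
    assert '99999999' not in contents
    assert str(os.getpid()) in contents
```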

@mention-bot

@Tarrasch, thanks for your PR! By analyzing the annotation information on this pull request, we identified @daveFNbuck, @erikbern and @gpoulin to be potential reviewers

@Tarrasch (Contributor, Author) commented Sep 6, 2016

@daveFNbuck, can you review this? At least the part regarding the changed behaviour of --take-lock. I don't know exactly how you intended it, but it felt immature and unstable to me that you can seemingly launch an unlimited number of those tasks without ever hitting the file-system lock. Now there's a special error message and a sane default of 1 extra allowed process. That sounds sane to me at least. Let me know if you agree. ^^

@daveFNbuck (Contributor) commented:

I intended --take-lock to be used to launch new assistants. Using a continuous deployment with my CI, I would launch a new assistant using --take-lock with each deploy. The old assistants would receive SIGUSR1 and stop requesting new jobs but would still finish their current work. That way, I'd always be running the latest code when I start a new job but I wouldn't be constantly killing my running jobs with each deploy.

This ended up not working very well for me and I haven't used it for a while. The main problem was that it was hard to guarantee that I'd have any assistants running at all. If the running assistant died for any reason, nothing would run until the next deploy. This could cause big problems if it happened overnight or on the weekend. Your change would make this much more likely to happen, even if I also used a cron job: a couple of long-running jobs on each assistant and nothing new would ever launch.

I'm ok with removing --take-lock entirely since I don't see a good way to fix it, but there's a small chance others are using it. In what way are you using it?

I'm currently trying to run lots of 1-worker assistants with supervisor. With each deploy, I send SIGUSR1 to all of them and the supervisor ensures that they come back up. This seems to be more reliable so far but I still have some issues I haven't figured out yet.

@Tarrasch (Contributor, Author) commented Sep 7, 2016

Thanks for sharing your experience, @daveFNbuck. I think I understand your issues and the approaches you've tried. What do you think of the following setup, assuming the behaviour introduced in this PR?

What about hourly cronning luigi --assistant --take-lock --lock-size 4 --workers 10? It feels very robust to me, as there's no need to rely on events. I see three drawbacks:

  1. The new code could take up to an additional hour to start after a deploy (you can combine asynchronous and synchronous launches to remedy this).
  2. The second problem seems to be what you mentioned before: if you have jobs that can take 4-5 hours to complete, you could theoretically (and maybe also practically) end up with 5 assistants, each running 1 never-ending job.
  3. Normally I would expect about 0-20 jobs to be running, but it's very well possible to have 50 jobs (5 assistants times 10 workers) running at the same time, if jobs live really long.

Whatever we do, I really don't like the old behaviour, as it seems like "everything" changes when you supply --take-lock. If you still like the old behaviour, you can get it back with --take-lock --lock-size 10000, but with this patch we're kind of hinting that a +1 to the lock-size is a sane default when using --take-lock.
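
Put differently, the semantics this patch hints at boil down to something like this (an illustrative sketch, not the actual luigi code):

```python
def effective_lock_size(lock_size, take_lock):
    """With --take-lock, exactly one additional process may hold the lock."""
    return lock_size + 1 if take_lock else lock_size


# --take-lock --lock-size 4 allows up to 5 concurrent assistants; with
# --workers 10 each, that is the up-to-50-jobs worst case from point 3 above.
assert effective_lock_size(4, take_lock=True) == 5
```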

@daveFNbuck (Contributor) commented:

I don't like the old behavior either, and I don't use --take-lock at all anymore. Whatever you want it to do is fine, because I'm guessing no one else uses it either. I'd recommend trying my setup, where you have supervisor keep 20 assistants alive and send SIGUSR1 to all of them whenever you deploy. I'd certainly appreciate it if others tried to make it work too :)

@Tarrasch (Contributor, Author) commented Sep 7, 2016

If nobody objects (please feel free to object/review), I'll merge this early next week :) (I'm off for 2 days)

@Tarrasch Tarrasch merged commit 2d0e4a1 into spotify:master Sep 12, 2016
@Tarrasch Tarrasch deleted the locking-behavior branch January 11, 2017 04:10