
Update local locking logic #1842

Merged: 2 commits merged into spotify:master from locking-behavior on Sep 12, 2016

Conversation

@Tarrasch (Contributor) commented Sep 6, 2016

Motivation and Context

This patch addresses some issues with local locking, that is, the locking
that deals with the .pid files typically found in each machine's
/tmp/luigi folder.

There's one bugfix (it can massively improve startup performance):

  • Don't keep old entries in the .pid files

There's one change in logic:

  • When running with --take-lock, only allow one "additional" process to
    be run; the previous behavior was that --take-lock tasks could always
    keep running. With this new behavior, it could perhaps become possible
    to always cron your command lines with --take-lock.

Lastly, the code has been heavily refactored. Although it is now longer, I
find its readability has drastically improved.
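
To make the mechanism concrete, here is a rough, self-contained sketch of pid-file locking of this kind. The names (acquire_local_lock, _is_running) and the structure are illustrative only and do not mirror the actual luigi.lock code; the sketch just shows the two behaviours described above: stale entries are pruned on every acquisition, and the --take-lock path gets exactly one extra slot.

```python
# Illustrative sketch -- hypothetical names, not the actual luigi.lock module.
import os


def _is_running(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence/permissions
        return True
    except OSError:
        return False


def acquire_local_lock(pid_file, num_available=1, kill_signal=None):
    """Try to acquire the per-command lock backed by a shared .pid file.

    * Entries for dead processes are dropped when the file is rewritten,
      so it no longer accumulates stale PIDs (the bugfix).
    * If kill_signal is given (the --take-lock case), the signal is sent
      to the processes already holding the lock and exactly one extra
      slot is allowed on top of num_available (the behaviour change).
    """
    live_pids = set()
    if os.path.exists(pid_file):
        with open(pid_file) as f:
            for line in f:
                line = line.strip()
                if line.isdigit() and _is_running(int(line)):
                    live_pids.add(int(line))

    if kill_signal is not None:
        for pid in live_pids:
            os.kill(pid, kill_signal)  # e.g. signal.SIGUSR1: stop taking new work
        num_available += 1             # allow exactly one extra "taker"

    if len(live_pids) >= num_available:
        return False  # enough processes of this command are already running

    # Rewrite the file with only the live PIDs plus our own, pruning old entries.
    live_pids.add(os.getpid())
    with open(pid_file, 'w') as f:
        for pid in sorted(live_pids):
            f.write('%d\n' % pid)
    return True
```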

Have you tested this? If so, how?

Unit tests have been added for the two behavioural changes. I verified that they didn't pass before.

Furthermore, I'll let this run for me for a few days before merging, just in case.
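
To illustrate the kind of check involved, a test of the pruning behaviour could look roughly like the following (a hypothetical sketch against the acquire_local_lock helper sketched above, not the actual tests added in this PR):

```python
# Hypothetical test sketch; assumes the acquire_local_lock helper from above.
import os
import tempfile


def test_dead_pids_are_pruned():
    pid_file = os.path.join(tempfile.mkdtemp(), 'some_command.pid')
    with open(pid_file, 'w') as f:
        f.write('99999999\n')  # almost certainly not a live PID

    # The dead entry must not count towards the limit...
    assert acquire_local_lock(pid_file, num_available=1)

    # ...and must be gone once the file has been rewritten.
    with open(pid_file) as f:
        contents = f.read()
    assert '99999999' not in contents
    assert str(os.getpid()) in contents
```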

@mention-bot

@Tarrasch, thanks for your PR! By analyzing the annotation information on this pull request, we identified @daveFNbuck, @erikbern and @gpoulin to be potential reviewers

@Tarrasch (Contributor, Author) commented Sep 6, 2016

@daveFNbuck, can you review this? At least the part regarding the changed behaviour of --take-lock. I don't know exactly how you intended it, but it felt immature and unstable to me that you can seemingly launch an unlimited number of those tasks without ever hitting the file-system lock. Now there's a special error message and a sane default of 1 extra allowed process. That sounds sane to me at least. Let me know if you agree. ^^

@daveFNbuck (Contributor) commented:

I intended --take-lock to be used to launch new assistants. Using a continuous deployment with my CI, I would launch a new assistant using --take-lock with each deploy. The old assistants would receive SIGUSR1 and stop requesting new jobs but would still finish their current work. That way, I'd always be running the latest code when I start a new job but I wouldn't be constantly killing my running jobs with each deploy.

This ended up not working very well for me and I haven't used it for a while. The main problem was that it was hard to guarantee that I'd have any assistants running at all. If the running assistant died for any reason, nothing would run until the next deploy. This could cause big problems if it happened overnight or on the weekend. Your change would make this much more likely to happen, even if I also used a cron job: a couple of long-running jobs on each assistant and nothing new would ever launch.

I'm ok with removing --take-lock entirely since I don't see a good way to fix it, but there's a small chance others are using it. In what way are you using it?

I'm currently trying to run lots of 1-worker assistants with supervisor. With each deploy, I send SIGUSR1 to all of them and the supervisor ensures that they come back up. This seems to be more reliable so far but I still have some issues I haven't figured out yet.

@Tarrasch (Contributor, Author) commented Sep 7, 2016

Thanks for sharing your experience, @daveFNbuck. I think I understand your issues and the approaches you've tried. What do you think of the following setup, assuming the behaviour introduced in this PR?

What about hourly cronning luigi --assistant --take-lock --lock-size 4 --workers 10? It feels very robust to me, as there's no need to rely on events. I see three drawbacks:

  1. The new code could take up to an additional hour to start after a deploy (you can combine asynchronous and synchronous launches to remedy this).
  2. The second problem seems to be what you mentioned before: if you have jobs that can take 4-5 hours to complete, you could theoretically (and maybe also practically) end up with 5 assistants, each running 1 never-ending job.
  3. Normally I would expect about 0-20 jobs to be running, but it's very well possible to have 50 jobs (5 assistants times 10 workers) running at the same time, if jobs live really long.

Whatever we do, I really don't like the old behaviour, as it seems like "everything" changes when you supply --take-lock. If you still like the old behaviour, you can get it back with --take-lock --lock-size 10000, but with this patch we're kind of hinting that a +1 to the lock-size is a sane default when using --take-lock.
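
Put differently, the semantics this patch hints at boil down to something like this (an illustrative sketch, not the actual luigi code):

```python
def effective_lock_size(lock_size, take_lock):
    """With --take-lock, exactly one additional process may hold the lock."""
    return lock_size + 1 if take_lock else lock_size


# --take-lock --lock-size 4 allows up to 5 concurrent assistants; with
# --workers 10 each, that is the up-to-50-jobs worst case from point 3 above.
assert effective_lock_size(4, take_lock=True) == 5
```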

@daveFNbuck (Contributor) commented:

I don't like the old behavior either, and I don't use --take-lock at all anymore. Whatever you want it to do is fine, because I'm guessing no one else uses it either. I'd recommend trying my setup, where you have supervisor keep 20 assistants alive and send SIGUSR1 to all of them whenever you deploy. I'd certainly appreciate it if others tried to make it work too :)

@Tarrasch (Contributor, Author) commented Sep 7, 2016

If nobody objects (please feel free to object/review), I'll merge this early next week :) (I'm off for 2 days)

@Tarrasch Tarrasch merged commit 2d0e4a1 into spotify:master Sep 12, 2016
@Tarrasch Tarrasch deleted the locking-behavior branch January 11, 2017 04:10