Pipeline can fail if segmentation server already running #1199

Open
spigo900 opened this issue Oct 18, 2022 · 1 comment

Comments

@spigo900
Collaborator

Currently, when object segmentation is enabled, the pipeline script sets up dependencies between jobs so that everything fails if the segmentation server can't bind to a port. Unfortunately, this means that if you run multiple experiments with STEGO in parallel, they trip over each other at the server step. One of the experiments will start the server successfully and all the others will fail to. The successful experiment will then cancel the server as soon as it is finished with it, even though other jobs may still need the server (and may be actively using it).

This is a problem because it makes it harder to run such experiments in parallel.

A dumb solution is to just apply --dependency=singleton to the server job. This should prevent multiple "start server" jobs from running at one time. However, this has the disadvantage that the entire rest of the experiment has to wait on "that experiment's" server start job to run, meaning you can't interleave jobs from different experiments even though in theory their segmentation jobs could share a server.
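As a rough sketch of the singleton idea, the server submission could be built like this. The job name and script path are hypothetical placeholders, not names from the pipeline. Slurm's singleton dependency only holds a job back while another job with the same job name and the same user is running, so the sketch assumes every experiment submits its server job under one shared name.

```python
import shlex

# Hypothetical shared job name. Singleton matching is by job name and user,
# so every experiment would have to submit its server job under this name.
SERVER_JOB_NAME = "segmentation-server"


def build_server_sbatch_command(script_path, job_name=SERVER_JOB_NAME):
    """Return an sbatch argument list that submits the server job as a singleton."""
    return [
        "sbatch",
        f"--job-name={job_name}",
        "--dependency=singleton",
        script_path,
    ]


if __name__ == "__main__":
    # Print the command instead of submitting it; the script path is a placeholder.
    print(shlex.join(build_server_sbatch_command("start_segmentation_server.sh")))
```

Since the match is per user, this would not serialize server jobs submitted by two different users.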

This dumb solution is currently the best I've thought of, and I think it's good enough for now. Below I've sketched some (messy) thoughts on what a better solution would need. Given the change-over issue described there, I don't think a better solution is worth the effort.

Messy thoughts on a better solution

Here are some thoughts about what we'd need for a better solution:

  1. A smarter solution would first have to check if the server is reachable, not if a particular job is running.
    1. This should be relatively straightforward -- rather than having the segmentation job depend on the server's Slurm job, have it run curl in a loop: if the request fails, sleep for a minute or two, and retry up to, say, 5 times.
  2. But second, we probably don't want to run the server 24/7, so we need to clean it up somehow.
    1. Waiting on Slurm timeout (23 hours) would be fine except that you may have queued up e.g. 8 jobs if you're launching 8 parallel experiments, which means it'll be a long time before those jobs all collectively run and time out.
    2. Here's a hacky idea: we could queue an scancel job delayed to run in 24 hours (using --begin=now+24hours) that cancels all currently queued server jobs (as gathered using squeue). This might work OK combined with the singleton option, since that way we're guaranteed at least one server should be running.
  3. But no matter how we clean up the server, at some point one server job must end and another must start. This is likely to be a pain if a segmentation job is running during the "change-over" between server jobs.
    1. If we really wanted to do this, we'd have to modify the Python script that does segmentation so that it can handle the server going down for up to a few minutes at least, and potentially longer, depending on how lucky we are with Slurm allocations.
    2. Meanwhile, I am reasonably confident we don't have to worry about changeover if the server job started around the same time as the segmentation job, as it would have to if we're just adding singleton on top of the existing setup. M5 objects train is 600 images and it only takes 6 hours, and I think the new curricula we're expecting are no larger than that. So we should be well within the time limit for the server job.
    3. So: Probably the singleton approach is best for now.
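
For reference, the reachability check from item 1 and the delayed cleanup from item 2 could look roughly like the sketch below. This is not what the pipeline does. All function names and default values are made up for illustration, it uses Python's urllib in place of curl, and the cleanup selects jobs with scancel's --name and --user filters instead of gathering them with squeue.

```python
import shlex
import time
import urllib.error
import urllib.request


def server_is_reachable(url, timeout_seconds=10.0):
    """Return True if an HTTP request to `url` gets any response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status, so it is up.
        return True
    except OSError:
        # URLError (connection refused, DNS failure) and socket timeouts
        # are all OSError subclasses.
        return False


def wait_for_server(url, max_attempts=5, delay_seconds=90.0,
                    probe=server_is_reachable, sleep=time.sleep):
    """Probe `url` up to `max_attempts` times, sleeping between failed attempts."""
    for attempt in range(1, max_attempts + 1):
        if probe(url):
            return True
        if attempt < max_attempts:
            sleep(delay_seconds)
    return False


def build_delayed_cleanup_command(job_name, user, begin="now+24hours"):
    """Return an sbatch argument list for a job that cancels the server jobs later.

    Assumption: all server jobs share `job_name`, so scancel can select them
    by name and user when the delayed job finally runs.
    """
    wrapped = f"scancel --name={shlex.quote(job_name)} --user={shlex.quote(user)}"
    return ["sbatch", f"--begin={begin}", f"--wrap={wrapped}"]
```

The segmentation job would call something like `wait_for_server(server_url)` before sending any images and exit with an error if it returns False. That still does nothing for the change-over problem in item 3, which is why the singleton approach looks better for now.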
@lichtefeld
Collaborator

I agree that the singleton is currently our best approach.

spigo900 added a commit that referenced this issue Oct 19, 2022
Pipeline: Run segmentation server as singleton (#1199)