Sensible defaults for submitJobs()
#84
Comments
Well, first of all, all the "errors" we are talking about here are of the "soft / temporary" kind; in "hard" cases we always abort. The main problem is that SSH is a bit special here: "worker busy" can happen very often. OTOH, what you want can easily be configured with "wait", "max.retries" and "job.delay".
This would be a good start, but it wouldn't solve the problem that you don't know how long a job will take to complete, and the timeout needs to be higher than that. I would expect the command to submit all jobs, just like other batch job management systems.
But we are not on a batch system? We try our best with SSH, but we cannot work magic. And you don't need any calculation? Set max.retries to Inf and wait to a constant of e.g. 10 secs?
I mean, this is "basic polling", and we don't know a priori how long a worker is still going to work on a certain job?
I am not sure what you mean by "time.out", though? The max.retries?
The combination of max.retries and the wait function. OK, setting the former to Inf would solve the problem, so if that could be specified in the settings file it would be great.
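To put numbers on "the combination of max.retries and the wait function", here is a rough sketch in Python (not the package's R code). The helper `combined_wait`, the 1-based retry counter, and the example values are assumptions made for illustration only; the sketch just adds up the waits that would elapse if every retry failed.

```python
import math
from typing import Callable


def combined_wait(wait: Callable[[int], float], max_retries: float) -> float:
    """Total seconds spent waiting before giving up, if every retry fails.

    `wait` maps the retry counter to a delay in seconds (hypothetical signature).
    """
    if math.isinf(max_retries):
        # No cap on retries: polling continues until the submit succeeds.
        return math.inf
    return sum(wait(r) for r in range(1, int(max_retries) + 1))


def constant(retries: int) -> float:
    # Constant 10 second wait, as suggested in the thread.
    return 10.0


print(combined_wait(constant, 10))        # 100.0
print(combined_wait(constant, math.inf))  # inf
```

With a finite cap, any job that keeps the workers busy for longer than this total causes the submit to give up; with max.retries set to Inf there is no such budget.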
That was my idea. (What I don't want by default in the package is an infinite process that spams computers.)
OK, fair enough. Although I guess you could find out whether a worker is busy because it's running other jobs, as opposed to a worker that's just busy because of external influences.
Yes, IIRC we have that already.
Would it make sense to give those different error codes?
submitJobs() has a maximum number of retries for submit errors and a function to determine the wait time between them. The problem is that "worker busy" counts as such an error. This means that if the first (n - |workers|) jobs take longer than the combined wait time, the function will exit with an error even though there is nothing actually wrong. The final jobs won't be submitted in this case.
It would be good to have the default be "wait until all jobs are submitted unless there are actual errors". At the moment, if the jobs take a long time to complete, I have to go back and resubmit manually.
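The requested default could look roughly like the following Python sketch. It is not how the package implements submitJobs(); the names `SubmitStatus`, `submit_all`, and `try_submit` are hypothetical, and the only point is that a "busy" result waits and polls again without using up a retry, while other soft errors still count against the cap and hard errors abort at once.

```python
import time
from enum import Enum
from typing import Callable, Iterable, List


class SubmitStatus(Enum):
    OK = 0
    BUSY = 1        # temporary: all workers are occupied
    SOFT_ERROR = 2  # temporary: some other transient failure
    HARD_ERROR = 3  # fatal: abort immediately


def submit_all(
    jobs: Iterable[int],
    try_submit: Callable[[int], SubmitStatus],
    wait: Callable[[int], float],
    max_retries: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Submit every job; only SOFT_ERROR results consume retries."""
    submitted: List[int] = []
    for job in jobs:
        retries = 0
        while True:
            status = try_submit(job)
            if status is SubmitStatus.OK:
                submitted.append(job)
                break
            if status is SubmitStatus.HARD_ERROR:
                raise RuntimeError(f"fatal error while submitting job {job}")
            if status is SubmitStatus.SOFT_ERROR:
                retries += 1
                if retries > max_retries:
                    raise RuntimeError(f"too many retries for job {job}")
            # BUSY reaches this point without incrementing the retry counter.
            sleep(wait(retries))
    return submitted


# Simulated worker pool (assumption): busy for the first five polls.
busy_left = {"n": 5}


def fake_submit(job: int) -> SubmitStatus:
    if busy_left["n"] > 0:
        busy_left["n"] -= 1
        return SubmitStatus.BUSY
    return SubmitStatus.OK


result = submit_all([1, 2, 3], fake_submit, wait=lambda r: 10,
                    max_retries=2, sleep=lambda s: None)
print(result)  # [1, 2, 3]
```

Even with a cap of two retries, the five "busy" polls do not cause a failure in this sketch, which matches the behaviour asked for above: wait until all jobs are submitted unless there are actual errors.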