Always relaunch backlogged inactive workers #79
Comments
Ah, the other change could have been how I was testing it. Previously, I was focused on idle time. The test above sets infinite idle time but max tasks equal to 100, which is a better way to expose this remaining problem.
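For illustration, here is a minimal sketch of that kind of test configuration, assuming crew's `crew_controller_local()` with its `seconds_idle` and `tasks_max` arguments (the worker count is arbitrary):

```r
library(crew)

# Workers never exit from idleness (seconds_idle = Inf) but rotate after
# 100 tasks (tasks_max = 100), so a backlogged worker cannot simply age
# out and relaunch on its own; it must be relaunched explicitly.
controller <- crew_controller_local(
  workers = 4L,
  seconds_idle = Inf,
  tasks_max = 100
)
```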
Tomorrow, I will tackle this one and #78. Maybe there is something I can do for #77, but I am still scratching my head on that one. The latest at https://github.com/ropensci/targets/actions/runs/5009290193 shows an "object closed" error from a task, as well as a 360-second timeout in a different test.
Yes, that's right. I didn't emphasize it at the time, but it's what I meant by "always launch at least those servers" here: #75 (comment)
Yes, I experienced this delay as well, which I thought a bit odd. But with the throttling implementation, this is now pretty much instant, at least for the previous tests.
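For context, throttling here means rate-limiting how often scaling work runs. A minimal sketch of the general idea in R, not crew's actual implementation:

```r
# Skip a pass unless enough time has elapsed since the last one, so
# rapid-fire calls collapse into occasional real work (illustrative only).
make_throttled <- function(action, seconds_interval = 0.5) {
  last <- -Inf
  function(...) {
    now <- unname(proc.time()["elapsed"])
    if (now - last >= seconds_interval) {
      last <<- now
      action(...)
    }
  }
}

# Example: wrap a scaling call so it runs at most twice per second.
# scale_throttled <- make_throttled(function() controller$scale())
```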
Now fixed.
For #124
It is not enough to simply prioritize backlogged workers above non-backlogged ones in `controller$scale()`. Regardless of the task load, `controller$try_launch()` must always try to re-launch inactive backlogged workers, as determined by `inactive & (controller$router$assigned > controller$router$complete)`. The following test is failing because `controller$try_launch()` is not aggressive enough on this point. About 8-10 tasks hang because the workers they are assigned to don't make it back to launch. This happens to the trailing tasks at the end of a pipeline: by then the task load is too low, so there is no longer explicit demand for those workers given the presence of the other online workers. (Those online workers happen to be the wrong workers for the assigned tasks at that point.) When I tried manually relaunching just the correct workers, the stuck tasks completed.
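As a reference point, the backlog condition above can be read as a small predicate. A sketch using the field names quoted in this issue (not necessarily the actual crew source):

```r
# A worker is backlogged when it is inactive yet was assigned more tasks
# than it completed. Field names follow this issue, not crew's internals.
is_backlogged <- function(controller, inactive) {
  inactive & (controller$router$assigned > controller$router$complete)
}
```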
I think this is a recurrence of #75, though it's odd that I am seeing overt hanging now, as opposed to slowness as those once-stuck tasks get reassigned. (The slowness before was because earlier versions of this test capped idle time, which let workers rotate more easily, so backlogged workers could percolate to the top for relaunch.) The only changes were throttling in `crew`, which shouldn't affect this, and, in `mirai`, shikokuchuo/mirai@e270057. That doesn't necessarily point to an issue in `mirai`, but it is odd. (Those changes don't matter for this issue, only the change in my test, which exposed more of #75.)
Anyway, the solution to this issue will be good to have. Right after @shikokuchuo solved #75, the trailing hanging tasks did end up completing, but only after several long seconds of delay. I think I can eliminate that delay entirely by more aggressively launching backlogged inactive workers on `crew`'s end.
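One way to make the launch pass aggressive enough, sketched under the same assumptions as above (`launch(index = ...)` is an assumed method here, not crew's documented API): relaunch every backlogged inactive worker unconditionally, and let only the rest scale with demand.

```r
# Hypothetical launch pass: backlogged inactive workers go first,
# regardless of task load; demand-based scaling applies to the rest.
try_launch_aggressive <- function(controller, inactive, demand) {
  backlogged <- inactive &
    (controller$router$assigned > controller$router$complete)
  for (index in which(backlogged)) {
    controller$launch(index = index) # always relaunch backlogged workers
  }
  idle <- which(inactive & !backlogged)
  for (index in head(idle, n = demand)) {
    controller$launch(index = index) # scale the rest with current demand
  }
}
```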