
Always relaunch backlogged inactive workers #79

Closed · wlandau opened this issue May 18, 2023 · 5 comments

wlandau (Owner) commented May 18, 2023

It is not enough to simply prioritize backlogged workers above non-backlogged ones in controller$scale(). Regardless of the task load, controller$try_launch() must always try to relaunch inactive backlogged workers, as determined by inactive & (controller$router$assigned > controller$router$complete). The test below fails because controller$try_launch() is not aggressive enough on this point. About 8-10 tasks hang because the workers they are assigned to never get relaunched. This happens to the trailing tasks at the end of a pipeline: by then the task load is too low to create explicit demand for those workers, because other workers are still online. (Those online workers happen to be the wrong workers for the assigned tasks at that point.) When I manually relaunched just the correct workers, the stuck tasks completed.
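
A minimal sketch of that relaunch condition (the helper and the worker statistics below are illustrative placeholders, not crew's actual internals):

# Hypothetical helper: which workers must always be relaunched?
backlogged_inactive <- function(inactive, assigned, complete) {
  # A worker is backlogged if it was assigned more tasks than it completed.
  backlogged <- assigned > complete
  # Backlogged inactive workers must relaunch regardless of task load.
  which(inactive & backlogged)
}

# Example with 4 workers: worker 3 is offline with unfinished tasks.
backlogged_inactive(
  inactive = c(FALSE, TRUE, TRUE, FALSE),
  assigned = c(10L, 8L, 9L, 7L),
  complete = c(10L, 8L, 6L, 7L)
)
#> [1] 3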

I think this is a recurrence of #75, though it's odd that I am seeing overt hanging now, as opposed to slowness as those once-stuck tasks get reassigned. (The slowness back then was because earlier versions of this test capped idle time, which let workers rotate more easily, so backlogged workers could percolate to the top for relaunch.) The only recent changes were throttling in crew, which shouldn't affect this, and mirai shikokuchuo/mirai@e270057. That doesn't necessarily point to an issue in mirai, but it is odd. (In the end, those changes don't matter for this issue; only the change in my test matters, which exposed more of #75.)

In any case, the solution to this issue will be good to have. Right after @shikokuchuo solved #75, the trailing hanging tasks did eventually complete, but only after several long seconds of delay. I think I can eliminate that delay entirely by more aggressively launching backlogged inactive workers on crew's end.

library(crew)
controller <- crew_controller_local(
  workers = 20L,
  tasks_max = 100,   # rotate workers by task count instead of idle time
  seconds_exit = 10
)
controller$start()
names <- character(0L)
index <- 0L
n_tasks <- 6000L
# Keep submitting until all 6000 tasks are pushed, then drain the controller.
while (index < n_tasks || !(controller$empty())) {
  if (index < n_tasks) {
    index <- index + 1L
    cat("submit", index, "\n")
    controller$push(
      name = as.character(index),
      command = TRUE
    )
  }
  out <- controller$pop()  # try to collect one completed task
  if (!is.null(out)) {
    cat("collect", out$name, "\n")
    names[[length(names) + 1L]] <- out$name
  }
}

# unresolved tasks
lapply(controller$queue, function(x) x$handle[[1L]]$data)

# daemons
controller$router$poll()
controller$router$daemons

controller$terminate()
wlandau self-assigned this May 18, 2023
wlandau (Owner, Author) commented May 18, 2023

> I think this is a recurrence of #75, though it's odd that I am seeing overt hanging now, as opposed to slowness as those once-stuck tasks get reassigned. The only changes were throttling in crew, which shouldn't affect this, and mirai shikokuchuo/mirai@e270057. Doesn't necessarily point to an issue in mirai, but it is odd.

Ah, the other change could have been how I was testing it. Previously, I was focused on idle time. The test above sets infinite idle time but caps tasks per worker at 100, which is a better way to expose this remaining problem in crew's auto-scaling logic.
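
To illustrate the difference, a hedged sketch of the two configurations (seconds_idle is an argument the local controller accepts in the versions discussed here; the values are arbitrary):

library(crew)
# Earlier versions of this test capped idle time, e.g.:
# controller <- crew_controller_local(workers = 20L, seconds_idle = 2)
# A low idle timeout rotates workers out on its own, so backlogged workers
# eventually percolate back up for relaunch and the hang stays hidden.
# The test above instead leaves idle time infinite and caps tasks per worker,
# which keeps the "wrong" workers online and exposes the missing relaunch:
controller <- crew_controller_local(workers = 20L, tasks_max = 100)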

wlandau (Owner, Author) commented May 18, 2023

Tomorrow, I will tackle this one and #78. Maybe there is something I can do for #77, but I am still scratching my head on that one. The latest run at https://github.com/ropensci/targets/actions/runs/5009290193 shows an "object closed" error from a task, as well as a 360-second timeout in a different test.

shikokuchuo (Contributor) commented

> It is not enough to simply prioritize backlogged workers above non-backlogged ones in controller$scale(). Regardless of the task load, controller$try_launch() must always try to relaunch inactive backlogged workers, as determined by inactive & (controller$router$assigned > controller$router$complete). [...]

Yes, that's right. I didn't emphasize it at the time, but it's what I meant by "always launch at least those servers" here: #75 (comment)

> In any case, the solution to this issue will be good to have. Right after @shikokuchuo solved #75, the trailing hanging tasks did eventually complete, but only after several long seconds of delay. I think I can eliminate that delay entirely by more aggressively launching backlogged inactive workers on crew's end.

Yes, I experienced this delay as well, which I thought was a bit odd. But with the throttling implementation, this is now pretty much instant, at least for the previous tests.

wlandau pushed a commit that referenced this issue May 18, 2023
wlandau pushed a commit that referenced this issue May 18, 2023
wlandau (Owner, Author) commented May 18, 2023

Now fixed.

wlandau (Owner, Author) commented Jan 23, 2024

For #124, crew stopped relaunching backlogged workers, but from the discussion in shikokuchuo/mirai#95, we need to return to relaunching them. Fortunately, the assigned and complete statistics from mirai are cumulative this time around, so this attempt should be straightforward.
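
As a rough sketch of why cumulative statistics make this simpler (the vectors are made-up examples, not real mirai output):

# Cumulative counts per worker: the backlog never resets, so a simple
# difference identifies workers that still owe tasks.
assigned <- c(120L, 95L, 100L)  # tasks assigned over each worker's lifetime
complete <- c(120L, 95L, 97L)   # tasks completed over each worker's lifetime
backlog <- assigned - complete
which(backlog > 0L)  # workers to relaunch if they are inactive
#> [1] 3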

wlandau reopened this Jan 23, 2024
wlandau-lilly pushed a commit that referenced this issue Jan 23, 2024