
Always relaunch backlogged inactive workers #79

Closed · wlandau opened this issue May 18, 2023 · 5 comments

wlandau (Owner) commented May 18, 2023

It is not enough to simply prioritize backlogged workers above non-backlogged ones in controller$scale(). Regardless of the task load, controller$try_launch() must always try to relaunch inactive backlogged workers, as determined by inactive & (controller$router$assigned > controller$router$complete). The test below fails because controller$try_launch() is not aggressive enough on this point. About 8-10 tasks hang because the workers they are assigned to never get relaunched. This happens to the trailing tasks at the end of a pipeline: by then the task load is too low to create explicit demand for those workers, because other workers are still online. (Those online workers happen to be the wrong workers for the assigned tasks at that point.) When I manually relaunched just the correct workers, the stuck tasks completed.
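
A minimal sketch of that relaunch condition (the helper and the worker statistics below are illustrative placeholders, not crew's actual internals):

# Hypothetical helper: which workers must always be relaunched?
backlogged_inactive <- function(inactive, assigned, complete) {
  # A worker is backlogged if it was assigned more tasks than it completed.
  backlogged <- assigned > complete
  # Backlogged inactive workers must relaunch regardless of task load.
  which(inactive & backlogged)
}

# Example with 4 workers: worker 3 is offline with unfinished tasks.
backlogged_inactive(
  inactive = c(FALSE, TRUE, TRUE, FALSE),
  assigned = c(10L, 8L, 9L, 7L),
  complete = c(10L, 8L, 6L, 7L)
)
#> [1] 3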

I think this is a recurrence of #75, though it's odd that I am seeing overt hanging now, as opposed to slowness as those once-stuck tasks get reassigned. (The slowness back then was because earlier versions of this test capped idle time, which let workers rotate more easily, so backlogged workers could percolate to the top for relaunch.) The only recent changes were throttling in crew, which shouldn't affect this, and mirai shikokuchuo/mirai@e270057. That doesn't necessarily point to an issue in mirai, but it is odd. (In the end, those changes don't matter for this issue; only the change in my test matters, which exposed more of #75.)

In any case, the solution to this issue will be good to have. Right after @shikokuchuo solved #75, the trailing hanging tasks did eventually complete, but only after several long seconds of delay. I think I can eliminate that delay entirely by more aggressively launching backlogged inactive workers on crew's end.

library(crew)
controller <- crew_controller_local(
  workers = 20L,
  tasks_max = 100,   # rotate workers by task count instead of idle time
  seconds_exit = 10
)
controller$start()
names <- character(0L)
index <- 0L
n_tasks <- 6000L
# Keep submitting until all 6000 tasks are pushed, then drain the controller.
while (index < n_tasks || !(controller$empty())) {
  if (index < n_tasks) {
    index <- index + 1L
    cat("submit", index, "\n")
    controller$push(
      name = as.character(index),
      command = TRUE
    )
  }
  out <- controller$pop()  # try to collect one completed task
  if (!is.null(out)) {
    cat("collect", out$name, "\n")
    names[[length(names) + 1L]] <- out$name
  }
}

# unresolved tasks
lapply(controller$queue, function(x) x$handle[[1L]]$data)

# daemons
controller$router$poll()
controller$router$daemons

controller$terminate()
wlandau self-assigned this May 18, 2023
wlandau (Owner, Author) commented May 18, 2023

> I think this is a recurrence of #75, though it's odd that I am seeing overt hanging now, as opposed to slowness as those once-stuck tasks get reassigned. The only changes were throttling in crew, which shouldn't affect this, and mirai shikokuchuo/mirai@e270057. Doesn't necessarily point to an issue in mirai, but it is odd.

Ah, the other change could have been how I was testing it. Previously, I was focused on idle time. The test above sets infinite idle time but caps tasks per worker at 100, which is a better way to expose this remaining problem in crew's auto-scaling logic.
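
To illustrate the difference, a hedged sketch of the two configurations (seconds_idle is an argument the local controller accepts in the versions discussed here; the values are arbitrary):

library(crew)
# Earlier versions of this test capped idle time, e.g.:
# controller <- crew_controller_local(workers = 20L, seconds_idle = 2)
# A low idle timeout rotates workers out on its own, so backlogged workers
# eventually percolate back up for relaunch and the hang stays hidden.
# The test above instead leaves idle time infinite and caps tasks per worker,
# which keeps the "wrong" workers online and exposes the missing relaunch:
controller <- crew_controller_local(workers = 20L, tasks_max = 100)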

wlandau (Owner, Author) commented May 18, 2023

Tomorrow, I will tackle this one and #78. Maybe there is something I can do for #77, but I am still scratching my head on that one. The latest run at https://github.com/ropensci/targets/actions/runs/5009290193 shows an "object closed" error from a task, as well as a 360-second timeout in a different test.

shikokuchuo (Contributor) commented

> It is not enough to simply prioritize backlogged workers above non-backlogged ones in controller$scale(). Regardless of the task load, controller$try_launch() must always try to relaunch inactive backlogged workers, as determined by inactive & (controller$router$assigned > controller$router$complete). [...]

Yes, that's right. I didn't emphasize it at the time, but it's what I meant by "always launch at least those servers" here: #75 (comment)

> In any case, the solution to this issue will be good to have. Right after @shikokuchuo solved #75, the trailing hanging tasks did eventually complete, but only after several long seconds of delay. I think I can eliminate that delay entirely by more aggressively launching backlogged inactive workers on crew's end.

Yes, I experienced this delay as well, which I thought was a bit odd. But with the throttling implementation, this is now pretty much instant, at least for the previous tests.

wlandau pushed a commit that referenced this issue May 18, 2023
wlandau pushed a commit that referenced this issue May 18, 2023
wlandau (Owner, Author) commented May 18, 2023

Now fixed.

wlandau (Owner, Author) commented Jan 23, 2024

For #124, crew stopped relaunching backlogged workers, but from the discussion in shikokuchuo/mirai#95, we need to return to relaunching them. Fortunately, the assigned and complete statistics from mirai are cumulative this time around, so this attempt should be straightforward.
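
As a rough sketch of why cumulative statistics make this simpler (the vectors are made-up examples, not real mirai output):

# Cumulative counts per worker: the backlog never resets, so a simple
# difference identifies workers that still owe tasks.
assigned <- c(120L, 95L, 100L)  # tasks assigned over each worker's lifetime
complete <- c(120L, 95L, 97L)   # tasks completed over each worker's lifetime
backlog <- assigned - complete
which(backlog > 0L)  # workers to relaunch if they are inactive
#> [1] 3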

wlandau reopened this Jan 23, 2024
wlandau-lilly pushed a commit that referenced this issue Jan 23, 2024