
--experimental_output_paths=strip is not effective when actions are scheduled in parallel #21043

Closed
DavidANeil opened this issue Jan 26, 2024 · 8 comments
Labels: P1 (I'll work on this now. Assignee required) · team-Local-Exec (Issues and PRs for the Execution (Local) team) · type: feature request

Comments

@DavidANeil (Contributor) commented Jan 26, 2024

Description of the feature request:

--experimental_output_paths=strip can allow an exec and a target spawn to have the same input digest, and therefore to be cache hits for each other (path mapping strips the configuration segment from output paths, so e.g. both bazel-out/k8-fastbuild/bin/... and the exec configuration's bazel-out/k8-opt-exec-*/bin/... map to bazel-out/bin/...). Unfortunately, because these actions frequently have identical or exactly duplicated input dependency graphs, it isn't uncommon for the two actions to be scheduled at the same time, meaning the work is still duplicated.

I'm not totally certain about the solution, but my first rough idea is that the spawn strategy should, if this flag is enabled, check whether any sibling spawns match the input digest of the spawn that is about to start executing, and if so somehow "join" the spawn results.

Which category does this issue belong to?

Local Execution

What underlying problem are you trying to solve with this feature?

Our host and target platforms are identical, so all duplicated [for tool] actions are pure overhead. This flag seems promising for reducing that overhead, but we still frequently see duplicated actions, since the "exec" and "target" dependency chains tend to stay in sync: when one branch falls behind by scheduling chance, it gets cache hits until it catches back up and starts duplicating execution again.

@fmeum (Collaborator) commented Jan 26, 2024

@coeuvre @tjgq If I understand the logic around RemoteCacheClient correctly, Bazel will not deduplicate inflight requests to the remote cache/executor - which makes sense without path mapping, as an action previously couldn't generate cache hits for other actions in the same build. Is this something you think could be changed easily, e.g. by tracking in-flight requests and returning a Future waiting for the result of the first one?

CC @gregestren
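
For illustration, a minimal sketch of the in-flight deduplication suggested above, assuming a map of futures keyed by the action digest (all class and method names here are hypothetical, not the actual RemoteCacheClient API):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Placeholder for the cache's result type; stands in for the REAPI ActionResult proto.
record ActionResult(String contents) {}

// Hypothetical wrapper: concurrent lookups for the same action digest share one request.
final class DeduplicatingCacheClient {
  private final ConcurrentHashMap<String, CompletableFuture<ActionResult>> inFlight =
      new ConcurrentHashMap<>();

  CompletableFuture<ActionResult> downloadActionResult(String actionDigest) {
    return inFlight.computeIfAbsent(
        actionDigest,
        digest ->
            fetchFromCache(digest)
                // Clear the entry on completion so a later request (e.g. a retry
                // after a failure) issues a fresh call instead of reusing a failed one.
                .whenComplete((result, error) -> inFlight.remove(digest)));
  }

  // Stand-in for the actual gRPC call to the remote cache.
  private CompletableFuture<ActionResult> fetchFromCache(String digest) {
    return CompletableFuture.supplyAsync(() -> new ActionResult("result for " + digest));
  }
}

Removing the map entry on completion keeps the deduplication scoped to truly concurrent requests, which sidesteps any staleness questions.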

@fmeum (Collaborator) commented Jan 26, 2024

To be more concrete, this is the test case that we would want to pass:

function test_path_stripping_cache_hit_for_parallel_action() {
  mkdir rules
  cat > rules/defs.bzl <<'EOF'
def _platforms_split_transition_impl(settings, attr):
    return [
        {"//command_line_option:platforms": "//rules:foo"},
        {"//command_line_option:platforms": "//rules:bar"},
    ]

_platforms_split_transition = transition(
    implementation = _platforms_split_transition_impl,
    inputs = [],
    outputs = ["//command_line_option:platforms"],
)

def _echo_impl(ctx):
    out = ctx.actions.declare_file(ctx.label.name)
    args = ctx.actions.args()
    args.add(out)
    ctx.actions.run_shell(
        outputs = [out],
        arguments = [args],
        command = "sleep 2 && echo '{}' > $1".format(ctx.attr.msg, out.path),
        execution_requirements = {"supports-path-mapping": "1"},
    )
    return [
        DefaultInfo(files = depset([out])),
    ]

echo = rule(
    _echo_impl,
    attrs = {
        "msg": attr.string(),
    },
)

def _cat_impl(ctx):
    out = ctx.actions.declare_file(ctx.label.name)
    inputs = [file for src in ctx.attr.src for file in src[DefaultInfo].files.to_list()]
    ctx.actions.run_shell(
        inputs = inputs,
        outputs = [out],
        command = "cat {} > {}".format(" ".join([f.path for f in inputs]), out.path),
    )
    return [
        DefaultInfo(files = depset([out])),
    ]

cat = rule(
    _cat_impl,
    attrs = {
        "src": attr.label(cfg = _platforms_split_transition),
    },
)
EOF
  cat > rules/BUILD << 'EOF'
platform(name = "foo", parents = ["@local_config_platform//:host"])
platform(name = "bar", parents = ["@local_config_platform//:host"])
EOF

  mkdir -p pkg
  cat > pkg/BUILD <<'EOF'
load("//rules:defs.bzl", "cat", "echo")

echo(
    name = "gen_src",
    msg = "Hello, Cache!",
)

cat(
    name = "cat",
    src = ":gen_src",
)
EOF

  bazel build \
    --experimental_output_paths=strip \
    --remote_executor=grpc://localhost:${worker_port} \
    //pkg:cat &> $TEST_log || fail "build failed unexpectedly"
  expect_log '2 remote[^ ]'
  expect_log '1 remote cache hit'
}
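
(The two echo spawns differ only in the platform they are built for, so with path stripping they share an action digest. The assertions then expect two remote executions, namely one of the echo actions plus the cat action, and a remote cache hit for the other echo, even though both echo spawns are scheduled within the same build.)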

@tjgq (Contributor) commented Jan 26, 2024
tjgq commented Jan 26, 2024

> @coeuvre @tjgq If I understand the logic around RemoteCacheClient correctly, Bazel will not deduplicate inflight requests to the remote cache/executor - which makes sense without path mapping, as an action previously couldn't generate cache hits for other actions in the same build. Is this something you think could be changed easily, e.g. by tracking in-flight requests and returning a Future waiting for the result of the first one?

I think this makes sense, yes.

iancha1992 added the team-Local-Exec label Jan 26, 2024
@tjgq (Contributor) commented Jan 27, 2024
tjgq commented Jan 27, 2024

Thinking some more about it: is this really worth doing if we assume that the remote executor is able to deduplicate identical concurrent requests on its own? The spec doesn't mandate it, but it does hint at the possibility, and I'd expect a well-optimized implementation to do so. (It's at least true of Google's internal equivalent of remote execution.)

On the Bazel side, we can't really save any pre-execution work (we need to calculate the action digest anyway to detect duplicates) or any post-execution work (we don't have the ability to deduplicate downloads of the same blob into many output locations, and even if we did, it would only matter for actions "built with the bytes"). So I think the gains would be solely in reduced (non-blob) network traffic, which I'm not sure is worth the extra complexity.

@fmeum (Collaborator) commented Jan 27, 2024

That's a good point! Since deduplication within a single build is really just a special case of general deduplication across concurrent builds for a remote executor, it's better positioned to do this than Bazel.

However, this still leaves the case of local execution with a remote (or disk) cache. In this situation, I don't see how it would be possible for the cache to transparently deduplicate requests, given that it doesn't have control over whether and when the results of an action that incurred a cache miss will be uploaded.

@tjgq Do you maybe see a way to push this kind of deduplication into the cache? If not, would you consider that to be a sensible addition?

@DavidANeil I just noticed the "Local execution" category you chose for the issue. Are you using a remote cache together with local execution?

@tjgq (Contributor) commented Jan 28, 2024

For local execution, I think the rough outline of "keep a map of futures for in-flight executions, resolve the future on action upload, block on the future when downloading an identical action" might be doable. But it's the sort of thing that I'd have to write to convince myself (for example, it's a little unclear to me what the failure path would look like).
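
For illustration, a rough sketch of that outline (all names are hypothetical stand-ins for Bazel internals; on the failure path, the future is completed exceptionally and the map entry is cleared so a later identical spawn does not latch onto a failed execution):

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical coordinator: the first spawn per action key executes and uploads;
// identical concurrent spawns block on its future and then reuse its outputs.
final class LocalSpawnDeduplicator {
  // Stand-ins for Bazel internals; both are assumptions for this sketch.
  record SpawnOutputs(List<String> outputPaths) {}
  interface LocalRunner {
    SpawnOutputs executeAndUpload(String actionKey) throws Exception;
  }

  private final ConcurrentHashMap<String, CompletableFuture<SpawnOutputs>> inFlight =
      new ConcurrentHashMap<>();

  SpawnOutputs executeOrWait(String actionKey, LocalRunner runner) throws Exception {
    CompletableFuture<SpawnOutputs> ours = new CompletableFuture<>();
    CompletableFuture<SpawnOutputs> first = inFlight.putIfAbsent(actionKey, ours);
    if (first != null) {
      // An identical spawn is already running locally: wait until it has
      // finished uploading, then reuse (copy) its outputs instead of re-executing.
      return first.get();
    }
    try {
      SpawnOutputs outputs = runner.executeAndUpload(actionKey);
      ours.complete(outputs);
      return outputs;
    } catch (Exception e) {
      // Failure path: wake up the waiters with the error...
      ours.completeExceptionally(e);
      throw e;
    } finally {
      // ...and clear the entry either way so later spawns start fresh.
      inFlight.remove(actionKey);
    }
  }
}

Copying the first spawn's outputs into each waiter's own output locations is elided here; that is the part the eventual fix below describes as "copy over its local outputs".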

@DavidANeil (Contributor, Author) commented
We have some developers using local execution with a remote cache. Personally, I am using remote execution with the dynamic spawn strategy (and --noexperimental_dynamic_exclude_tools), and our CI system uses only remote execution (except for the few actions we mark no-remote-exec).

So, ideally, the solution would work for all three of those modes.

tjgq added the P3 label (We're not considering working on this, but happy to review a PR) and removed the untriaged label Jan 30, 2024
fmeum self-assigned this May 27, 2024
fmeum added the P1 label (I'll work on this now) and removed the P3 label May 27, 2024
fmeum added a commit to fmeum/bazel that referenced this issue Jul 23, 2024
When path mapping is enabled, different `Spawn`s in the same build can have identical `RemoteAction.ActionKey`s and can thus provide remote cache hits for each other. However, cache hits are only possible after the first local execution has concluded and uploaded its result to the cache.

To avoid unnecessary duplication of local work, the first `Spawn` for each `RemoteAction.ActionKey` is tracked until its results have been uploaded and all other concurrently scheduled `Spawn`s wait for it and then copy over its local outputs.

Fixes bazelbuild#21043

Closes bazelbuild#22556.

PiperOrigin-RevId: 655097996
Change-Id: I4368f9210c67a306775164d252aae122d8b46f9b
github-merge-queue bot pushed a commit that referenced this issue Jul 29, 2024
Closes #23060
@iancha1992 (Member) commented
A fix for this issue has been included in Bazel 7.3.0 RC1. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.3.0rc1. Thanks!
