
--experimental_output_paths=strip is not effective when actions are scheduled in parallel #21043

Closed
DavidANeil opened this issue Jan 26, 2024 · 8 comments
Labels: P1 (I'll work on this now. Assignee required) · team-Local-Exec (Issues and PRs for the Execution (Local) team) · type: feature request

Comments

@DavidANeil (Contributor) commented Jan 26, 2024

Description of the feature request:

--experimental_output_paths=strip can allow an exec and a target spawn to have the same input digest, and therefore to be cache hits for each other (path mapping strips the configuration segment from output paths, so e.g. both bazel-out/k8-fastbuild/bin/... and the exec configuration's bazel-out/k8-opt-exec-*/bin/... map to bazel-out/bin/...). Unfortunately, because these actions frequently have identical or exactly duplicated input dependency graphs, it isn't uncommon for the two actions to be scheduled at the same time, meaning the work is still duplicated.

I'm not totally certain about the solution, but my first rough idea is that the spawn strategy should, if this flag is enabled, check whether any sibling spawns match the input digest of the spawn that is about to start executing, and if so somehow "join" the spawn results.

Which category does this issue belong to?

Local Execution

What underlying problem are you trying to solve with this feature?

Our host and target platforms are identical, so all duplicated [for tool] actions are pure overhead. This flag seems promising for reducing that overhead, but we still frequently see duplicated actions, since the "exec" and "target" dependency chains tend to stay in sync: when one branch falls behind by scheduling chance, it gets cache hits until it catches back up and starts duplicating execution again.

@fmeum (Collaborator) commented Jan 26, 2024

@coeuvre @tjgq If I understand the logic around RemoteCacheClient correctly, Bazel will not deduplicate inflight requests to the remote cache/executor - which makes sense without path mapping, as an action previously couldn't generate cache hits for other actions in the same build. Is this something you think could be changed easily, e.g. by tracking in-flight requests and returning a Future waiting for the result of the first one?

CC @gregestren
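
For illustration, a minimal sketch of the in-flight deduplication suggested above, assuming a map of futures keyed by the action digest (all class and method names here are hypothetical, not the actual RemoteCacheClient API):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Placeholder for the cache's result type; stands in for the REAPI ActionResult proto.
record ActionResult(String contents) {}

// Hypothetical wrapper: concurrent lookups for the same action digest share one request.
final class DeduplicatingCacheClient {
  private final ConcurrentHashMap<String, CompletableFuture<ActionResult>> inFlight =
      new ConcurrentHashMap<>();

  CompletableFuture<ActionResult> downloadActionResult(String actionDigest) {
    return inFlight.computeIfAbsent(
        actionDigest,
        digest ->
            fetchFromCache(digest)
                // Clear the entry on completion so a later request (e.g. a retry
                // after a failure) issues a fresh call instead of reusing a failed one.
                .whenComplete((result, error) -> inFlight.remove(digest)));
  }

  // Stand-in for the actual gRPC call to the remote cache.
  private CompletableFuture<ActionResult> fetchFromCache(String digest) {
    return CompletableFuture.supplyAsync(() -> new ActionResult("result for " + digest));
  }
}

Removing the map entry on completion keeps the deduplication scoped to truly concurrent requests, which sidesteps any staleness questions.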

@fmeum (Collaborator) commented Jan 26, 2024

To be more concrete, this is the test case that we would want to pass:

function test_path_stripping_cache_hit_for_parallel_action() {
  mkdir rules
  cat > rules/defs.bzl <<'EOF'
def _platforms_split_transition_impl(settings, attr):
    return [
        {"//command_line_option:platforms": "//rules:foo"},
        {"//command_line_option:platforms": "//rules:bar"},
    ]

_platforms_split_transition = transition(
    implementation = _platforms_split_transition_impl,
    inputs = [],
    outputs = ["//command_line_option:platforms"],
)

def _echo_impl(ctx):
    out = ctx.actions.declare_file(ctx.label.name)
    args = ctx.actions.args()
    args.add(out)
    ctx.actions.run_shell(
        outputs = [out],
        arguments = [args],
        command = "sleep 2 && echo '{}' > $1".format(ctx.attr.msg, out.path),
        execution_requirements = {"supports-path-mapping": "1"},
    )
    return [
        DefaultInfo(files = depset([out])),
    ]

echo = rule(
    _echo_impl,
    attrs = {
        "msg": attr.string(),
    },
)

def _cat_impl(ctx):
    out = ctx.actions.declare_file(ctx.label.name)
    inputs = [file for src in ctx.attr.src for file in src[DefaultInfo].files.to_list()]
    ctx.actions.run_shell(
        inputs = inputs,
        outputs = [out],
        command = "cat {} > {}".format(" ".join([f.path for f in inputs]), out.path),
    )
    return [
        DefaultInfo(files = depset([out])),
    ]

cat = rule(
    _cat_impl,
    attrs = {
        "src": attr.label(cfg = _platforms_split_transition),
    },
)
EOF
  cat > rules/BUILD << 'EOF'
platform(name = "foo", parents = ["@local_config_platform//:host"])
platform(name = "bar", parents = ["@local_config_platform//:host"])
EOF

  mkdir -p pkg
  cat > pkg/BUILD <<'EOF'
load("//rules:defs.bzl", "cat", "echo")

echo(
    name = "gen_src",
    msg = "Hello, Cache!",
)

cat(
    name = "cat",
    src = ":gen_src",
)
EOF

  bazel build \
    --experimental_output_paths=strip \
    --remote_executor=grpc://localhost:${worker_port} \
    //pkg:cat &> $TEST_log || fail "build failed unexpectedly"
  expect_log '2 remote[^ ]'
  expect_log '1 remote cache hit'
}
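
(The two echo spawns differ only in the platform they are built for, so with path stripping they share an action digest. The assertions then expect two remote executions, namely one of the echo actions plus the cat action, and a remote cache hit for the other echo, even though both echo spawns are scheduled within the same build.)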

@tjgq (Contributor) commented Jan 26, 2024
tjgq commented Jan 26, 2024

> @coeuvre @tjgq If I understand the logic around RemoteCacheClient correctly, Bazel will not deduplicate inflight requests to the remote cache/executor - which makes sense without path mapping, as an action previously couldn't generate cache hits for other actions in the same build. Is this something you think could be changed easily, e.g. by tracking in-flight requests and returning a Future waiting for the result of the first one?

I think this makes sense, yes.

iancha1992 added the team-Local-Exec label Jan 26, 2024
@tjgq (Contributor) commented Jan 27, 2024
tjgq commented Jan 27, 2024

Thinking some more about it: is this really worth doing if we assume that the remote executor is able to deduplicate identical concurrent requests on its own? The spec doesn't mandate it, but it does hint at the possibility, and I'd expect a well-optimized implementation to do so. (It's at least true of Google's internal equivalent of remote execution.)

On the Bazel side, we can't really save any pre-execution work (we need to calculate the action digest anyway to detect duplicates) or any post-execution work (we don't have the ability to deduplicate downloads of the same blob into many output locations, and even if we did, it would only matter for actions "built with the bytes"). So I think the gains would be solely in reduced (non-blob) network traffic, which I'm not sure is worth the extra complexity.

@fmeum (Collaborator) commented Jan 27, 2024

That's a good point! Since deduplication within a single build is really just a special case of general deduplication across concurrent builds for a remote executor, it's better positioned to do this than Bazel.

However, this still leaves the case of local execution with a remote (or disk) cache. In this situation, I don't see how it would be possible for the cache to transparently deduplicate requests, given that it doesn't have control over whether and when the results of an action that incurred a cache miss will be uploaded.

@tjgq Do you maybe see a way to push this kind of deduplication into the cache? If not, would you consider that to be a sensible addition?

@DavidANeil I just noticed the "Local execution" category you chose for the issue. Are you using a remote cache together with local execution?

@tjgq (Contributor) commented Jan 28, 2024

For local execution, I think the rough outline of "keep a map of futures for in-flight executions, resolve the future on action upload, block on the future when downloading an identical action" might be doable. But it's the sort of thing that I'd have to write to convince myself (for example, it's a little unclear to me what the failure path would look like).
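
For illustration, a rough sketch of that outline (all names are hypothetical stand-ins for Bazel internals; on the failure path, the future is completed exceptionally and the map entry is cleared so a later identical spawn does not latch onto a failed execution):

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical coordinator: the first spawn per action key executes and uploads;
// identical concurrent spawns block on its future and then reuse its outputs.
final class LocalSpawnDeduplicator {
  // Stand-ins for Bazel internals; both are assumptions for this sketch.
  record SpawnOutputs(List<String> outputPaths) {}
  interface LocalRunner {
    SpawnOutputs executeAndUpload(String actionKey) throws Exception;
  }

  private final ConcurrentHashMap<String, CompletableFuture<SpawnOutputs>> inFlight =
      new ConcurrentHashMap<>();

  SpawnOutputs executeOrWait(String actionKey, LocalRunner runner) throws Exception {
    CompletableFuture<SpawnOutputs> ours = new CompletableFuture<>();
    CompletableFuture<SpawnOutputs> first = inFlight.putIfAbsent(actionKey, ours);
    if (first != null) {
      // An identical spawn is already running locally: wait until it has
      // finished uploading, then reuse (copy) its outputs instead of re-executing.
      return first.get();
    }
    try {
      SpawnOutputs outputs = runner.executeAndUpload(actionKey);
      ours.complete(outputs);
      return outputs;
    } catch (Exception e) {
      // Failure path: wake up the waiters with the error...
      ours.completeExceptionally(e);
      throw e;
    } finally {
      // ...and clear the entry either way so later spawns start fresh.
      inFlight.remove(actionKey);
    }
  }
}

Copying the first spawn's outputs into each waiter's own output locations is elided here; that is the part the eventual fix below describes as "copy over its local outputs".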

@DavidANeil (Contributor, Author) commented
We have some developers using local execution with a remote cache. Personally, I am using remote execution with the dynamic spawn strategy (and --noexperimental_dynamic_exclude_tools), and our CI system uses only remote execution (except for the few actions we mark no-remote-exec).

So, ideally, the solution would work for all three of those modes.

tjgq added the P3 label (We're not considering working on this, but happy to review a PR) and removed the untriaged label Jan 30, 2024
fmeum self-assigned this May 27, 2024
fmeum added the P1 label (I'll work on this now) and removed the P3 label May 27, 2024
fmeum added a commit to fmeum/bazel that referenced this issue Jul 23, 2024
When path mapping is enabled, different `Spawn`s in the same build can have identical `RemoteAction.ActionKey`s and can thus provide remote cache hits for each other. However, cache hits are only possible after the first local execution has concluded and uploaded its result to the cache.

To avoid unnecessary duplication of local work, the first `Spawn` for each `RemoteAction.ActionKey` is tracked until its results have been uploaded and all other concurrently scheduled `Spawn`s wait for it and then copy over its local outputs.

Fixes bazelbuild#21043

Closes bazelbuild#22556.

PiperOrigin-RevId: 655097996
Change-Id: I4368f9210c67a306775164d252aae122d8b46f9b
github-merge-queue bot pushed a commit that referenced this issue Jul 29, 2024
Closes #23060
@iancha1992 (Member) commented
A fix for this issue has been included in Bazel 7.3.0 RC1. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.3.0rc1. Thanks!
