Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Force synchronous upload and reuse of possibly modified spawn outputs #23382

Closed
wants to merge 5 commits into from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Aug 21, 2024

When an action may modify a spawn's outputs after execution, the upload of outputs to the cache and reuse for deduplicated actions need to happen synchronously directly after spawn execution to avoid a race.

This commit implements this for cache uploads by marking all actions with this property and simply disabling async upload for all spawns executed by such actions.

For output reuse, all executions deduplicated against the first one register atomically upon deduplication and cause the cache upload to wait for all of them to complete reuse.

Fixes #22501
Fixes #23288
Work towards #21578
Closes #23307 (no longer needed)

@fmeum fmeum marked this pull request as ready for review August 21, 2024 14:31
@fmeum fmeum requested review from a team and lberki as code owners August 21, 2024 14:31
@fmeum fmeum requested review from gregestren and tjgq and removed request for a team, lberki and gregestren August 21, 2024 14:31
@github-actions github-actions bot added team-Performance Issues for Performance teams team-Configurability platforms, toolchains, cquery, select(), config transitions team-Rules-Java Issues for Java rules team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Aug 21, 2024
@fmeum
Copy link
Collaborator Author

fmeum commented Aug 21, 2024

FYI @aranguyen: We don't need to modify anything about how paths are rewritten in Java actions to benefit from execution deduplication with this change. We should still look into getting rid of the in-place modification for direct classpath header compilation to avoid the synchronous fallback introduced by this change.

@fmeum
Copy link
Collaborator Author

fmeum commented Aug 21, 2024

@tjgq I have no idea how to test this short of running a large build in CI. I tested it manually on Bazel itself, with promising results: I no longer get any failures and build times improve noticeably.

Representative build results for bazel build //src/main/java/... with an initially empty disk cache on an M3 MacBook:

# Baseline
INFO: Elapsed time: 102.879s, Critical Path: 24.86s
INFO: 5128 processes: 145 internal, 3076 darwin-sandbox, 1907 worker.

# --experimental_worker_multiplex_sandboxing
INFO: Elapsed time: 115.551s, Critical Path: 29.83s
INFO: 5128 processes: 145 internal, 3076 darwin-sandbox, 1907 worker.

# --experimental_worker_multiplex_sandboxing --experimental_output_paths=strip, with deduplication patched out
INFO: Elapsed time: 95.024s, Critical Path: 27.01s
INFO: 5128 processes: 1438 disk cache hit, 145 internal, 2375 darwin-sandbox, 1170 worker.

# --experimental_worker_multiplex_sandboxing --experimental_output_paths=strip
INFO: Elapsed time: 80.242s, Critical Path: 22.04s
INFO: 5128 processes: 1367 disk cache hit, 145 internal, 2171 darwin-sandbox, 459 deduplicated, 986 worker.

I expect the sandboxing overhead to be lower on Linux, but even with the overhead, path mapping is substantially faster.

@tjgq tjgq self-assigned this Aug 21, 2024
Copy link
Contributor

@tjgq tjgq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! I think we should be in good shape to flip --experimental_remote_cache_async once it's in.

Copy link
Contributor

@tjgq tjgq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not going to insist too much because I know this might be a lot of effort, but: is there some way we can write a test to verify that a fast, async-unsafe spawn waits for a slow, concurrent one to reuse its outputs before proceeding?

@fmeum
Copy link
Collaborator Author

fmeum commented Aug 23, 2024

I'm not going to insist too much because I know this might be a lot of effort, but: is there some way we can write a test to verify that a fast, async-unsafe spawn waits for a slow, concurrent one to reuse its outputs before proceeding?

I added a Java test that relies on mocking to make waitForAndReuseOutputs controllably slow. I don't think it's guaranteed to catch the issue if it comes back, but I verified that removing the call to awaitAllOutputReuse makes it fail.

The problem is that we can't just make the other spawn slow (it doesn't run anyway), we need to make its output reuse slow. A more realistic test would be possible if we could somehow delay file system operations. Is that something that the test filesystem supports (say, stall updates to /exec/root/bazel-bin/k8-opt/bin/output on a synchronization barrier)?

@fmeum fmeum requested a review from tjgq August 23, 2024 11:21
@tjgq
Copy link
Contributor

tjgq commented Aug 26, 2024

I'm not going to insist too much because I know this might be a lot of effort, but: is there some way we can write a test to verify that a fast, async-unsafe spawn waits for a slow, concurrent one to reuse its outputs before proceeding?

I added a Java test that relies on mocking to make waitForAndReuseOutputs controllably slow. I don't think it's guaranteed to catch the issue if it comes back, but I verified that removing the call to awaitAllOutputReuse makes it fail.

The problem is that we can't just make the other spawn slow (it doesn't run anyway), we need to make its output reuse slow. A more realistic test would be possible if we could somehow delay file system operations. Is that something that the test filesystem supports (say, stall updates to /exec/root/bazel-bin/k8-opt/bin/output on a synchronization barrier)?

It might be possible with a custom FileSystem implementation; see https://cs.opensource.google/bazel/bazel/+/master:src/test/java/com/google/devtools/build/lib/remote/ActionInputPrefetcherTestBase.java;l=134;drc=3fe227ff418ef81eb5f221625bfd90f52ac70150 for inspiration.

@fmeum
Copy link
Collaborator Author

fmeum commented Aug 26, 2024

Thanks for the pointer, that looks pretty interesting. Since the main code path here goes through FileSystemUtils.copyFile and not a fixed method on Path, I think that the current approach may even be more robust though (I am looking into changing the implementation to use NIO).

I added an assert that should make it less likely that the test fails open.

Comment on lines 850 to 855
// Make it more likely to detect races by waiting for this thread to make as much progress as
// possible before letting the second spawn continue.
while (completeFirstSpawn.getState().compareTo(Thread.State.WAITING) < 0) {
Thread.sleep(10);
}
assertThat(completeFirstSpawn.getState()).isEqualTo(Thread.State.WAITING);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we instead use a latch to explicitly synchronize with the call to uploadOutputs (and avoid unnecessarily slowing down the test)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Note that with the latch we can no longer assert that the thread would naturally wait (in the call to awaitAllOutputReuse), so I removed that assert.

@fmeum fmeum requested a review from tjgq August 27, 2024 12:26
@tjgq tjgq added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Aug 27, 2024
@fmeum
Copy link
Collaborator Author

fmeum commented Aug 27, 2024

@bazel-io fork 7.4.0

@github-actions github-actions bot removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label Aug 27, 2024
@fmeum fmeum deleted the 23288-synchronize-reuse branch August 27, 2024 20:26
fmeum added a commit to fmeum/bazel that referenced this pull request Sep 19, 2024
When an action may modify a spawn's outputs after execution, the upload of outputs to the cache and reuse for deduplicated actions need to happen synchronously directly after spawn execution to avoid a race.

This commit implements this for cache uploads by marking all actions with this property and simply disabling async upload for all spawns executed by such actions.

For output reuse, all executions deduplicated against the first one register atomically upon deduplication and cause the cache upload to wait for all of them to complete reuse.

Fixes bazelbuild#22501
Fixes bazelbuild#23288
Work towards bazelbuild#21578
Closes bazelbuild#23307 (no longer needed)

Closes bazelbuild#23382.

PiperOrigin-RevId: 668101364
Change-Id: Ice75dbe14a7dd46e02ecb096d2b2a30940216356
@fmeum fmeum mentioned this pull request Sep 19, 2024
github-merge-queue bot pushed a commit that referenced this pull request Sep 19, 2024
Cherry-picks the following changes to implement output reuse:
* Deduplicate locally executed path mapped spawns (#22556)
* Fix local execution deduplication to work with optional outputs
(#23296)
* Force synchronous upload and reuse of possibly modified spawn outputs
(#23382)
* Add support for in-memory outputs to output reuse (#23422)

Fixes #23377
Fixes #23444
Fixes #23457
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Configurability platforms, toolchains, cquery, select(), config transitions team-Performance Issues for Performance teams team-Remote-Exec Issues and PRs for the Execution (Remote) team team-Rules-Java Issues for Java rules
Projects
None yet
2 participants