Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race conditions with local action deduplication #23288

Closed
fmeum opened this issue Aug 13, 2024 · 7 comments
Closed

Race conditions with local action deduplication #23288

fmeum opened this issue Aug 13, 2024 · 7 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) team-Configurability platforms, toolchains, cquery, select(), config transitions team-Local-Exec Issues and PRs for the Execution (Local) team type: bug

Comments

@fmeum
Copy link
Collaborator

fmeum commented Aug 13, 2024

While testing path mapping on Bazel itself with 7.3.0, I just encountered the following exceptions:

java.lang.RuntimeException: Unrecoverable error while evaluating node 'ActionLookupData0{actionLookupKey=ConfiguredTargetKey{label=//src/main/java/com/google/devtools/build/lib/sandbox/cgroups/controller:v2, config=BuildConfigurationKey[153e20f2e3739e0fdcb3b5e06c3ec8614ec4e43b5ff5f93c78bf2a56108ee671]}, actionIndex=0}' (requested by nodes 'ArtifactNestedSetKey[4]@42667357', 'ArtifactNestedSetKey[3]@612534160', 'ArtifactNestedSetKey[8]@776188326')
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:550)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:414)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Caused by: java.lang.IllegalStateException: Missing original path for mapped path bazel-out/darwin_arm64-fastbuild/bin/src/main/java/com/google/devtools/build/lib/sandbox/cgroups/controller/libcontroller-hjar.jar in bazel-out/darwin_arm64-opt-exec-ST-fad1763555eb/bin/src/main/java/com/google/devtools/build/lib/sandbox/cgroups/controller/libv2-hjar.jdeps
jdeps: dependency {
  path: "bazel-out/darwin_arm64-fastbuild/bin/src/main/java/com/google/devtools/build/lib/sandbox/cgroups/controller/libcontroller-hjar.jar"
  kind: EXPLICIT
}
rule_label: "//src/main/java/com/google/devtools/build/lib/sandbox/cgroups/controller:v2"
success: true

path map: {bazel-out/cfg/bin/external/rules_java~/toolchains/platformclasspath.jar=bazel-out/darwin_arm64-opt-exec-ST-fad1763555eb/bin/external/rules_java~/toolchains/platformclasspath.jar, bazel-out/cfg/bin/src/main/java/com/google/devtools/build/lib/sandbox/cgroups/controller/libcontroller-hjar.jar=bazel-out/darwin_arm64-opt-exec-ST-fad1763555eb/bin/src/main/java/com/google/devtools/build/lib/sandbox/cgroups/controller/libcontroller-hjar.jar, bazel-out/cfg/bin/external/rules_jvm_external~~maven~maven/com/google/errorprone/error_prone_annotations/2.24.1/header_error_prone_annotations-2.24.1.jar=bazel-out/darwin_arm64-opt-exec-ST-fad1763555eb/bin/external/rules_jvm_external~~maven~maven/com/google/errorprone/error_prone_annotations/2.24.1/header_error_prone_annotations-2.24.1.jar, bazel-out/cfg/bin/external/rules_jvm_external~~maven~maven/com/google/errorprone/error_prone_type_annotations/2.23.0/header_error_prone_type_annotations-2.23.0.jar=bazel-out/darwin_arm64-opt-exec-ST-fad1763555eb/bin/external/rules_jvm_external~~maven~maven/com/google/errorprone/error_prone_type_annotations/2.23.0/header_error_prone_type_annotations-2.23.0.jar, bazel-out/cfg/bin/external/rules_jvm_external~~maven~maven/com/github/stephenc/jcip/jcip-annotations/1.0-1/header_jcip-annotations-1.0-1.jar=bazel-out/darwin_arm64-opt-exec-ST-fad1763555eb/bin/external/rules_jvm_external~~maven~maven/com/github/stephenc/jcip/jcip-annotations/1.0-1/header_jcip-annotations-1.0-1.jar, bazel-out/cfg/bin/external/rules_jvm_external~~maven~maven/com/google/code/findbugs/jsr305/3.0.2/header_jsr305-3.0.2.jar=bazel-out/darwin_arm64-opt-exec-ST-fad1763555eb/bin/external/rules_jvm_external~~maven~maven/com/google/code/findbugs/jsr305/3.0.2/header_jsr305-3.0.2.jar, bazel-out/cfg/bin/external/rules_jvm_external~~maven~maven/com/google/guava/guava/33.0.0-jre/header_guava-33.0.0-jre.jar=bazel-out/darwin_arm64-opt-exec-ST-fad1763555eb/bin/external/rules_jvm_external~~maven~maven/com/google/guava/guava/33.0.0-jre/header_guava-33.0.0-jre.jar, bazel-out/cfg/bin/external/rules_jvm_external~~maven~maven/com/google/guava/failureaccess/1.0.2/header_failureaccess-1.0.2.jar=bazel-out/darwin_arm64-opt-exec-ST-fad1763555eb/bin/external/rules_jvm_external~~maven~maven/com/google/guava/failureaccess/1.0.2/header_failureaccess-1.0.2.jar, bazel-out/cfg/bin/external/rules_jvm_external~~maven~maven/com/google/guava/listenablefuture/9999.0-empty-to-avoid-conflict-with-guava/header_listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar=bazel-out/darwin_arm64-opt-exec-ST-fad1763555eb/bin/external/rules_jvm_external~~maven~maven/com/google/guava/listenablefuture/9999.0-empty-to-avoid-conflict-with-guava/header_listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar, bazel-out/cfg/bin/external/rules_jvm_external~~maven~maven/com/google/j2objc/j2objc-annotations/2.8/header_j2objc-annotations-2.8.jar=bazel-out/darwin_arm64-opt-exec-ST-fad1763555eb/bin/external/rules_jvm_external~~maven~maven/com/google/j2objc/j2objc-annotations/2.8/header_j2objc-annotations-2.8.jar, bazel-out/cfg/bin/external/rules_jvm_external~~maven~maven/org/checkerframework/checker-qual/3.42.0/header_checker-qual-3.42.0.jar=bazel-out/darwin_arm64-opt-exec-ST-fad1763555eb/bin/external/rules_jvm_external~~maven~maven/org/checkerframework/checker-qual/3.42.0/header_checker-qual-3.42.0.jar}
	at com.google.devtools.build.lib.rules.java.JavaCompileAction.createFullOutputDeps(JavaCompileAction.java:796)
	at com.google.devtools.build.lib.rules.java.JavaHeaderCompileAction.afterExecute(JavaHeaderCompileAction.java:147)
	at com.google.devtools.build.lib.analysis.actions.SpawnAction.execute(SpawnAction.java:262)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.executeAction(SkyframeActionExecutor.java:1159)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1076)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:165)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:94)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:573)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:862)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:334)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:172)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:461)
	... 7 more

and

java.io.FileNotFoundException: /private/var/tmp/_bazel_fmeum/507738cfc7e6cde00e4a0230e9aa0722/execroot/_main/bazel-out/_tmp/actions/remote/175.tmp (No such file or directory)
	at com.google.devtools.build.lib.unix.NativePosixFiles.lstat(Native Method)
	at com.google.devtools.build.lib.unix.UnixFileSystem.statInternal(UnixFileSystem.java:212)
	at com.google.devtools.build.lib.unix.UnixFileSystem.stat(UnixFileSystem.java:201)
	at com.google.devtools.build.lib.vfs.Path.stat(Path.java:290)
	at com.google.devtools.build.lib.vfs.FileSystemUtils.moveFile(FileSystemUtils.java:456)
	at com.google.devtools.build.lib.remote.RemoteExecutionService.moveOutputsToFinalLocation(RemoteExecutionService.java:878)
	at com.google.devtools.build.lib.remote.RemoteExecutionService.waitForAndReuseOutputs(RemoteExecutionService.java:1473)
	at com.google.devtools.build.lib.remote.RemoteSpawnCache.lookup(RemoteSpawnCache.java:169)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:151)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:118)
	at com.google.devtools.build.lib.exec.SpawnStrategyResolver.exec(SpawnStrategyResolver.java:45)
	at com.google.devtools.build.lib.rules.java.JavaCompileAction.execute(JavaCompileAction.java:425)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.executeAction(SkyframeActionExecutor.java:1155)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1068)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:166)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:95)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:559)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:902)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:349)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:190)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:467)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:435)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1423)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)

Originally posted by @fmeum in #22658 (comment)

@fmeum fmeum self-assigned this Aug 13, 2024
@fmeum fmeum added P1 I'll work on this now. (Assignee required) team-Configurability platforms, toolchains, cquery, select(), config transitions team-Local-Exec Issues and PRs for the Execution (Local) team type: bug labels Aug 13, 2024
@fmeum
Copy link
Collaborator Author

fmeum commented Aug 13, 2024

@bazel-io fork 7.3.1

@fmeum
Copy link
Collaborator Author

fmeum commented Aug 14, 2024

The second issue is fixed by #23296.

The first one is more difficult to solve: the Java compilation action runs two spawns and modifies or deletes outputs in multiple cases, both before falling back to the second spawn and after each spawn to rewrite the .jdeps file.

@tjgq Do you happen to know how remote cache uploads interact with this logic? It seems like either they are robust against this kind of race and then local execution deduplication could reuse the same logic or they are potentially affected by this as well.

@fmeum
Copy link
Collaborator Author

fmeum commented Aug 14, 2024

With the small fix in #23296 as well as #23307, which temporarily disables deduplication for Java actions, this problem should be gone at HEAD. If they get merged in time, we could even cherry-pick the rollforward with these two fixes into 7.3.1.

@fmeum
Copy link
Collaborator Author

fmeum commented Aug 14, 2024

@bazel-io fork 7.3.1

@fmeum
Copy link
Collaborator Author

fmeum commented Aug 14, 2024

@iancha1992 Please consider this a (very) soft blocker for 7.3.1 only: it would be great to get in if ready in time, but there is no need to block the release for it.

@tjgq
Copy link
Contributor

tjgq commented Aug 20, 2024

@tjgq Do you happen to know how remote cache uploads interact with this logic? It seems like either they are robust against this kind of race and then local execution deduplication could reuse the same logic or they are potentially affected by this as well.

I would not expect remote uploads to be robust; the upload logic assumes that, once an output file is created, it will not disappear for the remainder of the build.

This is also a problem for async uploading. At some point I'd like to find a general solution, but I think for the time being we should special-case our way out of it (as you're already doing in #23307).

copybara-service bot pushed a commit that referenced this issue Aug 21, 2024
This fixes failures such as the following when a spawn, e.g. a reduced Java compilation spawn, doesn't create an optional output:

```
java.io.FileNotFoundException: /private/var/tmp/_bazel_fmeum/507738cfc7e6cde00e4a0230e9aa0722/execroot/_main/bazel-out/_tmp/actions/remote/175.tmp (No such file or directory)
	at com.google.devtools.build.lib.unix.NativePosixFiles.lstat(Native Method)
	at com.google.devtools.build.lib.unix.UnixFileSystem.statInternal(UnixFileSystem.java:212)
	at com.google.devtools.build.lib.unix.UnixFileSystem.stat(UnixFileSystem.java:201)
	at com.google.devtools.build.lib.vfs.Path.stat(Path.java:290)
	at com.google.devtools.build.lib.vfs.FileSystemUtils.moveFile(FileSystemUtils.java:456)
	at com.google.devtools.build.lib.remote.RemoteExecutionService.moveOutputsToFinalLocation(RemoteExecutionService.java:878)
        ...
```

Also clean up temporary files in case of an exception.

Work towards #23288

Closes #23296.

PiperOrigin-RevId: 665744936
Change-Id: I89a409c7a6b28b2a5fa532bdb233dca9bc5bde73
fmeum added a commit to fmeum/bazel that referenced this issue Sep 19, 2024
This fixes failures such as the following when a spawn, e.g. a reduced Java compilation spawn, doesn't create an optional output:

```
java.io.FileNotFoundException: /private/var/tmp/_bazel_fmeum/507738cfc7e6cde00e4a0230e9aa0722/execroot/_main/bazel-out/_tmp/actions/remote/175.tmp (No such file or directory)
	at com.google.devtools.build.lib.unix.NativePosixFiles.lstat(Native Method)
	at com.google.devtools.build.lib.unix.UnixFileSystem.statInternal(UnixFileSystem.java:212)
	at com.google.devtools.build.lib.unix.UnixFileSystem.stat(UnixFileSystem.java:201)
	at com.google.devtools.build.lib.vfs.Path.stat(Path.java:290)
	at com.google.devtools.build.lib.vfs.FileSystemUtils.moveFile(FileSystemUtils.java:456)
	at com.google.devtools.build.lib.remote.RemoteExecutionService.moveOutputsToFinalLocation(RemoteExecutionService.java:878)
        ...
```

Also clean up temporary files in case of an exception.

Work towards bazelbuild#23288

Closes bazelbuild#23296.

PiperOrigin-RevId: 665744936
Change-Id: I89a409c7a6b28b2a5fa532bdb233dca9bc5bde73
fmeum added a commit to fmeum/bazel that referenced this issue Sep 19, 2024
When an action may modify a spawn's outputs after execution, the upload of outputs to the cache and reuse for deduplicated actions need to happen synchronously directly after spawn execution to avoid a race.

This commit implements this for cache uploads by marking all actions with this property and simply disabling async upload for all spawns executed by such actions.

For output reuse, all executions deduplicated against the first one register atomically upon deduplication and cause the cache upload to wait for all of them to complete reuse.

Fixes bazelbuild#22501
Fixes bazelbuild#23288
Work towards bazelbuild#21578
Closes bazelbuild#23307 (no longer needed)

Closes bazelbuild#23382.

PiperOrigin-RevId: 668101364
Change-Id: Ice75dbe14a7dd46e02ecb096d2b2a30940216356
@iancha1992
Copy link
Member

A fix for this issue has been included in Bazel 7.4.0 RC1. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.4.0rc1. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 I'll work on this now. (Assignee required) team-Configurability platforms, toolchains, cquery, select(), config transitions team-Local-Exec Issues and PRs for the Execution (Local) team type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants