remote/correctness: Bazel doesn't track system libraries. #4558

buchgr · 2018-02-01T15:15:48Z

Bazel currently does not track tools outside a workspace. This can be a problem if, for example, an action uses a compiler from /usr/bin/. Two users with different compilers installed will then wrongly share cache hits: their outputs differ, but the actions have the same hash.


BenTheElder · 2018-02-03T02:27:40Z

Is there a workaround for this currently?
Can we just do something like --action_env=COMPILER=<compiler version>?

buchgr · 2018-02-03T22:41:45Z

@BenTheElder Yep, exactly. It's not just compilers though; it's any tool from your system that your build uses without you explicitly specifying it. So you might want to include more info than just your compiler.
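
As a rough illustration of this workaround, the sketch below turns a mapping of tool names to version strings into `--action_env` flags. The helper name `action_env_flags`, the variable names `CC_VERSION` and `LD_VERSION`, and the version strings are all made up for the example; only the `--action_env=NAME=VALUE` form comes from the discussion above.

```python
import shlex


def action_env_flags(versions):
    """Build one --action_env=NAME=VALUE flag per entry, sorted by name for a stable order."""
    return [f"--action_env={name}={value}" for name, value in sorted(versions.items())]


# Hypothetical names and values; in practice each value would come from the tool itself,
# for example the first line printed by `gcc --version`.
versions = {"CC_VERSION": "gcc (Debian 7.3.0-1) 7.3.0", "LD_VERSION": "2.30"}
command = ["bazel", "build", "//..."] + action_env_flags(versions)
print(shlex.join(command))  # shell-quoted, since version strings may contain spaces
```

Sorting the names keeps the flag order stable between invocations, so the same set of tools always produces the same command line.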

edbaunton · 2018-02-04T14:01:05Z

Would the use of toolchains help with this?

ulfjack · 2018-02-05T09:35:29Z

Our naive proposal is to allow each toolchain to add a toolchain identifier into the cache key(s). The risk is that a bug (or a design flaw) in the toolchain might cause incorrect caching, but, on the other hand, it leaves the door open to toolchains that want to cache across platforms. E.g., a Java 7 compiler on Mac and Linux outputs essentially the same files, so caching those might be safe even if the binaries are not bit identical.

BenTheElder · 2018-02-12T22:11:11Z

I forgot to follow up here: setting --action_env doesn't seem to be sufficient, since not all actions consume this (?). I'm currently working around this by just making sure different caches are used for different toolchain versions.

buchgr · 2018-02-12T22:20:47Z

@BenTheElder It’s my understanding that this would be a bug in Bazel, correct @ulfjack? Do you have a reproducer?

Something that works for sure with every action is --experimental_remote_platform_override. It takes as a value the text representation of the Platform message defined in the remote_execution.proto. It’s effectively just a key value pair.

BenTheElder · 2018-02-12T22:31:08Z

I don't have a great reproducer currently. In https://github.com/kubernetes/test-infra I modify the bazelrc to point at a local copy of bazel-remote-cache from within a debian+bazel docker container, and then I build again from another container after upgrading the GCC version to gcc-7.

When doing this I noticed stale headers from gcc-4.9 being included as part of compiling https://github.com/google/protobuf as a dependency, despite setting --action_env=CC_VERSION=<exact version of the gcc package> each time. So I ran the build with -s and didn't see the env being passed to the failed compilation step.

@buchgr where is --experimental_remote_platform_override documented?

ulfjack · 2018-02-13T09:53:18Z

See #3320. :-/

buchgr · 2018-02-13T12:13:33Z

@BenTheElder ohh I apologize for the misinformation - it was my understanding that --action_env is supported by all actions. Please use the platform override then - I shall document it on our website.

Essentially, you can pass it the following protobuf:

--experimental_remote_platform_override='properties:{name:"key1" value:"value1" name:"key2" value:"value2"}'

This is passed through to the computation of the action key for remote caching.
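
A small sketch of how such a flag value could be generated from a dictionary. It assumes that each key/value pair goes into its own `properties` entry, which is how a repeated field is normally written in protobuf text format; the helper name `platform_override` is hypothetical, and whether Bazel accepts exactly this rendering has not been checked here.

```python
def platform_override(properties):
    """Render key/value pairs as Platform text format, one `properties` entry per pair."""

    def quote(text):
        # Escape backslashes and double quotes so the value stays one string literal.
        return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

    entries = [
        f"properties:{{name:{quote(name)} value:{quote(value)}}}"
        for name, value in sorted(properties.items())
    ]
    return "--experimental_remote_platform_override=" + " ".join(entries)


# Hypothetical property name and value.
print(platform_override({"cc_version": "gcc-7"}))
```

The resulting string still needs shell quoting (for example single quotes around the value) when it is passed on a command line.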

lisendong · 2018-02-14T08:40:24Z

When I build Bazel from source (bazel-0.10.0), I hit a system environment variable problem:

ERROR: /home/mpi/tensorflow/downloads/bazel-0.10.0/src/main/protobuf/BUILD:101:1: error executing shell command: 'cp 'bazel-out/k8-opt/bin/src/main/protobuf/command_server_java_grpc_srcs.jar' 'bazel-out/k8-opt/bin/src/main/protobuf/command_server_java_grpc_srcs.srcjar'' failed (Exit 127): bash failed: error executing command
  (cd /tmp/bazel_vAIW1oPT/out/execroot/io_bazel && \
  exec env - \
  /bin/bash -c 'cp '\''bazel-out/k8-opt/bin/src/main/protobuf/command_server_java_grpc_srcs.jar'\'' '\''bazel-out/k8-opt/bin/src/main/protobuf/command_server_java_grpc_srcs.srcjar'\''')
/bin/bash: cp: command not found
Target //src:bazel failed to build
INFO: Elapsed time: 55.301s, Critical Path: 39.40s
FAILED: Build did NOT complete successfully

This happens because on my system the "cp" command is under ~/bin/ rather than /bin (and I do not have root privileges).
How can I set the $PATH environment variable?

I have tried "--action_env=PATH", but it does not work.

gkossakowski · 2018-02-21T19:37:33Z

> E.g., a Java 7 compiler on Mac and Linux outputs essentially the same files, so caching those might be safe even if the binaries are not bit identical.
@ulfjack do you have plans for supporting caching of jdk-derived (e.g. javac compiled) outputs across platforms? is it possible to override the toolchain even in a hacky way today?

ulfjack · 2018-02-22T08:03:54Z

The current design does not make it easy to do cross-platform caching of platform-independent outputs (e.g., Java bytecode); we're currently adding in the full command-line and env variables, which will often differ (e.g., Windows command line contains backward slashes vs. Linux uses forward slashes).

Right now, if all the input files are platform-independent, and the command lines / env variables happen to be identical (e.g., this could happen for Linux / MacOS with the right flags), then the current cache will return a cache hit. That's actually a correctness issue, since we can't guarantee that the outputs are actually platform-independent; this would even happen for an action involving gcc if the command line, env variables, and inputs happen to be identical. I believe that we actually have a bug report for that somewhere.

Our rough plan for solving this is to add a toolchain identifier into the cache key to prevent such cache hits (this isn't a fully baked proposal yet), and I expect that we'll allow users to override those / provide identifiers that happen to be identical across platforms (handwave).
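
To make the collision concrete, here is a deliberately simplified model of a cache key. It is not Bazel's actual digest computation; the function `action_key`, its JSON encoding, and the `toolchain_id` parameter are assumptions made for illustration. The key depends only on the command line, the environment, and the input digests, so two machines with different compilers behind the same path get the same key until a toolchain identifier is mixed in.

```python
import hashlib
import json


def action_key(argv, env, input_digests, toolchain_id=None):
    """Hash the command line, environment and input digests into one hex key."""
    payload = {
        "argv": list(argv),
        "env": dict(env),
        "inputs": dict(input_digests),
    }
    if toolchain_id is not None:
        payload["toolchain"] = toolchain_id
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


argv = ["/usr/bin/gcc", "-c", "a.c", "-o", "a.o"]
env = {"PATH": "/usr/bin"}
inputs = {"a.c": hashlib.sha256(b"int main(void) { return 0; }\n").hexdigest()}

# Nothing in the key says which compiler sits behind /usr/bin/gcc,
# so a gcc-4.9 machine and a gcc-7 machine compute the same key.
shared_key = action_key(argv, env, inputs)

# Mixing in a (hypothetical) toolchain identifier separates the two.
key_gcc49 = action_key(argv, env, inputs, toolchain_id="gcc-4.9")
key_gcc7 = action_key(argv, env, inputs, toolchain_id="gcc-7")
print(key_gcc49 != key_gcc7)
```

A toolchain that wants cross-platform hits could, under this model, simply report the same identifier on both platforms.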

Getting cache hits across Windows and Linux / MacOS platforms is a more complex topic that we haven't really started to look into yet. We'll need to figure something out about paths for sure, but there are more problems. In particular, what immediately comes to mind is that line endings make source files not bit-for-bit identical so we can't use a straightforward hash, or we have to require users to use a consistent line ending convention across multiple platforms. Another problem is case sensitivity of file systems (or lack thereof), and the differences in command-line flags (e.g., Windows usually uses /flag, whereas Linux usually uses -flag or --flag).
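
The line-ending point can be shown in a few lines. This is only a sketch of the problem, not something Bazel does: the helper names are made up, and rewriting CRLF to LF before hashing would be wrong for binary files.

```python
import hashlib


def raw_digest(data: bytes) -> str:
    """Straightforward content hash: any byte difference changes the digest."""
    return hashlib.sha256(data).hexdigest()


def normalized_digest(data: bytes) -> str:
    """Hash after rewriting CRLF to LF, so both conventions map to one digest."""
    return hashlib.sha256(data.replace(b"\r\n", b"\n")).hexdigest()


unix_source = b"int x = 1;\nint y = 2;\n"
windows_source = b"int x = 1;\r\nint y = 2;\r\n"

print(raw_digest(unix_source) == raw_digest(windows_source))
print(normalized_digest(unix_source) == normalized_digest(windows_source))
```

The first comparison fails and the second succeeds, which is the choice described above: either normalize before hashing or require one line-ending convention in the repository.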

gkossakowski · 2018-02-27T00:39:45Z

@ulfjack thanks for the detailed response! This ticket gave me an idea how we could actually share outputs across platforms. I opened a new issue #4714 to discuss it. I'd appreciate the input.

BenTheElder · 2018-05-31T21:22:01Z

Following back up on this after a while, FWIW we've been working around this for kubernetes for a while now by hashing the toolchains ourselves (since we run in debian container(s) we can do this pretty easily) and using this to key our cache.

So far this has worked well enough as a stopgap.
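
A minimal sketch of what "hashing the toolchains ourselves" could look like; the actual kubernetes setup is not shown in this thread, so the function name `toolchain_key` and the choice of hashing file paths plus contents are assumptions.

```python
import hashlib
from pathlib import Path


def toolchain_key(paths):
    """Return a SHA-256 hex digest over the paths and contents of the given files."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        # Include the path so that moving a tool also changes the key.
        digest.update(str(path).encode("utf-8") + b"\0")
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()


# Example call (paths are hypothetical and must exist on the machine):
# key = toolchain_key(["/usr/bin/gcc", "/usr/bin/ld"])
```

Any change to one of the listed binaries produces a new key, which can then select a separate cache.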

drigz · 2019-02-28T15:48:49Z

@buchgr
--experimental_remote_platform_override is gone in 0.23.0. --remote_default_platform_properties seems to be the successor, but it's not clear if it will work in the presence of platform autodetection. Can you recommend a new workaround?

For now I will try `--remote_http_cache=https://${cache_host}/${extra_cache_key}`, inspired by @BenTheElder's link.
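
The same idea as a small helper, in case the extra key contains characters that are not safe in a URL path. The function name and the example host are hypothetical, and it is an assumption that the cache server treats each path prefix as a separate namespace.

```python
from urllib.parse import quote


def remote_cache_flag(cache_host, extra_cache_key):
    """Build a --remote_http_cache flag whose URL path carries the extra key."""
    # safe="" also percent-encodes "/", so the key stays a single path segment.
    return f"--remote_http_cache=https://{cache_host}/{quote(extra_cache_key, safe='')}"


print(remote_cache_flag("cache.example.com", "gcc 7/x86_64"))
```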

HackAttack · 2019-05-07T17:35:28Z

It’s not obvious to me that this issue is specific to the remote cache; wouldn’t it affect the local cache as well? If not, why not?

buchgr · 2019-05-08T11:04:08Z

@HackAttack correct. however, the local cache affects only a single user while the remote cache can do quite a bit of harm.

GabrielGhe · 2019-07-08T20:14:11Z

> I forgot to follow up here: setting --action_env doesn't seem to be sufficient since not all actions consume this (?) I'm currently working around this by just making sure different caches are used for different toolchain versions.

Could you elaborate on this? Which actions do not consume this and which do?

BenTheElder · 2019-07-08T20:21:31Z

> Could you elaborate on this? Which actions do not consume this and which do?

To be honest after more than a year I don't really remember and other details have changed pretty significantly since then. I no longer have much activity related to this.

We've been reasonably happy with the distinct cache locations, FWIW. I think recent efforts are focused on using GCP RBE which uses the build container image as a key IIRC.

In some client setups, untracked local files can be used by an action without being included in the Action message, which causes action cache collisions: bazelbuild/bazel#4558 Ideally this should be fixed on the client side (either in the client, or in the build configuration), but it is not always easy to do in practice. As a workaround, this patch adds a setting to mangle ActionCache keys with the instance name provided by the client (if it is not empty), to produce a new ActionCache key. Clients are then able to specify a different instance name whenever a change is made that could affect these untracked inputs. The instance name value could be something like the hash of the compiler version. This allows multiple ActionCache items to exist in the cache, without requiring a change to the on-disk storage format. This feature is disabled by default, since it would cause cache invalidations for existing users. Fixes buchgr#15.
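
The patch itself is not quoted here, so the following is only a guess at the shape of such a mangling step: `mangle_ac_key` and the way the instance name is combined with the hash are hypothetical. It shows the two properties the description mentions: an empty instance name leaves the key untouched, and the derived key has the same form as an ordinary SHA-256 key, so the storage layout does not need to change.

```python
import hashlib


def mangle_ac_key(action_hash, instance_name):
    """Return the key unchanged for an empty instance name, else a derived SHA-256 key."""
    if not instance_name:
        return action_hash
    combined = f"{instance_name}/{action_hash}".encode("utf-8")
    return hashlib.sha256(combined).hexdigest()


original = hashlib.sha256(b"some action").hexdigest()
print(mangle_ac_key(original, "") == original)
print(mangle_ac_key(original, "gcc-4.9") != mangle_ac_key(original, "gcc-7"))
```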

Flamefire · 2020-09-09T12:51:55Z

> Could you elaborate on this? Which actions do not consume this and which do?

I'm recently seeing action_env being consumed by some cc_library actions and not by others. See #12059

This allows for manually invalidating prior cache results when there are incompatible changes that Bazel doesn't handle. For example, changing the C++ compiler version. See bazelbuild/bazel#4558

cameron-martin · 2022-10-03T20:55:52Z

Since bazel can consume anything from the system by default, it would seem natural to add some kind of extra input to all build actions that represents the complete state of the system, for example the hash of a docker container or the commit sha of the scripts that provision the system. This is similar to what was suggested above, with using --action_env, but instead of choosing specific properties such as compiler version you pass a hash of the whole system.

However, this falls down when using remote execution because this is specified by the machine invoking bazel rather than the remote execution service. The machine invoking bazel has no way of knowing which environment the remote execution service will use to execute the action.

aaronmondal · 2023-05-08T16:43:46Z

The solution we're using in rules_ll is to wrap the entire build environment in nix and generate remote execution toolchains from that. We then replicate the RBE environment locally and can seamlessly switch between RBE and local builds and reuse the same cache.

Upsides:

- Lets us share caches between remote executors and local dev environments, since the remote execution container is built using the exact same tools as in the dev environment.
- It's really easy to update the toolchain containers - it's basically just a nix flake update and the latest toolchains are used (currently clang 16).
- Fully reproducible. Even across different systems as long as the CPU architecture is x86_64 and we're running on linux. Both the RBE container and the local devenv are binary-identical when built on e.g. a custom Gentoo and Ubuntu running in WSL2.
- Local builds don't require containerization, i.e. they build at native speed without overhead.

Downsides:

- Easy to use, but hard to modify. One needs to know their way around not only Bazel and remote execution, but also nix and the nix cc toolchains to customize behavior.
- Since the RBE images are completely custom, a container registry becomes a hard requirement. In our case we're using a local k8s cluster, but running a regular docker registry should also work.
- The RBE images are a bit larger than the "standard" bazel RBE images (~2.8 GB uncompressed, ~1.3 compressed in the registry).
- We don't have a good way to version the RBE images yet. Nix flakes don't let you customize "input parameters" for purity reasons. This makes it a bit tricky to tag a toolchain image as a non-latest version.
- If you need some non-bazel dependency from nixpkgs you'll need to modify the toolchain container and rebuild all toolchains. This is technically WAI, but there is probably a more elegant solution to this.

Implementation:

- The Bazel wrapper
- The RBE image
- The k8s cluster hosting a temporary container registry
- The command to run the toolchain rebuild

xkszltl · 2024-01-26T14:45:40Z

Does this only apply to the toolchain, or does it include external libs as well?
By external libs I mean:

- C++ headers
- -lxxx in linkopts
- Lib files imported into Bazel in whatever trackable way.

buchgr assigned philwo, buchgr and ulfjack Feb 1, 2018

buchgr added the P2 We'll consider working on this in future. (Assignee optional) label Feb 1, 2018

BenTheElder mentioned this issue Feb 8, 2018

more fixes for image consolidation / bazel caching kubernetes/test-infra#6713

Merged

BenTheElder mentioned this issue Feb 13, 2018

Implement Bazel Remote Caching [Tracking Issue] kubernetes/test-infra#6808

Closed

12 tasks

gkossakowski mentioned this issue Feb 27, 2018

Sharing Java outputs across platforms (Linux/Mac) #4714

Closed

buchgr added the remote caching label Mar 7, 2018

buchgr changed the title ~~Bazel doesn't track system libraries.~~ remote/correctness: Bazel doesn't track system libraries. Mar 14, 2018

hchauvin mentioned this issue May 25, 2018

Custom toolchain for R grailbio/rules_r#1

Closed

philwo removed their assignment Jul 19, 2018

buchgr mentioned this issue Aug 6, 2018

Remote Cache should be safe to share between environments #5743

Closed

This was referenced Sep 11, 2018

ci: report compiler version when setting up cache envoyproxy/envoy#4404

Closed

ci: track bazel remote cache related flakiness envoyproxy/envoy#4407

Closed

ulfjack removed their assignment Nov 20, 2018

medivh-x mentioned this issue Dec 22, 2018

Compiles with GCC 4.8.3 but find the headers under GCC 5 and reports an error #6971

Closed

buchgr added type: process team-Remote-Exec Issues and PRs for the Execution (Remote) team and removed category: http caching labels Jan 16, 2019

buchgr removed their assignment Jan 9, 2020

jamiesnape mentioned this issue Jun 12, 2020

Prevent false remote cache hits between builds on different operating systems RobotLocomotion/drake#13552

Closed

kragniz mentioned this issue Aug 20, 2020

Differentiate Action Cache objects by instance name buchgr/bazel-remote#148

Closed

mostynb mentioned this issue Aug 25, 2020

add option to mangle AC keys with (non-empty) instance names buchgr/bazel-remote#339

Merged

divanorama mentioned this issue Feb 8, 2021

[request] robust way to specify LD_LIBRARY_PATH for java/javac #12978

Closed

SanjayVas mentioned this issue Mar 11, 2021

Add cache-version input to bazel-build-test action. world-federation-of-advertisers/actions#8

Merged

jheidbrink mentioned this issue Apr 20, 2022

Investigate options for Bazel remote caches magma/magma#12458

Closed

milesdai mentioned this issue Sep 1, 2022

[ci] Ubuntu version mismatch between machines causes false cache hits lowRISC/opentitan#14695

Closed

cameron-martin mentioned this issue Oct 3, 2022

Can the remote cache be keyed based on the execution environment of the worker? buildfarm/buildfarm#1182

Closed
