-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
osx_cc_wrapper: add workaround to get_realpath for some bash versions #20212
osx_cc_wrapper: add workaround to get_realpath for some bash versions #20212
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
It looks like in some bash versions under some conditions with `set -e` bash may propagate sub-shell errors to parent shell, but it shouldn't as `set -E` is the setting about sub-shell error propagation. This manifests itself as `get_realpath` returning empty string. Let's add explicit `|| true` to `$(readlink ...)` sub-shell. The loop uses empty string as stop signal and should keep working for both affected and unaffected bash versions. EXTRA INFO Adding `set -x` to debug it reveals that `next` is being set to empty string but after that `echo "$previous"` isn't called. Which in turn calls `install_name_tool` with missing argument and fails the build. Example of the log https://hydra.nixos.org/build/240805256/nixlog/1 it uses `bazel-6.4.0` to run cpp tests from `bazelbuild/examples` in `x86_64-darwin` host. Example of local run with `set -x` ``` ++ get_otool_path bazel-out/darwin-fastbuild/bin/_solib_darwin/libexamples_Scpp_Slibhello-lib.dylib ++ get_realpath bazel-out/darwin-fastbuild/bin/_solib_darwin/libexamples_Scpp_Slibhello-lib.dylib ++ local previous=bazel-out/darwin-fastbuild/bin/_solib_darwin/libexamples_Scpp_Slibhello-lib.dylib ++ sed 's|^.*/bazel-out/|bazel-out/|' +++ readlink bazel-out/darwin-fastbuild/bin/_solib_darwin/libexamples_Scpp_Slibhello-lib.dylib ++ local next=/private/tmp/nix-build-bazel-6.4.0.drv-1/_bazel__nixbld1/ff905cebe41cc2b7000dbd8f60c2df58/execroot/io_bazel/bazel-out/darwin-fastbuild/bin/_solib_darwin/libexamples_Scpp_Slibhello-lib.dylib ++ '[' -n /private/tmp/nix-build-bazel-6.4.0.drv-1/_bazel__nixbld1/ff905cebe41cc2b7000dbd8f60c2df58/execroot/io_bazel/bazel-out/darwin-fastbuild/bin/_solib_darwin/libexamples_Scpp_Slibhello-lib.dylib ']' ++ previous=/private/tmp/nix-build-bazel-6.4.0.drv-1/_bazel__nixbld1/ff905cebe41cc2b7000dbd8f60c2df58/execroot/io_bazel/bazel-out/darwin-fastbuild/bin/_solib_darwin/libexamples_Scpp_Slibhello-lib.dylib +++ readlink /private/tmp/nix-build-bazel-6.4.0.drv-1/_bazel__nixbld1/ff905cebe41cc2b7000dbd8f60c2df58/execroot/io_bazel/bazel-out/darwin-fastbuild/bin/_solib_darwin/libexamples_Scpp_Slibhello-lib.dylib ++ next=/private/tmp/nix-build-bazel-6.4.0.drv-1/_bazel__nixbld1/ff905cebe41cc2b7000dbd8f60c2df58/execroot/io_bazel/bazel-out/darwin-fastbuild/bin/examples/cpp/libhello-lib.dylib ++ '[' -n /private/tmp/nix-build-bazel-6.4.0.drv-1/_bazel__nixbld1/ff905cebe41cc2b7000dbd8f60c2df58/execroot/io_bazel/bazel-out/darwin-fastbuild/bin/examples/cpp/libhello-lib.dylib ']' ++ previous=/private/tmp/nix-build-bazel-6.4.0.drv-1/_bazel__nixbld1/ff905cebe41cc2b7000dbd8f60c2df58/execroot/io_bazel/bazel-out/darwin-fastbuild/bin/examples/cpp/libhello-lib.dylib +++ readlink /private/tmp/nix-build-bazel-6.4.0.drv-1/_bazel__nixbld1/ff905cebe41cc2b7000dbd8f60c2df58/execroot/io_bazel/bazel-out/darwin-fastbuild/bin/examples/cpp/libhello-lib.dylib ++ next= + /nix/store/hfwdjfb4203x8rl4s6q565016hi3m0d3-cctools-llvm-11.1.0-973.0.1/bin/install_name_tool -change @loader_path/../../_solib_darwin//libexamples_Scpp_Slibhello-lib.dylib bazel-out/darwin-fastbuild/bin/examples/cpp/hello-success_test ``` note that `@loader_path...` should've been the 2nd argument to `-change` Local host bash version is ``` % bash -version GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin21) Copyright (C) 2007 Free Software Foundation, Inc. ``` Though notably aarch64 host with ``` % bash -version GNU bash, version 3.2.57(1)-release (arm64-apple-darwin22) Copyright (C) 2007 Free Software Foundation, Inc. ``` doesn't hit the issue, so it may be sensitive to some other things too. Bash changelogs do have mentions of bugs around `set -e` https://github.com/bminor/bash/blob/master/CHANGES though I haven't identified the exact version and bug entry.
2319ddc
to
3ff0859
Compare
Submitting upstream here bazelbuild/bazel#20212 Looks like with some bash versions and under some conditions sub-shell would propagate errors to parent shell under `set -e` even though it shouldn't. https://hydra.nixos.org/build/240805256/nixlog/1 https://hydra.nixos.org/build/240805170/nixlog/2 Was able to reproduce the Hydra issue on `x86_64-darwin` with `bazel_6`, but not with `aarch64-darwin` having same numeric bash version. ZHF: NixOS#265948
Is this a difference in behaviour between gnu |
Looks like that's more likely bash version or something other host-specific to blame, potentially some of recent system upgrades. function get_otool_path() {
# the lib path is the path of the original lib relative to the workspace
get_realpath $1 | sed 's|^.*/bazel-out/|bazel-out/|'
} returns empty string (and |
What's weird is that M1 system doesn't hit it but x86-64 does, despite showing same bash numeric version. |
UPDATE: script below actually uses fixed Generally
It does traverse it and print the final path
|
UPDATE: comment below isn't fully correct, diff uses Interesting, something really weird happens here diff --git a/tools/cpp/osx_cc_wrapper.sh.tpl b/tools/cpp/osx_cc_wrapper.sh.tpl
index 8264090c29..3620fffa77 100644
--- a/tools/cpp/osx_cc_wrapper.sh.tpl
+++ b/tools/cpp/osx_cc_wrapper.sh.tpl
@@ -76,14 +76,29 @@ function get_library_path() {
done
}
+
+echo $BASH_VERSION >&2
+set -o >&2
+shopt -o >&2
+PS4="+(prior-exit:$?) "
+trap "echo \"prior-exit:$?\" >&2" DEBUG
+trap "echo \"exit:$?\" >&2" EXIT
+set -x
+
+
# A convenient method to return the actual path even for non symlinks
# and multi-level symlinks.
function get_realpath() {
+ echo $BASH_VERSION >&2
+ set -o >&2
+ shopt -o >&2
+ trap "echo \"prior-exit:$?\" >&2" DEBUG
+ trap "echo \"exit:$?\" >&2" EXIT
local previous="$1"
local next=$(readlink "${previous}")
while [ -n "${next}" ]; do
previous="${next}"
- next=$(readlink "${previous}")
+ next=$(echo $BASH_VERSION >&2; set -o >&2; shopt -o >&2; trap "echo \"prior-exit:$?\" >&2" DEBUG; trap "echo \"exit:$?\" >&2" EXIT; readlink "${previous}" >&2 || true; readlink "${previous}")
done
echo "${previous}"
} with such debug added it seems that
Adding Failing run log ends with following
I think it'd be fair to draft the PR for now as it is too puzzling atm why the patch helps |
New run with more extensive debug prints shows that
So looks like we're back to bash issue hypothesis that for some reason decided to propagate subshell failure (maybe sporadic activation of diff --git a/tools/cpp/osx_cc_wrapper.sh.tpl b/tools/cpp/osx_cc_wrapper.sh.tpl
index 8264090c29..c2de847981 100644
--- a/tools/cpp/osx_cc_wrapper.sh.tpl
+++ b/tools/cpp/osx_cc_wrapper.sh.tpl
@@ -76,15 +76,33 @@ function get_library_path() {
done
}
+
+ulimit -a >&2
+echo $BASH_VERSION >&2
+set -o >&2
+shopt -o >&2
+PS4="+(prior-exit:$?) "
+trap "echo \"prior-exit:\$?\" >&2" DEBUG
+trap "echo \"exit:\$?\" >&2" EXIT
+set -x
+
+
# A convenient method to return the actual path even for non symlinks
# and multi-level symlinks.
function get_realpath() {
+ echo $BASH_VERSION >&2
+ set -o >&2
+ shopt -o >&2
+ trap "echo \"prior-exit:\$?\" >&2" DEBUG
+ trap "echo \"exit:\$?\" >&2" EXIT
local previous="$1"
local next=$(readlink "${previous}")
- while [ -n "${next}" ]; do
+ while echo "while start" >&2; [ -n "${next}" ]; do
previous="${next}"
- next=$(readlink "${previous}")
+ next=$(ulimit -a >&2; echo $BASH_VERSION >&2; set -o >&2; shopt -o >&2; trap "echo \"prior-exit:\$?\" >&2" DEBUG; trap "echo \"exit:\$?\" >&2" EXIT; readlink "${previous}" >&2 || true; readlink "${previous}"; echo "after readlink $?" >&2)
+ echo "after next" >&2
done
+ echo "after while" >&2
echo "${previous}"
} logs
|
And now I'm confused even more $ touch a
$ ln -s a b
$ bash -c -x 'set -eu; a=1; a=$(readlink a); echo ok'
+ set -eu
+ a=1
++ readlink a
+ a=
$ bash -c -x 'set -eu; function f() { local a=1; a=$(readlink a); }; f; echo ok'
+ set -eu
+ f
+ local a=1
++ readlink a
+ a=
$ bash -c -x 'set -e; function f() { local a=b; while [ -n "$a" ]; do a=$(readlink "$a"); done;}; f; echo ok'
+ set -e
+ f
+ local a=b
+ '[' -n b ']'
++ readlink b
+ a=a
+ '[' -n a ']'
++ readlink a
+ a=
$ bash -c -x 'set -e; a=1; a=$(false); echo ok'
+ set -e
+ a=1
++ false
+ a=
This seems to actually be broken everywhere $ bash --version
GNU bash, version 5.2.15(1)-release (x86_64-pc-linux-gnu)
% bash --version
GNU bash, version 3.2.57(1)-release (arm64-apple-darwin22)
% bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin21) But it seems that is was working sometimes on macos 🤔 |
The very first $ bash -c -x 'set -e; function f() { local a=$(readlink a); while [ -n "$a" ]; do a=$(readlink "$a"); done;}; f; echo ok'
+ set -e
+ f
++ readlink a
+ local a=
+ '[' -n '' ']'
+ echo ok
ok
$ bash -c -x 'set -e; function f() { local a=$(readlink b); while [ -n "$a" ]; do a=$(readlink "$a"); done;}; f; echo ok'
+ set -e
+ f
++ readlink b
+ local a=a
+ '[' -n a ']'
++ readlink a
+ a=
So refined hypothesis is it never worked with symlinks. But still not quite clear what determines presence of symlink hops and when did it change / why did it work most of the time so far. Or alternatively, it did work with very old bash version on macos but some of recent upgrades broke it (need to see if can get hold of older bash versions and confirm them vs macos upgrades to test) |
|
Turns out not exactly bazel or bash fault here. The problem is in NixOS shebang replacement that used non-binary interpreter and that isn't guaranteed to work (XNU kernel can deny execution of scripts and calling shell or |
It looks like in some bash versions under some conditions with
set -e
bash may propagate sub-shell errors to parent shell, but it shouldn't asset -E
is the setting about sub-shell error propagation.This manifests itself as
get_realpath
returning empty string.Let's add explicit
|| true
to$(readlink ...)
sub-shell. The loop uses empty string as stop signal and should keep working for both affected and unaffected bash versions.EXTRA INFO
Adding
set -x
to debug it reveals thatnext
is being set to empty string but after thatecho "$previous"
isn't called. Which in turn callsinstall_name_tool
with missing argument and fails the build.Example of the log
https://hydra.nixos.org/build/240805256/nixlog/1
it uses
bazel-6.4.0
to run cpp tests frombazelbuild/examples
inx86_64-darwin
host.Example of local run with
set -x
note that
@loader_path...
should've been the 2nd argument to-change
Local host bash version is
Though notably aarch64 host with
doesn't hit the issue, so it may be sensitive to some other things too.
Bash changelogs do have mentions of bugs around
set -e
https://github.com/bminor/bash/blob/master/CHANGES though I haven't identified the exact version and bug entry.