
Linking step fails with undefined symbols. #299

Open
CarloWood opened this issue Oct 30, 2023 · 4 comments
@CarloWood

After two hours of compiling, the linking step fails! :(

daniel:~/projects/machine-learning/tensorflow_cc/tensorflow_cc/tensorflow_cc/build>make
[ 12%] Performing build step for 'tensorflow_base'
CUDA support enabled
find: ‘/opt/chroots/linuxviewer20230118/root/var/db/sudo’: Permission denied
find: ‘/opt/chroots/linuxviewer20230118/root/var/cache/ldconfig’: Permission denied
find: ‘/opt/chroots/linuxviewer20230118/root/var/cache/private’: Permission denied
find: ‘/opt/chroots/linuxviewer20230118/root/var/log/audit’: Permission denied
find: ‘/opt/chroots/linuxviewer20230118/root/var/log/private’: Permission denied
find: ‘/opt/chroots/linuxviewer20230118/root/var/lib/machines’: Permission denied
find: ‘/opt/chroots/linuxviewer20230118/root/var/lib/portables’: Permission denied
...long list...
find: ‘/usr/local/lost+found’: Permission denied
find: ‘/usr/lost+found’: Permission denied
find: ‘/usr/share/polkit-1/rules.d’: Permission denied
TF_NCCL_VERSION=""   <-- I added these to show that the find doesn't even find anything.
TF_CUDNN_VERSION=""
You have bazel 6.3.2 installed.
Found CUDA 12.2 in:
    /opt/cuda/targets/x86_64-linux/lib
    /opt/cuda/targets/x86_64-linux/include
Found cuDNN 8 in:
    /usr/lib
    /usr/include


Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
        --config=mkl            # Build with MKL support.
        --config=mkl_aarch64    # Build with oneDNN and Compute Library for the Arm Architecture (ACL).
        --config=monolithic     # Config for mostly static monolithic build.
        --config=numa           # Build with NUMA support.
        --config=dynamic_kernels        # (Experimental) Build kernels into separate shared objects.
        --config=v1             # Build with TensorFlow 1 API instead of TF 2 API.
Preconfigured Bazel build configs to DISABLE default on features:
        --config=nogcp          # Disable GCP support.
        --config=nonccl         # Disable NVIDIA NCCL support.
Configuration finished

and then

Starting local Bazel server and connecting to it...
WARNING: while reading option defaults file '/usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc':
  invalid command name 'startup:windows'.
WARNING: The following configs were expanded more than once: [cuda]. For repeatable flags, repeats are counted twice and may lead to unexpected behavior.
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=145
INFO: Reading rc options for 'build' from /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc:
  Inherited 'common' options: --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc:
  'build' options: --define framework_shared_object=true --define tsl_protobuf_header_only=true --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --features=-force_no_whole_archive --enable_platform_specific_config --define=with_xla_support=true --config=short_logs --config=v2 --define=no_aws_support=true --define=no_hdfs_support=true --experimental_cc_shared_library --experimental_link_static_libraries_once=false --incompatible_enforce_config_setting_visibility
INFO: Reading rc options for 'build' from /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.tf_configure.bazelrc:
  'build' options: --action_env PYTHON_BIN_PATH=/usr/bin/python3 --action_env PYTHON_LIB_PATH=/usr/lib/python3.11/site-packages --python_path=/usr/bin/python3 --action_env TF_CUDA_VERSION=12.2 --action_env TF_CUDNN_VERSION= --action_env TF_NCCL_VERSION= --action_env TF_CUDA_PATHS=/opt/cuda-12.2,/opt/cuda,/usr/local/cuda-12.2,/usr/local/cuda,/usr/local,/usr/cuda-12.2,/usr/cuda,/usr --action_env CUDA_TOOLKIT_PATH=/opt/cuda --action_env NCCL_INSTALL_PATH=/usr --action_env TF_CUDA_COMPUTE_CAPABILITIES=sm_52,sm_53,sm_60,sm_61,sm_62,sm_70,sm_72,sm_75,sm_80,sm_86,compute_86 --action_env GCC_HOST_COMPILER_PATH=/usr/bin/gcc-11 --config=cuda
INFO: Found applicable config definition build:short_logs in file /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:cuda in file /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda
INFO: Found applicable config definition build:opt in file /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.tf_configure.bazelrc: --copt=-march=haswell --host_copt=-march=haswell
INFO: Found applicable config definition build:monolithic in file /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc: --define framework_shared_object=false --define tsl_protobuf_header_only=false --experimental_link_static_libraries_once=false
INFO: Found applicable config definition build:cuda in file /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda
INFO: Found applicable config definition build:linux in file /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc: --host_copt=-w --copt=-Wno-all --copt=-Wno-extra --copt=-Wno-deprecated --copt=-Wno-deprecated-declarations --copt=-Wno-ignored-attributes --copt=-Wno-array-bounds --copt=-Wunused-result --copt=-Werror=unused-result --copt=-Wswitch --copt=-Werror=switch --copt=-Wno-error=unused-but-set-variable --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --config=dynamic_kernels --experimental_guard_against_concurrent_changes
INFO: Found applicable config definition build:dynamic_kernels in file /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
WARNING: while reading option defaults file '/usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc':
  invalid command name 'startup:windows'.
WARNING: The following configs were expanded more than once: [cuda]. For repeatable flags, repeats are counted twice and may lead to unexpected behavior.
INFO: Analyzed 2 targets (471 packages loaded, 38435 targets configured).
INFO: Found 2 targets...

I just ran it again, so everything was already compiled and we go straight to linking again:

ERROR: /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/tensorflow/BUILD:1291:21: Linking tensorflow/libtensorflow_cc.so.2.14.0 failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command (from target //tensorflow:libtensorflow_cc.so.2.14.0) external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc @bazel-out/k8-opt/bin/tensorflow/libtensorflow_cc.so.2.14.0-2.params
/opt/home_carlo/dot_cache/bazel/_bazel_carlo/01a1b20f96784390f57aac7671723885/execroot/org_tensorflow/external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc:44: DeprecationWarning: 'pipes' is deprecated and slated for removal in Python 3.13
  import pipes
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/libgpu_nextafter_op.pic.lo(gpu_op_next_after.pic.o): in function `tensorflow::(anonymous namespace)::MlirNextAfterGPUDT_FLOATDT_FLOATOp::Invoke(tensorflow::OpKernelContext*, llvm::SmallVectorImpl<tensorflow::UnrankedMemRef>&)':
gpu_op_next_after.cc:(.text._ZN10tensorflow12_GLOBAL__N_134MlirNextAfterGPUDT_FLOATDT_FLOATOp6InvokeEPNS_15OpKernelContextERN4llvm15SmallVectorImplINS_14UnrankedMemRefEEE+0x14): undefined reference to `_mlir_ciface_NextAfter_GPU_DT_FLOAT_DT_FLOAT'
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/libgpu_nextafter_op.pic.lo(gpu_op_next_after.pic.o): in function `tensorflow::(anonymous namespace)::MlirNextAfterGPUDT_DOUBLEDT_DOUBLEOp::Invoke(tensorflow::OpKernelContext*, llvm::SmallVectorImpl<tensorflow::UnrankedMemRef>&)':
gpu_op_next_after.cc:(.text._ZN10tensorflow12_GLOBAL__N_136MlirNextAfterGPUDT_DOUBLEDT_DOUBLEOp6InvokeEPNS_15OpKernelContextERN4llvm15SmallVectorImplINS_14UnrankedMemRefEEE+0x14): undefined reference to `_mlir_ciface_NextAfter_GPU_DT_DOUBLE_DT_DOUBLE'
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/libgpu_nextafter_op.pic.lo(gpu_op_next_after.pic.o): in function `tensorflow::MLIROpKernel<(tensorflow::DataType)1, float, (tensorflow::DataType)1>::Compute(tensorflow::OpKernelContext*)':
gpu_op_next_after.cc:(.text._ZN10tensorflow12MLIROpKernelILNS_8DataTypeE1EfLS1_1EE7ComputeEPNS_15OpKernelContextE[_ZN10tensorflow12MLIROpKernelILNS_8DataTypeE1EfLS1_1EE7ComputeEPNS_15OpKernelContextE]+0x1bc): undefined reference to `_mlir_ciface_NextAfter_GPU_DT_FLOAT_DT_FLOAT'
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/libgpu_nextafter_op.pic.lo(gpu_op_next_after.pic.o): in function `tensorflow::MLIROpKernel<(tensorflow::DataType)2, double, (tensorflow::DataType)2>::Compute(tensorflow::OpKernelContext*)':
gpu_op_next_after.cc:(.text._ZN10tensorflow12MLIROpKernelILNS_8DataTypeE2EdLS1_2EE7ComputeEPNS_15OpKernelContextE[_ZN10tensorflow12MLIROpKernelILNS_8DataTypeE2EdLS1_2EE7ComputeEPNS_15OpKernelContextE]+0x1bc): undefined reference to `_mlir_ciface_NextAfter_GPU_DT_DOUBLE_DT_DOUBLE'
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/libgpu_relu_op.pic.lo(gpu_op_elu.pic.o): in function `tensorflow::(anonymous namespace)::MlirEluGPUDT_HALFDT_HALFOp::Invoke(tensorflow::OpKernelContext*, llvm::SmallVectorImpl<tensorflow::UnrankedMemRef>&)':
gpu_op_elu.cc:(.text._ZN10tensorflow12_GLOBAL__N_126MlirEluGPUDT_HALFDT_HALFOp6InvokeEPNS_15OpKernelContextERN4llvm15SmallVectorImplINS_14UnrankedMemRefEEE+0x10): undefined reference to `_mlir_ciface_Elu_GPU_DT_HALF_DT_HALF'
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/libgpu_relu_op.pic.lo(gpu_op_elu.pic.o): in function `tensorflow::(anonymous namespace)::MlirEluGPUDT_FLOATDT_FLOATOp::Invoke(tensorflow::OpKernelContext*, llvm::SmallVectorImpl<tensorflow::UnrankedMemRef>&)':
gpu_op_elu.cc:(.text._ZN10tensorflow12_GLOBAL__N_128MlirEluGPUDT_FLOATDT_FLOATOp6InvokeEPNS_15OpKernelContextERN4llvm15SmallVectorImplINS_14UnrankedMemRefEEE+0x10): undefined reference to `_mlir_ciface_Elu_GPU_DT_FLOAT_DT_FLOAT'
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/libgpu_relu_op.pic.lo(gpu_op_elu.pic.o): in function `tensorflow::(anonymous namespace)::MlirEluGPUDT_DOUBLEDT_DOUBLEOp::Invoke(tensorflow::OpKernelContext*, llvm::SmallVectorImpl<tensorflow::UnrankedMemRef>&)':
gpu_op_elu.cc:(.text._ZN10tensorflow12_GLOBAL__N_130MlirEluGPUDT_DOUBLEDT_DOUBLEOp6InvokeEPNS_15OpKernelContextERN4llvm15SmallVectorImplINS_14UnrankedMemRefEEE+0x10): undefined reference to `_mlir_ciface_Elu_GPU_DT_DOUBLE_DT_DOUBLE'
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/libgpu_relu_op.pic.lo(gpu_op_elu.pic.o): in function `tensorflow::MLIROpKernel<(tensorflow::DataType)19, Eigen::half, (tensorflow::DataType)19>::Compute(tensorflow::OpKernelContext*)':
gpu_op_elu.cc:(.text._ZN10tensorflow12MLIROpKernelILNS_8DataTypeE19EN5Eigen4halfELS1_19EE7ComputeEPNS_15OpKernelContextE[_ZN10tensorflow12MLIROpKernelILNS_8DataTypeE19EN5Eigen4halfELS1_19EE7ComputeEPNS_15OpKernelContextE]+0x1b8): undefined reference to `_mlir_ciface_Elu_GPU_DT_HALF_DT_HALF'
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/libgpu_relu_op.pic.lo(gpu_op_relu.pic.o): in function `tensorflow::(anonymous namespace)::MlirReluGPUDT_HALFDT_HALFOp::Invoke(tensorflow::OpKernelContext*, llvm::SmallVectorImpl<tensorflow::UnrankedMemRef>&)':
gpu_op_relu.cc:(.text._ZN10tensorflow12_GLOBAL__N_127MlirReluGPUDT_HALFDT_HALFOp6InvokeEPNS_15OpKernelContextERN4llvm15SmallVectorImplINS_14UnrankedMemRefEEE+0x10): undefined reference to `_mlir_ciface_Relu_GPU_DT_HALF_DT_HALF'
... and so on (very very long list)...
gpu_op_zeta.cc:(.text._ZN10tensorflow12_GLOBAL__N_131MlirZetaGPUDT_DOUBLEDT_DOUBLEOp6InvokeEPNS_15OpKernelContextERN4llvm15SmallVectorImplINS_14UnrankedMemRefEEE+0x14): undefined reference to `_mlir_ciface_Zeta_GPU_DT_DOUBLE_DT_DOUBLE'
collect2: error: ld returned 1 exit status
INFO: Elapsed time: 60.283s, Critical Path: 52.06s
INFO: 2 processes: 2 internal.
FAILED: Build did NOT complete successfully
make[2]: *** [CMakeFiles/tensorflow_base.dir/build.make:87: tensorflow-stamp/tensorflow_base-build] Error 1
make[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/tensorflow_base.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

Can you please give me a hint, or ask me to test something?

Note that I made the following change:

diff --git a/tensorflow_cc/PROJECT_VERSION b/tensorflow_cc/PROJECT_VERSION
index c8e38b6..edcfe40 100644
--- a/tensorflow_cc/PROJECT_VERSION
+++ b/tensorflow_cc/PROJECT_VERSION
@@ -1 +1 @@
-2.9.0
+2.14.0

This is the only thing I changed.

@CarloWood

All 632 (unique) symbols that are undefined start with _mlir_ciface_*.

@CarloWood

All 649 error lines containing 'undefined reference to' are of the following form:

^gpu_op_[a-z0-9_]*\.cc:\(\.text\._Z[^+]*\+0x[0-9a-f]*\): undefined reference to `_mlir_ciface_[A-Za-z0-9_]*'$

showing that all undefined references come from files with names of the form gpu_op_[a-z0-9_]*.cc, all of which exist exclusively in build/tensorflow/tensorflow/core/kernels/mlir_generated/.

196 of the errors are generated from gpu_op_cast.cc (the second one is gpu_op_relu.cc with 17 errors).

The only files with just a single error are gpu_op_logical_and.cc, gpu_op_logical_not.cc and gpu_op_logical_or.cc. These three files each use GENERATE_BINARY_GPU_KERNEL and REGISTER_GPU_KERNEL_NO_TYPE_CONSTRAINT exactly once.

From this it seems that either GENERATE_BINARY_GPU_KERNEL and GENERATE_UNARY_GPU_KERNEL, or REGISTER_GPU_KERNEL_NO_TYPE_CONSTRAINT, produces an error.

The files that generate two errors are: gpu_op_angle.cc, gpu_op_complex_abs.cc, gpu_op_complex.cc, gpu_op_conj.cc, gpu_op_imag.cc, gpu_op_polygamma.cc, gpu_op_real.cc and gpu_op_zeta.cc.

From this it seems that an error is produced by
REGISTER_COMPLEX_GPU_KERNEL,
GENERATE_AND_REGISTER_UNARY_GPU_KERNEL and
GENERATE_AND_REGISTER_BINARY_GPU_KERNEL.

To make a long story short, the problem seems to come from macros built on the MLIR_FUNCTION macro defined in tensorflow/tensorflow/core/kernels/mlir_generated/base_op.h:

#define MLIR_FUNCTION(tf_op, platform, input_type, output_type) \
  _mlir_ciface_##tf_op##_##platform##_##input_type##_##output_type

and in particular from GENERATE_UNARY_KERNEL3, GENERATE_BINARY_KERNEL3 and GENERATE_TERNARY_KERNEL3, which are more or less similar, so let's just look at one:

#define GENERATE_UNARY_KERNEL3(tf_op, platform, input_type, output_type, casted_input_type, casted_output_type)

which produces code like (I did some formatting):

extern "C" void MLIR_FUNCTION(tf_op, platform, input_type, output_type)  // <-- Undefined reference.
    (UnrankedMemRef* result, OpKernelContext* ctx, UnrankedMemRef* arg);

namespace {

class MLIR_OP(tf_op, platform, casted_input_type, casted_output_type)
    : public MLIROpKernel<output_type, typename EnumToDataType<output_type>::Type, casted_output_type>
{
 public:
  using MLIROpKernel::MLIROpKernel;

  UnrankedMemRef Invoke(OpKernelContext* ctx, llvm::SmallVectorImpl<UnrankedMemRef>& args) override
  {
    UnrankedMemRef result;
    MLIR_FUNCTION(tf_op, platform, input_type, output_type)(&result, ctx, &args[0]);  // <-- Undefined reference.
    return result;
  }
};

} // namespace

@CarloWood

I found out that it is an upstream problem. As of 2.14 they no longer link in the (634 generated) bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/lib*_kernel_generator.pic.a archives.

@CarloWood

If you use bazel 6.1.0 it works. Then something else breaks, but this is a monologue anyway. Goodbye.
