Failure in interaction between thrust code on main program and shared object #736

Closed
hwinkler opened this issue May 20, 2020 · 11 comments
Labels: thrust (For all items related to Thrust.)

@hwinkler

hwinkler commented May 20, 2020

I have a weird reproducible failure that occurs when I have a main program using Thrust, loading a Linux shared object library that also uses Thrust.

The error occurs only when I compile with -O0 or -O1.

Here's the main program, file test.cu

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cassert>

// external function from the shared lib:
thrust::device_vector<float> sortbug();

// The main program doesn't even call the function `ifMerelyPresentMainWillFail`.
// Yet if you omit this function definition, the assert in the main program 
// will succeed.
int ifMerelyPresentMainWillFail()
{
    const auto probs = thrust::device_vector<float>(99,0);

    return thrust::reduce(probs.begin(), probs.end());
}


int main (void)
{
    // sortbug() calls thrust::sort_by_key using some floats as keys.
    // It is supposed to sort the keys as a side effect. It returns the keys
    // here. The following assert fails when compiled with -O0 or -O1:

    thrust::device_vector<float> p = sortbug();
    for (int i=1; i<p.size(); i++) {
        assert(p[i] >= p[i-1]);
    }

    // At first I thought the reason the behavior only happens using -O0 or -O1
    // was related to the compiler optimizing away `ifMerelyPresentMainWillFail()`. 
    // However that seems not to be the case. If you uncomment the next line
    // the program still fails under -O0 and -O1.

    // return ifMerelyPresentMainWillFail();
    return 0;
}

It calls into a function, sortbug(), that lives in a separate .so file:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <array>

thrust::device_vector<float>  sortbug()
{
    std::array<float, 10> p = { 3.37868e-07, 2.29231e-05, 5.97704e-04, 6.05975e-03, 2.41730e-02,
                              3.82925e-02, 2.41730e-02, 6.05975e-03, 5.97704e-04, 2.29231e-05 };

    std::array<size_t, 10> idx = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
    thrust::device_vector<size_t> indexes(idx.begin(), idx.end());
    thrust::device_vector<float> probs(p.begin(), p.end());

    // Sort probs into ascending order; remember their original indexes
    thrust::sort_by_key(probs.begin(), probs.end(), indexes.begin());

    return probs;
}

If I elide the unused function ifMerelyPresentMainWillFail() from test.cu, the test program succeeds. Also if I compile everything using -O3 or -O2, it succeeds. And if I move the code in the shared object to the main program file, it all succeeds.

But if I don't do any of that, the program fails, though not in the same way in every case.

In some cases, the assert in test.cu fails: the sortbug() function ran, but did not sort the keys.

In other cases, the program throws a thrust::system::system_error:
transform: failed to synchronize: cudaErrorInvalidConfiguration: invalid configuration argument

Here's the Makefile:

NVCC=/usr/lib/cuda-10.2/bin/nvcc
CCBIN=-ccbin /usr/bin/g++-8

OPTIMIZATION=-O0


GENCODE= -gencode arch=compute_60,code=compute_60 -gencode arch=compute_60,code=sm_60 

COMPILE_OPTIONS= -std=c++14 -Xcompiler -fPIC -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA $(GENCODE) $(OPTIMIZATION)

all: clean test

lib: libbug.so

clean:
	rm -rf *.o *.so test


%.o: %.cu
	$(NVCC) $(CCBIN) $(COMPILE_OPTIONS) --compile -o  "$@" "$<"

libbug.so: sort.o
	$(NVCC) --cudart static -shared $(GENCODE) -link -o libbug.so sort.o

test: libbug.so test.o
	$(NVCC) --cudart static $(GENCODE) -L. -link -o test test.o -lbug
@hwinkler hwinkler changed the title Interaction between thrust code on main program and shared object Faiure in interaction between thrust code on main program and shared object May 20, 2020
@hwinkler
Author

Attached repro files: sortbug.tar.gz

@hwinkler hwinkler changed the title Faiure in interaction between thrust code on main program and shared object Failure in interaction between thrust code on main program and shared object May 20, 2020
@alliepiper
Collaborator

I'm not actually sure if this is supported, but I'll ask around internally. Dynamic linking with CUDA device code can be fickle, from what I remember.

IIRC, you may be able to work around this by statically linking all libraries that use CUDA together and then dynamically linking with the result, but it's been a while since I've run into this.
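
For illustration, one possible form of that workaround with the Makefile above could look like this (a sketch only; the test_static target is hypothetical and simply links the CUDA object files into one executable instead of going through libbug.so):

# Hypothetical workaround target (not part of the original Makefile):
# link test.o and sort.o into a single executable so that all CUDA device
# code ends up in one link unit, instead of being loaded from libbug.so.
test_static: sort.o test.o
	$(NVCC) --cudart static $(GENCODE) -link -o test_static test.o sort.o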

@hwinkler
Author

hwinkler commented May 21, 2020

We build a library into a .so and call it from Python -- I sure hope that's a supported profile.

We write the lib's unit tests in C++, and link to the built .so, and that's where we run into this problem. We could work around it I guess by linking statically to the object files. As long as the Python + .so use case is supported, that's OK.

@jrhemstad jrhemstad added the thrust For all items related to Thrust. label Feb 22, 2023
@elstehle
Collaborator

Thanks for reporting the issue, @hwinkler. This sounds similar to issue NVIDIA/cub#545, which was fixed in PR NVIDIA/cub#547. Could you check whether the issue has been resolved in the meantime?

@jarmak-nv jarmak-nv transferred this issue from NVIDIA/thrust Nov 8, 2023
@jrhemstad
Collaborator

@hwinkler this should be resolved by #443 which will be part of the upcoming 2.3 release. Please reopen the issue if you are still seeing problems.

@wrvsrx

wrvsrx commented Jan 9, 2024

@jrhemstad I can still reproduce this bug on b4d490b (branch/2.3.x), so maybe this issue should be reopened.

Here's the code for reproducing the issue, test.zip:

  1. put the code into the cccl directory
  2. run bash test.sh

@jrhemstad
Collaborator

Thanks @wrvsrx, @gevtushenko will look into it!

@gevtushenko gevtushenko self-assigned this Jan 9, 2024
@gevtushenko
Collaborator

@wrvsrx the commit you mentioned seems to be from the main branch rather than branch/2.3.x.
I had to add -Xcompiler "-fPIC" to the first compile command; otherwise I couldn't compile your reproducer using GCC 11.4.
On both main and branch/2.3.x, I'm unable to see any issues using your reproducer.
I tried using https://github.com/rapidsai/detect-weak-linking to detect linkage issues, and there are none.

If you can still see the issue, please add std::cout << THRUST_VERSION << std::endl; to the reproducer and attach the output with the error, along with the GPU you are running on, the driver version, the host compiler version, and the CUDA SDK version.
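
For reference, THRUST_VERSION is defined in thrust/version.h; a minimal sketch of printing it (not part of the attached reproducer) looks like this:

#include <thrust/version.h>
#include <iostream>

int main()
{
    // Print the Thrust version this translation unit was compiled against,
    // e.g. 200300 for Thrust 2.3.0.
    std::cout << "THRUST_VERSION: " << THRUST_VERSION << std::endl;
    return 0;
}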

@wrvsrx

wrvsrx commented Jan 10, 2024

Sorry, I tested it on main rather than branch/2.3.x.

  • gpu: GTX 1080ti
  • gpu driver: 545.29.06
  • cuda version: 12.3.52
  • host compiler: gcc 11.4.0
  • build/reproduce steps:
    1. download test2.zip and unzip it (I added code to dump THRUST_VERSION)
    2. place it into the cccl directory to get the following structure
├── CITATION.md
├── CMakeLists.txt
├── CMakePresets.json
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── SECURITY.md
├── benchmarks
├── build
├── ci
├── ci-overview.md
├── cmake
├── cub
├── docs
├── examples
├── lib
├── lib.cu     <- new file 
├── lib.cuh    <- new file
├── libcudacxx
├── main.cu    <- new file
├── test
├── test.sh    <- new file
└── thrust
    3. run bash test.sh
    4. output:
THRUST_VERSION: 200300
terminate called after throwing an instance of 'thrust::THRUST_200300_520_NS::system::system_error'
  what():  after dispatching exclusive_scan kernel: cudaErrorInvalidDeviceFunction: invalid device function
test.sh: line 3: 33763 Aborted                 (core dumped) ./main
  • raw terminal output:
❯ nvidia-smi
Wed Jan 10 08:57:32 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off | 00000000:67:00.0  On |                  N/A |
| 24%   42C    P8              20W / 250W |    278MiB / 11264MiB |     15%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2398      G   ...cgy4lr4r1-xorg-server-21.1.10/bin/X      154MiB |
|    0   N/A  N/A      3153      G   ...seed-version=20240107-180120.236000       97MiB |
|    0   N/A  N/A      3431      G   ...712-072601-f4abf8fd/bin/wezterm-gui       21MiB |
+---------------------------------------------------------------------------------------+
❯ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
❯ gcc --version
gcc (GCC) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

❯ LD_LIBRARY_PATH=/run/opengl-driver/lib bash test.sh
THRUST_VERSION: 200300
terminate called after throwing an instance of 'thrust::THRUST_200300_520_NS::system::system_error'
  what():  after dispatching exclusive_scan kernel: cudaErrorInvalidDeviceFunction: invalid device function
test.sh: line 3: 34201 Aborted                 (core dumped) ./main

@gevtushenko
Collaborator

@wrvsrx thank you for the info. I'm unable to reproduce the issue using an equivalent setup. A few follow-up questions:

  1. Can you still see the issue when specifying the architecture --generate-code=arch=compute_61,code=[compute_61,sm_61]?
  2. Is there an issue without LD_LIBRARY_PATH=/run/opengl-driver/lib when launching bash test.sh?
  3. Does the weak-linkage detector see anything wrong on your end? python3 detect-weak-linking/detect.py -m main libtest.so

@wrvsrx
Copy link

wrvsrx commented Jan 16, 2024

@gevtushenko

I tested this behavior on some other systems and found that the problem only happens when I create the environment using Nix. When I use Nix to create the environment on Ubuntu 22.04 or NixOS, I can reproduce the problem. However, if I use the nvcc installed globally on Ubuntu, the problem disappears. Maybe I should report the problem to the NixOS CUDA team. Sorry for taking up your time.

For your information, I can still see this issue even after specifying the architecture. Setting LD_LIBRARY_PATH is necessary to run GPU programs on NixOS. And the weak-linkage detector doesn't report anything wrong.

If you are interested in testing this problem in a Nix environment, here are the steps (also consolidated into a single shell sequence after the list):

  1. install Nix and direnv
  2. enable Nix flakes by adding experimental-features = nix-command flakes to /etc/nix/nix.conf
  3. eval "$(direnv hook bash)" (assuming you use bash; otherwise refer to https://direnv.net/docs/hook.html)
  4. git clone https://github.com/wrvsrx/cccl.git
  5. cd cccl
  6. git checkout reproduce-link-problem
  7. direnv allow. This sets up the environment to build with the nvcc and gcc installed by Nix.
  8. NIXPKGS_ALLOW_UNFREE=1 nix run --impure github:guibou/nixGL -- bash test.sh. This sets the driver-related environment variables needed to run GPU programs on a non-NixOS system.
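
For convenience, steps 3 through 8 as a single shell sequence (a sketch that assumes Nix, direnv, and bash are already installed and flakes are enabled):

eval "$(direnv hook bash)"
git clone https://github.com/wrvsrx/cccl.git
cd cccl
git checkout reproduce-link-problem
direnv allow    # environment with the Nix-provided nvcc and gcc
NIXPKGS_ALLOW_UNFREE=1 nix run --impure github:guibou/nixGL -- bash test.sh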
