Failure in interaction between thrust code on main program and shared object #736

Closed
hwinkler opened this issue May 20, 2020 · 11 comments
Labels: thrust (For all items related to Thrust.)

@hwinkler

hwinkler commented May 20, 2020

I have a weird reproducible failure that occurs when I have a main program using Thrust, loading a Linux shared object library that also uses Thrust.

The error occurs only when I compile with -O0 or -O1.

Here's the main program, file test.cu

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cassert>

// external function from the shared lib:
thrust::device_vector<float> sortbug();

// The main program doesn't even call the function `ifMerelyPresentMainWillFail`.
// Yet if you omit this function definition, the assert in the main program 
// will succeed.
int ifMerelyPresentMainWillFail()
{
    const auto probs = thrust::device_vector<float>(99,0);

    return thrust::reduce(probs.begin(), probs.end());
}


int main (void)
{
    // sortbug() calls thrust::sort_by_key using some floats as keys.
    // It is supposed to sort the keys as a side effect. It returns the keys
    // here. The following assert fails when compiled with -O0 or -O1:

    thrust::device_vector<float> p = sortbug();
    for (int i=1; i<p.size(); i++) {
        assert(p[i] >= p[i-1]);
    }

    // At first I thought the reason the behavior only happens using -O0 or -O1
    // was related to the compiler optimizing away `ifMerelyPresentMainWillFail()`. 
    // However that seems not to be the case. If you uncomment the next line
    // the program still fails under -O0 and -O1.

    // return ifMerelyPresentMainWillFail();
    return 0;
}

It calls into a function, sortbug(), that lives in a separate .so file:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <array>

thrust::device_vector<float>  sortbug()
{
    std::array<float, 10> p = { 3.37868e-07, 2.29231e-05, 5.97704e-04, 6.05975e-03, 2.41730e-02,
                              3.82925e-02, 2.41730e-02, 6.05975e-03, 5.97704e-04, 2.29231e-05 };

    std::array<size_t, 10> idx = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
    thrust::device_vector<size_t> indexes(idx.begin(), idx.end());
    thrust::device_vector<float> probs(p.begin(), p.end());

    // Sort probs into ascending order; remember their original indexes
    thrust::sort_by_key(probs.begin(), probs.end(), indexes.begin());

    return probs;
}

If I elide the unused function ifMerelyPresentMainWillFail() from test.cu, the test program succeeds. Also if I compile everything using -O3 or -O2, it succeeds. And if I move the code in the shared object to the main program file, it all succeeds.

But if I don't do any of that, the program fails, though not in the same way in every case.

In some cases, the assert in test.cu fails: the sortbug() function ran, but did not sort the keys.

In other cases, the program throws a thrust::system::system_error:
transform: failed to synchronize: cudaErrorInvalidConfiguration: invalid configuration argument

Here's the Makefile:

NVCC=/usr/lib/cuda-10.2/bin/nvcc
CCBIN=-ccbin /usr/bin/g++-8

OPTIMIZATION=-O0


GENCODE= -gencode arch=compute_60,code=compute_60 -gencode arch=compute_60,code=sm_60 

COMPILE_OPTIONS= -std=c++14 -Xcompiler -fPIC -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA $(GENCODE) $(OPTIMIZATION)

all: clean test

lib: libbug.so

clean:
	rm -rf *.o *.so test


%.o: %.cu
	$(NVCC) $(CCBIN) $(COMPILE_OPTIONS) --compile -o  "$@" "$<"

libbug.so: sort.o
	$(NVCC) --cudart static -shared $(GENCODE) -link -o libbug.so sort.o

test: libbug.so test.o
	$(NVCC) --cudart static $(GENCODE) -L. -link -o test test.o -lbug
@hwinkler hwinkler changed the title Interaction between thrust code on main program and shared object Faiure in interaction between thrust code on main program and shared object May 20, 2020
@hwinkler
Author

Attached repro files: sortbug.tar.gz

@hwinkler hwinkler changed the title Faiure in interaction between thrust code on main program and shared object Failure in interaction between thrust code on main program and shared object May 20, 2020
@alliepiper
Collaborator

I'm not actually sure if this is supported, but I'll ask around internally. Dynamic linking with CUDA device code can be fickle, from what I remember.

IIRC, you may be able to work around this by statically linking all libraries that use CUDA together and then dynamically linking with the result, but it's been a while since I've run into this.
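
For illustration, one possible form of that workaround with the Makefile above could look like this (a sketch only; the test_static target is hypothetical and simply links the CUDA object files into one executable instead of going through libbug.so):

# Hypothetical workaround target (not part of the original Makefile):
# link test.o and sort.o into a single executable so that all CUDA device
# code ends up in one link unit, instead of being loaded from libbug.so.
test_static: sort.o test.o
	$(NVCC) --cudart static $(GENCODE) -link -o test_static test.o sort.o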

@hwinkler
Author

hwinkler commented May 21, 2020

We build a library into a .so and call it from Python -- I sure hope that's a supported profile.

We write the lib's unit tests in C++, and link to the built .so, and that's where we run into this problem. We could work around it I guess by linking statically to the object files. As long as the Python + .so use case is supported, that's OK.

@jrhemstad jrhemstad added the thrust For all items related to Thrust. label Feb 22, 2023
@elstehle
Collaborator

Thanks for reporting the issue, @hwinkler. This sounds similar to issue NVIDIA/cub#545, which was fixed in PR NVIDIA/cub#547. Could you check whether the issue has been resolved in the meantime?

@jarmak-nv jarmak-nv transferred this issue from NVIDIA/thrust Nov 8, 2023
@jrhemstad
Collaborator

@hwinkler this should be resolved by #443 which will be part of the upcoming 2.3 release. Please reopen the issue if you are still seeing problems.

@wrvsrx

wrvsrx commented Jan 9, 2024

@jrhemstad I can still reproduce this bug on b4d490b (branch/2.3.x), so maybe this issue should be reopened.

Here's the code for reproducing the issue, test.zip:

  1. put the code into the cccl directory
  2. run bash test.sh

@jrhemstad
Collaborator

Thanks @wrvsrx, @gevtushenko will look into it!

@gevtushenko gevtushenko self-assigned this Jan 9, 2024
@gevtushenko
Collaborator

@wrvsrx the commit you mentioned seems to be from the main branch rather than branch/2.3.x.
I had to add -Xcompiler "-fPIC" to the first compile command; otherwise I couldn't compile your reproducer using GCC 11.4.
On both main and branch/2.3.x, I'm unable to see any issues using your reproducer.
I tried using https://github.com/rapidsai/detect-weak-linking to detect linkage issues, and there are none.

If you can still see the issue, please add std::cout << THRUST_VERSION << std::endl; to the reproducer and attach the output with the error, along with the GPU you are running on, the driver version, the host compiler version, and the CUDA SDK version.
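
For reference, THRUST_VERSION is defined in thrust/version.h; a minimal sketch of printing it (not part of the attached reproducer) looks like this:

#include <thrust/version.h>
#include <iostream>

int main()
{
    // Print the Thrust version this translation unit was compiled against,
    // e.g. 200300 for Thrust 2.3.0.
    std::cout << "THRUST_VERSION: " << THRUST_VERSION << std::endl;
    return 0;
}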

@wrvsrx

wrvsrx commented Jan 10, 2024

Sorry, I tested it on main rather than branch/2.3.x.

  • gpu: GTX 1080ti
  • gpu driver: 545.29.06
  • cuda version: 12.3.52
  • host compiler: gcc 11.4.0
  • build/reproduce steps:
    1. download test2.zip and unzip it (I added code to dump THRUST_VERSION)
    2. place it into the cccl directory to get the following structure
├── CITATION.md
├── CMakeLists.txt
├── CMakePresets.json
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── SECURITY.md
├── benchmarks
├── build
├── ci
├── ci-overview.md
├── cmake
├── cub
├── docs
├── examples
├── lib
├── lib.cu     <- new file 
├── lib.cuh    <- new file
├── libcudacxx
├── main.cu    <- new file
├── test
├── test.sh    <- new file
└── thrust
    3. run bash test.sh
    4. output:
THRUST_VERSION: 200300
terminate called after throwing an instance of 'thrust::THRUST_200300_520_NS::system::system_error'
  what():  after dispatching exclusive_scan kernel: cudaErrorInvalidDeviceFunction: invalid device function
test.sh: line 3: 33763 Aborted                 (core dumped) ./main
  • raw terminal output:
❯ nvidia-smi
Wed Jan 10 08:57:32 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off | 00000000:67:00.0  On |                  N/A |
| 24%   42C    P8              20W / 250W |    278MiB / 11264MiB |     15%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2398      G   ...cgy4lr4r1-xorg-server-21.1.10/bin/X      154MiB |
|    0   N/A  N/A      3153      G   ...seed-version=20240107-180120.236000       97MiB |
|    0   N/A  N/A      3431      G   ...712-072601-f4abf8fd/bin/wezterm-gui       21MiB |
+---------------------------------------------------------------------------------------+
❯ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
❯ gcc --version
gcc (GCC) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

❯ LD_LIBRARY_PATH=/run/opengl-driver/lib bash test.sh
THRUST_VERSION: 200300
terminate called after throwing an instance of 'thrust::THRUST_200300_520_NS::system::system_error'
  what():  after dispatching exclusive_scan kernel: cudaErrorInvalidDeviceFunction: invalid device function
test.sh: line 3: 34201 Aborted                 (core dumped) ./main

@gevtushenko
Collaborator

@wrvsrx thank you for the info. I'm unable to reproduce the issue using an equivalent setup. A few follow-up questions:

  1. Can you still see the issue when specifying the architecture --generate-code=arch=compute_61,code=[compute_61,sm_61]?
  2. Is there an issue without LD_LIBRARY_PATH=/run/opengl-driver/lib when launching bash test.sh?
  3. Does the weak-linkage detector see anything wrong on your end? python3 detect-weak-linking/detect.py -m main libtest.so

@wrvsrx
Copy link

wrvsrx commented Jan 16, 2024

@gevtushenko

I tested this behavior on some other systems and found that the problem only happens when I create the environment using Nix. When I use Nix to create the environment on Ubuntu 22.04 or NixOS, I can reproduce the problem. However, if I use the nvcc installed globally on Ubuntu, the problem disappears. Maybe I should report the problem to the NixOS CUDA team. Sorry for taking up your time.

For your information, I can still see this issue even after specifying the architecture. Setting LD_LIBRARY_PATH is necessary to run GPU programs on NixOS. And the weak-linkage detector doesn't report anything wrong.

If you are interested in testing this problem in a Nix environment, here are the steps (also consolidated into a single shell sequence after the list):

  1. install Nix and direnv
  2. enable Nix flakes by adding experimental-features = nix-command flakes to /etc/nix/nix.conf
  3. eval "$(direnv hook bash)" (assuming you use bash; otherwise refer to https://direnv.net/docs/hook.html)
  4. git clone https://github.com/wrvsrx/cccl.git
  5. cd cccl
  6. git checkout reproduce-link-problem
  7. direnv allow. This sets up the environment to build with the nvcc and gcc installed by Nix.
  8. NIXPKGS_ALLOW_UNFREE=1 nix run --impure github:guibou/nixGL -- bash test.sh. This sets the driver-related environment variables needed to run GPU programs on a non-NixOS system.
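
For convenience, steps 3 through 8 as a single shell sequence (a sketch that assumes Nix, direnv, and bash are already installed and flakes are enabled):

eval "$(direnv hook bash)"
git clone https://github.com/wrvsrx/cccl.git
cd cccl
git checkout reproduce-link-problem
direnv allow    # environment with the Nix-provided nvcc and gcc
NIXPKGS_ALLOW_UNFREE=1 nix run --impure github:guibou/nixGL -- bash test.sh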
