
RPATHs are wrong. #2

Closed
isuruf opened this issue Nov 14, 2023 · 15 comments · Fixed by #5 · May be fixed by #3
Labels
bug Something isn't working

Comments

@isuruf
Member

isuruf commented Nov 14, 2023

Solution to issue cannot be found in the documentation.

  • I checked the documentation.

Issue

Since the actual library is in $PREFIX/targets/x86_64-linux/lib, the RPATH is set to $ORIGIN/../../../lib.

A fix for this is to reverse the links, i.e. make the actual library $PREFIX/lib/libnvToolsExt.so.1.0.0 and make $PREFIX/targets/x86_64-linux/lib/libnvToolsExt.so.1.0.0 a symlink.

Having the wrong RPATH leads to the wrong libraries being loaded.
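
For illustration, a quick check of the shipped library shows the mismatch (the RPATH value is the one reported above; the exact symlink target is an assumption about the current package layout):

readelf -d $PREFIX/lib/libnvToolsExt.so.1 | grep -iE 'rpath|runpath'
#   (RPATH/RUNPATH)   Library runpath: [$ORIGIN/../../../lib]
ls -l $PREFIX/lib/libnvToolsExt.so.1.0.0
#   libnvToolsExt.so.1.0.0 -> ../targets/x86_64-linux/lib/libnvToolsExt.so.1.0.0

Because the loader expands $ORIGIN using the directory the library was found in, finding it via the symlink in $PREFIX/lib makes $ORIGIN/../../../lib point two directories above the environment rather than back at $PREFIX/lib.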

cc @peterbell10

Installed packages

# packages in environment at /home/isuru/miniforge3/envs/nvtx:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
cuda-nvtx                 12.0.76              hcb278e6_0    conda-forge
cuda-version              12.0                 hffde075_2    conda-forge
libgcc-ng                 13.2.0               h807b86a_3    conda-forge
libgomp                   13.2.0               h807b86a_3    conda-forge
libstdcxx-ng              13.2.0               h7e041cc_3    conda-forge

Environment info

active environment : nvtx
    active env location : /home/isuru/miniforge3/envs/nvtx
            shell level : 2
       user config file : /home/isuru/.condarc
 populated config files : /home/isuru/miniforge3/.condarc
                          /home/isuru/.condarc
          conda version : 23.10.0
    conda-build version : 3.27.0
         python version : 3.10.13.final.0
       virtual packages : __archspec=1=zen3
                          __glibc=2.35=0
                          __linux=6.2.0=0
                          __unix=0=0
       base environment : /home/isuru/miniforge3  (writable)
      conda av data dir : /home/isuru/miniforge3/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /home/isuru/miniforge3/pkgs
                          /home/isuru/.conda/pkgs
       envs directories : /home/isuru/miniforge3/envs
                          /home/isuru/.conda/envs
               platform : linux-64
             user-agent : conda/23.10.0 requests/2.31.0 CPython/3.10.13 Linux/6.2.0-36-generic ubuntu/22.04.3 glibc/2.35 solver/libmamba conda-libmamba-solver/23.11.0 libmambapy/1.5.3
                UID:GID : 1000:1000
             netrc file : /home/isuru/.netrc
           offline mode : False
@isuruf isuruf added the bug Something isn't working label Nov 14, 2023
@isuruf isuruf mentioned this issue Nov 14, 2023
@jakirkham
Member

Thanks Isuru! 🙏

Is this only an NVTX issue or did we see this elsewhere?

Also, is there some background/context about where this came up? For example, is this related to PyTorch's NVTX search logic (where we had some issues recently: conda-forge/pytorch-cpu-feedstock#203 (comment)), or did this come up somewhere else?

@isuruf
Member Author

isuruf commented Nov 15, 2023

I think this issue will pop up for all CUDA libraries, but this is the first one that I saw.

The issue came up with a local build of pytorch built with the latest compilers from conda-forge. There are two copies of libstdc++ involved:

  • Old: $HOME/miniconda3/lib/libstdc++.so
  • New: $HOME/miniconda3/envs/pytorch/lib/libstdc++.so

Loading pytorch loads $HOME/miniconda3/envs/pytorch/lib/libnvToolsExt.so, which has an RPATH of $ORIGIN/../../../lib and therefore loads $HOME/miniconda3/lib/libstdc++.so. Note that the loading order matters: if libstdc++.so is loaded before libnvToolsExt.so, everything works as expected.
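
Spelled out with the paths above (the $ORIGIN expansion shown is my reading of the behaviour described here, where the loader uses the directory it found the library in rather than the real file's directory):

# found via the symlink in the env's lib/:
#   $ORIGIN              = $HOME/miniconda3/envs/pytorch/lib
#   $ORIGIN/../../../lib = $HOME/miniconda3/lib                   <- old libstdc++
# if the real file's directory were used instead:
#   $ORIGIN              = $HOME/miniconda3/envs/pytorch/targets/x86_64-linux/lib
#   $ORIGIN/../../../lib = $HOME/miniconda3/envs/pytorch/lib      <- intended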

@msarahan
Member

msarahan commented Nov 15, 2023

I think the fundamental issue here is that we're expecting users to use the library in /lib, while the actual library lives in a hierarchy that makes it easier for us (package maintainers/NVIDIA/whatever) to know at a glance that we have the right library, and that matches the layout of other deployments of NVIDIA packages.

In ye olden days (especially observed with lib and lib64), I think Ray would have said to NOT use lib64, but to keep Conda "pure", in that it need not adopt conventions that avoid conflicts Conda does not have. One solution is to move the contents of /targets/x86_64-linux up two levels and cut the targets/${arch} folders out. Do those folders really serve an important purpose for end users? How much would break if that change were made?

I discussed this with @jakirkham and we think this symlink reversal might make things more confusing in the future. Aside from reworking the package content structure, here are some other ideas:

  1. remove the symlink and replace it with a hard link or copy (a copy may be necessary). This seems like the safest, most-likely-to-work option.
  2. set build/binary_relocation to false to disable RPATH monkeying
  3. manually specify RPATHs such that both $ORIGIN/../../lib and $ORIGIN/../../../lib are searched, per https://docs.conda.io/projects/conda-build/en/stable/resources/define-metadata.html#rpaths (a concrete sketch follows this list)
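
To make idea 3 concrete, it would amount to stamping both entries into the library, e.g. with patchelf (illustrative only; the file path is assumed, and in a recipe this would go through the rpaths key linked above):

patchelf --set-rpath '$ORIGIN/../../lib:$ORIGIN/../../../lib' $PREFIX/targets/x86_64-linux/lib/libnvToolsExt.so.1.0.0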

@isuruf
Member Author

isuruf commented Nov 15, 2023

Removing the targets/x86_64-linux directory structure is not possible; that layout was adopted because some headers and libraries have fairly generic names without a cuda or nv prefix that might conflict with other packages' headers.

Option 2 is not a great solution, as libnvToolsExt.so wouldn't know where libstdc++.so is.
Option 3 is dangerous, as it adds paths that are outside the environment.

My PR does:
4. Reverse the symlinks so that lib/libnvToolsExt.so is a real file and targets/x86_64-linux/lib/libnvToolsExt.so is the symlink.

I prefer option 4 because targets/x86_64-linux/lib/ is only ever used for building and linking, and RPATHs don't matter for those.
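
Roughly, the layout after option 4 would look like this (illustrative; exact link targets depend on the recipe):

ls -l $PREFIX/lib/libnvToolsExt.so* $PREFIX/targets/x86_64-linux/lib/libnvToolsExt.so*
#   $PREFIX/lib/libnvToolsExt.so.1.0.0                              (real ELF file)
#   $PREFIX/lib/libnvToolsExt.so.1 -> libnvToolsExt.so.1.0.0
#   $PREFIX/targets/x86_64-linux/lib/libnvToolsExt.so.1.0.0 -> ../../../lib/libnvToolsExt.so.1.0.0

With that layout the real file sits in $PREFIX/lib, so the RPATH computed for it resolves to the environment's lib directory at runtime.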

@isuruf
Member Author

isuruf commented Nov 15, 2023

Option 1 is of course an option, but it would increase the package size.

@bdice
Contributor

bdice commented Nov 15, 2023

Option 2 is not a great solution as libnvToolsExt.so wouldn't know where libstdc++.so is.

Can you help me understand this problem in a bit more detail? I may be missing something. A CUDA Toolkit installed to the system (not with conda) doesn't have an RPATH specified for this library. This gives no output:

readelf -d /usr/local/cuda/targets/x86_64-linux/lib/libnvToolsExt.so | grep RPATH

That libnvToolsExt.so does have a NEEDED entry for libstdc++.so.6, like the package we're shipping here. I expected that we would want to ship conda CTK libraries the same way as other CTK distributions, with no RPATH, by setting build/binary_relocation to false.
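
For comparison, the NEEDED entry mentioned above can be checked the same way (expected output sketched; other entries omitted):

readelf -d /usr/local/cuda/targets/x86_64-linux/lib/libnvToolsExt.so | grep NEEDED
#   0x0000000000000001 (NEEDED)   Shared library: [libstdc++.so.6]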

@isuruf
Member Author

isuruf commented Nov 15, 2023

@bdice, that's because a system installation assumes that libstdc++.so comes from the default paths of the dynamic loader. If there's no RPATH, loading pytorch in my local install would load /usr/lib/libstdc++.so.6 instead of the one from the conda env, leading to the same problem.
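
A rough way to see that fallback (sketch; the exact resolution depends on ld.so.cache and any LD_LIBRARY_PATH in effect):

ldd $PREFIX/lib/libnvToolsExt.so.1 | grep libstdc++
#   with no RPATH/RUNPATH on the library, libstdc++.so.6 would resolve from the
#   system's default search path rather than from $PREFIX/lib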

@jakirkham
Member

One thing we have been discussing is using stub libraries at build time instead (conda-forge/cuda-feedstock#8).

If we changed nvcc to search for the CTK stub libraries first, would that solve the issue?

@isuruf
Member Author

isuruf commented Nov 15, 2023

That would mean having the real library in $PREFIX/lib, which is what PR #3 does anyway.

@jakirkham
Member

Maybe that issue description is a bit misleading. I think we are interested in the other changes there, but that may be further off.

At the moment I just mean swapping the order of flags like these,

so that ${PREFIX}/${targetsDir}/lib/stubs is searched before ${PREFIX}/${targetsDir}/lib.
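
i.e. roughly this change in ordering (illustrative only; the real flags live in the scripts referenced above):

# before:  ... -L${PREFIX}/${targetsDir}/lib        -L${PREFIX}/${targetsDir}/lib/stubs ...
# after:   ... -L${PREFIX}/${targetsDir}/lib/stubs  -L${PREFIX}/${targetsDir}/lib ...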

@isuruf
Member Author

isuruf commented Nov 15, 2023

I think you are misunderstanding. This issue is not about building or linking. It's about runtime loading.

@msarahan
Member

It's about both, because the symlinks conflate build-time usage with runtime usage. Any approach that avoids the symlink should be fine, and conceptually cleaner.

Reversing the symlink is fine for making the runtime work, but I think it muddies up the model of where "real" content exists. I think it is important that arch-specific packages continue to contain the "real" stuff (or perhaps the stubs that @jakirkham mentions) in /targets/.../lib, but that a runtime package should contain native libraries in /lib.

Symlinks are something done to save space here, and perhaps avoid a separate -dev and runtime package. I don't think the symlink is a helpful thing in any case, whether or not you reverse it.

@isuruf
Member Author

isuruf commented Nov 15, 2023

OK, so we all agree that:

  1. $PREFIX/lib/libnvToolsExt.so.1 should resolve to $PREFIX/lib/libnvToolsExt.so.1.0.0, which should be a real file (sketched below).
  2. We are not sure what $PREFIX/targets/x86_64-linux/lib/libnvToolsExt.so should be.

Currently, 1 is not true for this feedstock, and that's what I'm trying to fix.
For 2, yes, a stub library would be a good option to have, but not strictly necessary IMO.
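
For point 1, the desired state would look something like this (illustrative):

ls -l $PREFIX/lib/libnvToolsExt.so.1*
#   libnvToolsExt.so.1 -> libnvToolsExt.so.1.0.0
#   libnvToolsExt.so.1.0.0          (a regular ELF file, not a symlink)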

@msarahan
Member

We've been discussing this extensively internally and we definitely think it's a big issue, as it could affect CUDA libraries other than NVTX. The team is working on a document that details possible solutions in the context of CUDA library packages that exist in other forms. We'll file an issue in https://github.com/conda-forge/cuda-feedstock with more info when we've dug in more.

@jakirkham
Member

After extensive internal discussion and evaluating the potential options, we have devised a planned course of action and have laid this out in detail in issue: conda-forge/cuda-feedstock#10

Happy to answer any questions that may come up
