Skip to content

Latest commit

 

History

History
112 lines (83 loc) · 4.24 KB

fedora39-tensorflow-gpu.md

File metadata and controls

112 lines (83 loc) · 4.24 KB

Tensorflow doesnt work on Cuda 12.4 and Cudnn 9.x as of 24-03-2024. So don't follow these as of now

gratitude for providing an easy to use cuda and cudnn installation : https://negativo17.org/nvidia-driver/

  1. Add negativo17/fedora.nvidia and rpmfusion-nonfree-release repos
sudo dnf config-manager --add-repo=https://negativo17.org/repos/fedora-nvidia.repo
sudo dnf install \
  https://download1.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
sudo dnf upgrade
  1. Remove any existing nvidia thing you have (knowing what you're doing before running this command is a bonus)
sudo dnf remove *nvidia*
  1. Install nvidia-driver
sudo dnf install nvidia-driver 
  1. Install cuda drivers and cuda development packages and remove xorg-x11-drv-nvidia-cuda and xorg-x11-drv-nvidia-cuda-libs from rpmfusion-nonfree
sudo dnf remove xorg-x11-drv-nvidia-cuda
sudo dnf remove xorg-x11-drv-nvidia-cuda-libs
sudo dnf install nvidia-driver-cuda
sudo dnf install cuda-devel
sudo dnf remove xorg-x11-drv-nvidia-cuda
sudo dnf remove xorg-x11-drv-nvidia-cuda-libs
  1. Install cudnn and cudnn development packages
sudo dnf install cuda-cudnn.x86_64 cuda-cudnn-devel.x86_64
  1. If you are on the latest version of Fedora (I am on 40 as of writing this), the default version of Python would be the latest version which won't support tensorflow. So install an older version.
sudo dnf install python310 
python3.10 -m ensurepip
python3.10 -m pip install tensorrt #tensorrt doesnt exactly work on fedora 40 on latest nvidia drivers even with nvidia's official tensorrt package on their website but its good to have this module (i mean i couldnt make it work atleast)
  1. cd to your python project path
virtualenv -p `which python3.10` .venv 
source .venv/bin/activate
pip uninstall tensorflow[and-cuda]
sudo dnf install mlocate
sudo updatedb
cp $(locate libdevice.10.bc) .

Congratulations! You can now use tensorflow with CUDA on Fedora 40!

Footnotes:

  1. Sometimes Nvidia UVM bugs out so just remove the module right before using anything that uses CUDA or cuDNN. The kernel will reload it automatically when you run a program that requires it. Sometimes it still fails for no reason. In that case just spam the command a few times on the terminal (I don't know how but it works)
sudo modprobe -r nvidia_uvm #you might need this sometimes to reload
  1. If you are recieving errors such as 2024-01-10 23:40:45.421281: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered, install tf-nightly[and-cuda]
# Assuming you are on a virtual environment
pip uninstall tensorflow[and-cuda]
pip install tf-nightly[and-cuda]
  1. If you are recieving warnings like successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355, this is caused by a bug that resets numa_node to -1 after reboots :

    Fix 1 (temporary, needs to be run manually after every reboot, recommended if you are new to linux)

    1. Identify the PCI-ID of your GPU
    2. Set 0 to `/sys/bus/pci/devices/<PCI_ID>/numa_node

    For example

    karthik@fedora:~$ lspci -D | grep NVIDIA
    0000:01:00.0 VGA compatible controller: NVIDIA Corporation AD107M [GeForce RTX 4060 Max-Q / Mobile] (rev a1)
    0000:01:00.1 Audio device: NVIDIA Corporation Device 22be (rev a1)
    karthik@fedora:~$ sudo su
    [sudo] password for karthik: 
    root@fedora:/home/karthik# echo 0 | tee -a "/sys/bus/pci/devices/0000:01:00.0/numa_node"
    0

    Fix 2 (same as Fix 1 but permanent) credit

    # 1) Identify the PCI-ID (with domain) of your GPU
    #    For example: PCI_ID="0000.81:00.0"
    lspci -D | grep NVIDIA
    # 2) Add a crontab for root
    sudo crontab -e
    #    Add the following line
    @reboot (echo 0 | tee -a "/sys/bus/pci/devices/<PCI_ID>/numa_node")
  2. TODO: figure out how to make tensorrt work