
RTX 30x0 Support #10

Closed
reflare opened this issue Oct 16, 2020 · 13 comments

Comments

@reflare

reflare commented Oct 16, 2020

Your current docker image relies on an older version of CUDA.
The current 3080 and 3090 series GPUs are only supported under CUDA 11.1.
It would be wonderful if you could update the image or offer a workaround.

Error message when running with older CUDA:
nvcc fatal : Value 'sm_86' is not defined for option 'gpu-architecture'

@aiXander

Are you able to use RTX 30x0 when not using the docker image?

@levindabhi

TF 1.x only supports up to CUDA 10.x.
Check the compatible versions of TF and CUDA here.
I guess that to use an RTX 30x0, one has to build TensorFlow from source.
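The compatibility constraint mentioned above can be sketched as a small lookup (a hedged summary; the versions below are taken from TensorFlow's published "tested build configurations" and are illustrative, not exhaustive):

```python
# Hedged sketch: CUDA versions that official TensorFlow wheels were built
# against, per TensorFlow's tested-build-configurations page. Illustrative
# subset only -- note no official TF 1.x wheel targets CUDA 11.
TF_CUDA = {
    '1.15': '10.0',
    '2.0': '10.0',
    '2.1': '10.1',
    '2.4': '11.0',
}

def cuda_for_tf(tf_version):
    """Return the CUDA version an official TF wheel expects, or None if unknown."""
    return TF_CUDA.get(tf_version)

print(cuda_for_tf('1.15'))  # 10.0 -- hence the sm_86 failure on RTX 30x0
```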

@nurpax
Contributor

nurpax commented Oct 20, 2020

@reflare Can you try changing this code:

https://github.com/NVlabs/stylegan2-ada/blob/main/dnnlib/tflib/custom_ops.py#L57:

def _get_cuda_gpu_arch_string():
    gpus = [x for x in device_lib.list_local_devices() if x.device_type == 'GPU']
    if len(gpus) == 0:
        raise RuntimeError('No GPU devices found')
    (major, minor) = _get_compute_cap(gpus[0])
    return 'sm_%s%s' % (major, minor)

to:

def _get_cuda_gpu_arch_string():
    return 'sm_70'

I don't have an Ampere card in my machine to test this but it's worth seeing what happens with it. Problems might still remain, as TensorFlow (from the old container) would still be compiled against an old architecture and might be slower than what it could be.
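A slightly more general variant of this workaround (a sketch only, likewise untested on Ampere) would clamp the detected compute capability to the newest architecture the bundled CUDA toolkit can compile, relying on the driver's PTX JIT to run the result on newer cards. The version cutoffs below are assumptions based on NVCC release notes (CUDA 10.x tops out at sm_75, CUDA 11.0 at sm_80, CUDA 11.1 adds sm_86):

```python
# Hedged sketch: clamp a GPU's compute capability to the newest sm_* target
# the installed CUDA toolkit can compile, instead of hard-coding 'sm_70'.
# Cutoffs are assumptions from NVCC release notes, not verified exhaustively.

def max_supported_arch(cuda_version):
    """Newest (major, minor) architecture nvcc can target for this toolkit."""
    if cuda_version >= (11, 1):
        return (8, 6)
    if cuda_version >= (11, 0):
        return (8, 0)
    return (7, 5)  # CUDA 10.x

def gpu_arch_string(compute_cap, cuda_version):
    major, minor = min(compute_cap, max_supported_arch(cuda_version))
    return 'sm_%d%d' % (major, minor)

# An RTX 3090 (compute capability 8.6) on a CUDA 10.x toolkit:
print(gpu_arch_string((8, 6), (10, 0)))  # sm_75
# The same card once the toolkit is CUDA 11.1:
print(gpu_arch_string((8, 6), (11, 1)))  # sm_86
```

As with the hard-coded sm_70, kernels built for an older target would only reach the new card through JIT compilation, so some performance loss is expected.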

@reflare
Author

reflare commented Oct 23, 2020

Are you able to use RTX 30x0 when not using the docker image?

With the current unstable drivers, I am hesitant to roll them out onto my main machine.
For what it's worth, both TF1 and TF2 can be made to work with the NVIDIA NGC docker containers.

TF 1.x only supports up to CUDA 10.x.
Check the compatible versions of TF and CUDA here.
I guess that to use an RTX 30x0, one has to build TensorFlow from source.

Nvidia/tensorflow (as opposed to tensorflow/tensorflow) supports CUDA 11.1 for TF 1.15.

@reflare
Author

reflare commented Oct 23, 2020

@reflare Can you try changing this code:

https://github.com/NVlabs/stylegan2-ada/blob/main/dnnlib/tflib/custom_ops.py#L57:

def _get_cuda_gpu_arch_string():
    gpus = [x for x in device_lib.list_local_devices() if x.device_type == 'GPU']
    if len(gpus) == 0:
        raise RuntimeError('No GPU devices found')
    (major, minor) = _get_compute_cap(gpus[0])
    return 'sm_%s%s' % (major, minor)

to:

def _get_cuda_gpu_arch_string():
    return 'sm_70'

I don't have an Ampere card in my machine to test this but it's worth seeing what happens with it. Problems might still remain, as TensorFlow (from the old container) would still be compiled against an old architecture and might be slower than what it could be.

It got surprisingly far on CUDA 11.1 but unfortunately errored out with

2020-10-23 02:52:05.323323: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x5d014a0
2020-10-23 02:52:05.323974: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2020-10-23 02:52:07.613632: F tensorflow/core/kernels/cuda_solvers.cc:94] Check failed: cusolverDnCreate(&cusolver_dn_handle) == CUSOLVER_STATUS_SUCCESS Failed to create cuSolverDN instance.
Aborted

I am monitoring memory use on both GPUs and limiting the run to a single GPU just to make sure it's not an issue with parallelism. Memory is nowhere near exhausted, so it appears the cuSolver handle cannot be created due to the mismatch in versions.

I have run a second test on CUDA 11.0 and it errors out with the same message. However on CUDA 11.0 it also shows the following:

2020-10-23 02:57:38.895674: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal : Value 'sm_86' is not defined for option 'gpu-name'

@aiXander

aiXander commented Oct 23, 2020

Hey @reflare, I had that error too; it has to do with the latest bugfix for the NaNs in TensorBoard committed by @tkarras.
Essentially, the bugfix works when you run without freezed, but when you are using some fixed layers in the discriminator, something goes wrong in the tf.control_dependencies() call, which triggers this error ---> see #7
So I'd try running with freezed=0 (the default) and see if that also triggers the "Failed to create cuSolverDN instance" error!

ps: I would be very interested in hearing how you got this working on CUDA 11.*, since I'm also planning on ordering a few RTX 3090 cards myself.

@reflare
Author

reflare commented Oct 25, 2020

@tr1pzz Thank you for the input. That said, I didn't get around to testing it because:

Nvidia just released a new NGC image with native support for Tensorflow 1.15 and CUDA 11.1.
Long story short, the code can be run as-is from that image.
I recommend switching the official Dockerfile to be based on nvcr.io/nvidia/tensorflow:20.10-tf1-py3: the official TensorFlow project does not intend to support CUDA 11 for TF1, so NVIDIA's TensorFlow build seems like the logical choice.

nurpax pushed a commit that referenced this issue Oct 26, 2020
This works out of the box on newer NVIDIA GPUs such as RTX 3090.
Note that an NVIDIA driver release r455.23 or later is required to
run this image.

See README for instructions on how to build for older drivers.
@nurpax
Contributor

nurpax commented Oct 26, 2020

@reflare Thanks for the bug report and your comments on trying out nvcr.io/nvidia/tensorflow:20.10-tf1-py3!

I pushed a change that defaults to this base image. This requires a pretty new driver (r455.23) to run so I also added a README comment on how to revert back to the old base image.

@nurpax nurpax closed this as completed Oct 26, 2020
8secz-johndpope pushed a commit to johndpope/stylegan2-ada that referenced this issue Dec 29, 2020
@aisdn

aisdn commented Jan 20, 2021

@reflare Thanks for the bug report and your comments on trying out nvcr.io/nvidia/tensorflow:20.10-tf1-py3!

I pushed a change that defaults to this base image. This requires a pretty new driver (r455.23) to run so I also added a README comment on how to revert back to the old base image.


Hi, I have managed to use nvcr.io/nvidia/tensorflow:20.10-tf1-py3 to train the stylegan2-ada project on a 3090 with CUDA 11. But I found that training seems to be about twice as slow on the 3090 as on a 1080 Ti. I wonder if everyone has the same problem. Are there ways to increase the training speed in docker with a 3090?

@johndpope

johndpope commented Jan 20, 2021

I got stylegan2-ada working without docker by bumping TensorFlow to version 2 in compatibility mode. It's quite trivial. NVlabs should do this, or we can pray they release a TensorFlow 1.16 with CUDA 11 support (they probably won't). You'll need CUDA toolkit 11.2 / driver 460 to get it working. You can use Timeshift to save a snapshot of your working config.
https://www.github.com/johndpope/stylegan2-ada

For training mode, check the digressions branch / it has one tiny fix.

worosom added a commit to RefikAnadolStudio/stylegan2-ada that referenced this issue Jan 14, 2022
@rubmz

rubmz commented Apr 22, 2022

So it's 2022 ... I got a 3090 and can't do StyleGAN2? (sad face here...)

Because CUDA / PyTorch 10 will not play well with my hardware...

@jannehellsten

jannehellsten commented Apr 23, 2022

This repository (stylegan2-ada) is in fact the TensorFlow version of StyleGAN2-ADA, which is completely unsupported and unmaintained.

You should be able to get the pytorch version (stylegan2-ada-pytorch or even more recently, stylegan3) working on 3090. I don't see why something like pytorch 1.11 with CUDA 11.3 wouldn't work on 3090.

@waffletower

waffletower commented Jul 17, 2022

I have just given up on using this TensorFlow 1.15-based model on an RTX 3090. I had great success on an RTX 2080 Ti but have not been able to train successfully with the RTX 3090. While still unworkable, I came closest to success by upgrading the base image to 22.06-tf1-py3. Augmentation was still broken, with the added bonus of CPU-side memory leaks, and training without augmentation, at least with my particular 5,000-image dataset, was untenable.

You should be able to get the pytorch version (stylegan2-ada-pytorch or even more recently, stylegan3) working on 3090. I don't see why something like pytorch 1.11 with CUDA 11.3 wouldn't work on 3090.

There are several GitHub issues on stylegan2-ada-pytorch indicating that the model is not compatible with PyTorch 1.11. The code tree does have some explicit commits to support PyTorch 1.9. It is unclear whether PyTorch 1.10 will play well. The official base image for the repo still uses PyTorch 1.8. I am about to switch over and see how successful stylegan2-ada-pytorch is with an RTX 3090.
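Based on the version constraints described above (1.8–1.9 reportedly fine, 1.11 reportedly broken), a hedged pre-flight check might look like this; the supported range is an assumption drawn from this thread, not from the repo's documentation:

```python
# Hedged sketch: check a PyTorch version string against the range this thread
# suggests works for stylegan2-ada-pytorch. The (1, 7)-(1, 9) bounds are an
# assumption from the comments above, not an official compatibility claim.

def parse_version(v):
    """'1.8.1+cu111' -> (1, 8); ignores the patch level and local build tag."""
    return tuple(int(x) for x in v.split('+')[0].split('.')[:2])

def torch_version_ok(v, lo=(1, 7), hi=(1, 9)):
    return lo <= parse_version(v) <= hi

print(torch_version_ok('1.8.1+cu111'))  # True
print(torch_version_ok('1.11.0'))       # False
```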
