
RTX 30x0 Support #10

Closed
reflare opened this issue Oct 16, 2020 · 13 comments

Comments

@reflare

reflare commented Oct 16, 2020

Your current docker image relies on an older version of CUDA.
The current 3080 and 3090 series GPUs are only supported under CUDA 11.1.
It would be wonderful if you could update the image or offer a workaround.

Error message when running with older CUDA:
nvcc fatal : Value 'sm_86' is not defined for option 'gpu-architecture'

@aiXander

Are you able to use RTX 30x0 when not using the docker image?

@levindabhi

TF 1.x only supports up to CUDA 10.x.
Check the compatible versions of TF and CUDA here.
I guess that to use an RTX 30x0, one has to build TensorFlow from source.
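The compatibility constraint mentioned above can be sketched as a small lookup (a hedged summary; the versions below are taken from TensorFlow's published "tested build configurations" and are illustrative, not exhaustive):

```python
# Hedged sketch: CUDA versions that official TensorFlow wheels were built
# against, per TensorFlow's tested-build-configurations page. Illustrative
# subset only -- note no official TF 1.x wheel targets CUDA 11.
TF_CUDA = {
    '1.15': '10.0',
    '2.0': '10.0',
    '2.1': '10.1',
    '2.4': '11.0',
}

def cuda_for_tf(tf_version):
    """Return the CUDA version an official TF wheel expects, or None if unknown."""
    return TF_CUDA.get(tf_version)

print(cuda_for_tf('1.15'))  # 10.0 -- hence the sm_86 failure on RTX 30x0
```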

@nurpax
Contributor

nurpax commented Oct 20, 2020

@reflare Can you try changing this code:

https://github.com/NVlabs/stylegan2-ada/blob/main/dnnlib/tflib/custom_ops.py#L57:

def _get_cuda_gpu_arch_string():
    gpus = [x for x in device_lib.list_local_devices() if x.device_type == 'GPU']
    if len(gpus) == 0:
        raise RuntimeError('No GPU devices found')
    (major, minor) = _get_compute_cap(gpus[0])
    return 'sm_%s%s' % (major, minor)

to:

def _get_cuda_gpu_arch_string():
    return 'sm_70'

I don't have an Ampere card in my machine to test this but it's worth seeing what happens with it. Problems might still remain, as TensorFlow (from the old container) would still be compiled against an old architecture and might be slower than what it could be.
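A slightly more general variant of this workaround (a sketch only, likewise untested on Ampere) would clamp the detected compute capability to the newest architecture the bundled CUDA toolkit can compile, relying on the driver's PTX JIT to run the result on newer cards. The version cutoffs below are assumptions based on NVCC release notes (CUDA 10.x tops out at sm_75, CUDA 11.0 at sm_80, CUDA 11.1 adds sm_86):

```python
# Hedged sketch: clamp a GPU's compute capability to the newest sm_* target
# the installed CUDA toolkit can compile, instead of hard-coding 'sm_70'.
# Cutoffs are assumptions from NVCC release notes, not verified exhaustively.

def max_supported_arch(cuda_version):
    """Newest (major, minor) architecture nvcc can target for this toolkit."""
    if cuda_version >= (11, 1):
        return (8, 6)
    if cuda_version >= (11, 0):
        return (8, 0)
    return (7, 5)  # CUDA 10.x

def gpu_arch_string(compute_cap, cuda_version):
    major, minor = min(compute_cap, max_supported_arch(cuda_version))
    return 'sm_%d%d' % (major, minor)

# An RTX 3090 (compute capability 8.6) on a CUDA 10.x toolkit:
print(gpu_arch_string((8, 6), (10, 0)))  # sm_75
# The same card once the toolkit is CUDA 11.1:
print(gpu_arch_string((8, 6), (11, 1)))  # sm_86
```

As with the hard-coded sm_70, kernels built for an older target would only reach the new card through JIT compilation, so some performance loss is expected.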

@reflare
Author

reflare commented Oct 23, 2020

Are you able to use RTX 30x0 when not using the docker image?

With the current unstable drivers, I am hesitant to roll them out onto my main machine.
For what it's worth, both TF1 and TF2 can be made to work with the NVIDIA NGC docker containers.

TF 1.x only supports up to CUDA 10.x.
Check the compatible versions of TF and CUDA here.
I guess that to use an RTX 30x0, one has to build TensorFlow from source.

Nvidia/tensorflow (as opposed to tensorflow/tensorflow) supports CUDA 11.1 for TF 1.15.

@reflare
Author

reflare commented Oct 23, 2020

@reflare Can you try changing this code:

https://github.com/NVlabs/stylegan2-ada/blob/main/dnnlib/tflib/custom_ops.py#L57:

def _get_cuda_gpu_arch_string():
    gpus = [x for x in device_lib.list_local_devices() if x.device_type == 'GPU']
    if len(gpus) == 0:
        raise RuntimeError('No GPU devices found')
    (major, minor) = _get_compute_cap(gpus[0])
    return 'sm_%s%s' % (major, minor)

to:

def _get_cuda_gpu_arch_string():
    return 'sm_70'

I don't have an Ampere card in my machine to test this but it's worth seeing what happens with it. Problems might still remain, as TensorFlow (from the old container) would still be compiled against an old architecture and might be slower than what it could be.

It got surprisingly far on CUDA 11.1 but unfortunately errored out with

2020-10-23 02:52:05.323323: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x5d014a0
2020-10-23 02:52:05.323974: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2020-10-23 02:52:07.613632: F tensorflow/core/kernels/cuda_solvers.cc:94] Check failed: cusolverDnCreate(&cusolver_dn_handle) == CUSOLVER_STATUS_SUCCESS Failed to create cuSolverDN instance.
Aborted

I am monitoring memory use on both GPUs and limiting the run to a single GPU just to make sure it's not an issue with parallelism. Memory is nowhere near exhausted, so it appears the cuSolver handle cannot be created due to the mismatch in versions.

I have run a second test on CUDA 11.0 and it errors out with the same message. However on CUDA 11.0 it also shows the following:

2020-10-23 02:57:38.895674: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal : Value 'sm_86' is not defined for option 'gpu-name'

@aiXander

aiXander commented Oct 23, 2020

Hey @reflare, I had that error too; it has to do with the latest bugfix for the NaNs in TensorBoard committed by @tkarras.
Essentially, the bugfix works when you run without freezed, but when you are using some fixed layers in the discriminator, something goes wrong in the tf.control_dependencies() call, which triggers this error ---> see #7
So I'd try running with freezed=0 (the default) and see if that also triggers the "Failed to create cuSolverDN instance" error!

ps: I would be very interested in hearing how you got this working on CUDA 11.*, since I'm also planning on ordering a few RTX 3090 cards myself.

@reflare
Author

reflare commented Oct 25, 2020

@tr1pzz Thank you for the input. That said, I didn't get around to testing it because:

Nvidia just released a new NGC image with native support for Tensorflow 1.15 and CUDA 11.1.
Long story short, the code can be run as-is from that image.
I recommend switching the official Dockerfile to be based on nvcr.io/nvidia/tensorflow:20.10-tf1-py3: the official TensorFlow project does not intend to support CUDA 11 for TF1, so NVIDIA's TensorFlow build seems like the logical choice.

nurpax pushed a commit that referenced this issue Oct 26, 2020
This works out of the box on newer NVIDIA GPUs such as RTX 3090.
Note that an NVIDIA driver release r455.23 or later is required to
run this image.

See README for instructions on how to build for older drivers.
@nurpax
Contributor

nurpax commented Oct 26, 2020

@reflare Thanks for the bug report and your comments on trying out nvcr.io/nvidia/tensorflow:20.10-tf1-py3!

I pushed a change that defaults to this base image. This requires a pretty new driver (r455.23) to run so I also added a README comment on how to revert back to the old base image.

@nurpax nurpax closed this as completed Oct 26, 2020
8secz-johndpope pushed a commit to johndpope/stylegan2-ada that referenced this issue Dec 29, 2020
@aisdn

aisdn commented Jan 20, 2021

@reflare Thanks for the bug report and your comments on trying out nvcr.io/nvidia/tensorflow:20.10-tf1-py3!

I pushed a change that defaults to this base image. This requires a pretty new driver (r455.23) to run so I also added a README comment on how to revert back to the old base image.


Hi, I have managed to use nvcr.io/nvidia/tensorflow:20.10-tf1-py3 to train the stylegan2-ada project on a 3090 with CUDA 11. But I found that training seems to be about twice as slow on the 3090 as on a 1080 Ti. I wonder if everyone has the same problem. Are there ways to increase the training speed in docker with a 3090?

@johndpope

johndpope commented Jan 20, 2021

I got stylegan2-ada working without docker by bumping TensorFlow to version 2 in compatibility mode. It's quite trivial. NVlabs should do this, or we can pray they release a TensorFlow 1.16 with CUDA 11 support (they probably won't). You'll need CUDA toolkit 11.2 / driver 460 to get it working. You can use Timeshift to save a snapshot of your working config.
https://www.github.com/johndpope/stylegan2-ada

For training mode, check the digressions branch / it has one tiny fix.

worosom added a commit to RefikAnadolStudio/stylegan2-ada that referenced this issue Jan 14, 2022
@rubmz

rubmz commented Apr 22, 2022

So it's 2022 ... I got a 3090 and can't do StyleGAN2? (sad face here...)

Because CUDA / PyTorch 10 will not play well with my hardware...

@jannehellsten

jannehellsten commented Apr 23, 2022

This repository (stylegan2-ada) is in fact the TensorFlow version of StyleGAN2-ADA, which is completely unsupported and unmaintained.

You should be able to get the pytorch version (stylegan2-ada-pytorch or even more recently, stylegan3) working on 3090. I don't see why something like pytorch 1.11 with CUDA 11.3 wouldn't work on 3090.

@waffletower

waffletower commented Jul 17, 2022

I have just given up on using this TensorFlow 1.15-based model on an RTX 3090. I had great success on an RTX 2080 Ti but have not been able to train successfully with the RTX 3090. While still unworkable, I came closest to success by upgrading the base image to 22.06-tf1-py3. Augmentation was still broken, with the added bonus of CPU-side memory leaks, and training without augmentation, at least with my particular 5,000-image dataset, was untenable.

You should be able to get the pytorch version (stylegan2-ada-pytorch or even more recently, stylegan3) working on 3090. I don't see why something like pytorch 1.11 with CUDA 11.3 wouldn't work on 3090.

There are several GitHub issues on stylegan2-ada-pytorch indicating that the model is not compatible with PyTorch 1.11. The code tree does have some explicit commits to support PyTorch 1.9. It is unclear whether PyTorch 1.10 will play well. The official base image for the repo still uses PyTorch 1.8. I am about to switch over and see how successful stylegan2-ada-pytorch is with an RTX 3090.
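Based on the version constraints described above (1.8–1.9 reportedly fine, 1.11 reportedly broken), a hedged pre-flight check might look like this; the supported range is an assumption drawn from this thread, not from the repo's documentation:

```python
# Hedged sketch: check a PyTorch version string against the range this thread
# suggests works for stylegan2-ada-pytorch. The (1, 7)-(1, 9) bounds are an
# assumption from the comments above, not an official compatibility claim.

def parse_version(v):
    """'1.8.1+cu111' -> (1, 8); ignores the patch level and local build tag."""
    return tuple(int(x) for x in v.split('+')[0].split('.')[:2])

def torch_version_ok(v, lo=(1, 7), hi=(1, 9)):
    return lo <= parse_version(v) <= hi

print(torch_version_ok('1.8.1+cu111'))  # True
print(torch_version_ok('1.11.0'))       # False
```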
