This repository has been archived by the owner on Jan 10, 2023. It is now read-only.

Error "Retval[7] does not have value" when training SSGAN #43

Open
hankook opened this issue Jan 18, 2020 · 3 comments
Comments

hankook commented Jan 18, 2020

I am using TensorFlow 1.13.2, CUDA 10.0, and cuDNN 7.6.5. I also tried TensorFlow 1.14 and 1.15, but got the same error message. Details are below.

When training SSGAN, I used the following gin configuration, which is slightly modified from examples/resnet_cifar10.gin:

dataset.name = "cifar10"
options.architecture = "resnet_cifar_arch"
options.batch_size = 64
options.gan_class = @SSGAN
options.lamba = 1
options.training_steps = 40000
options.z_dim = 128

# Generator
G.batch_norm_fn = @batch_norm
standardize_batch.decay = 0.9
standardize_batch.epsilon = 1e-5

# Discriminator
options.disc_iters = 5
D.spectral_norm = True

# Loss and optimizer
loss.fn = @non_saturating
penalty.fn = @no_penalty
SSGAN.g_lr = 0.0002
SSGAN.g_optimizer_fn = @tf.train.AdamOptimizer
SSGAN.rotated_batch_size = 64
tf.train.AdamOptimizer.beta1 = 0.5
tf.train.AdamOptimizer.beta2 = 0.999

Training then failed with the following error:

Traceback (most recent call last):
  File "main.py", line 133, in <module>
    app.run(main)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "main.py", line 127, in main
    eval_every_steps=FLAGS.eval_every_steps)
  File "/home/hankook/Codes/compare_gan/compare_gan/runner_lib.py", line 337, in run_with_schedule
    hooks=train_hooks)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2457, in train
    rendezvous.raise_errors()
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2452, in train
    saving_listeners=saving_listeners)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default
    saving_listeners)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
    run_metadata=run_metadata)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
    run_metadata=run_metadata)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
    return self._sess.run(*args, **kwargs)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Retval[7] does not have value

With the unmodified examples/resnet_cifar10.gin, training works. How can I fix this issue? Are there any example gin configurations for SSGAN?

zengsn commented Feb 15, 2020

Yes, I got the same error. It occurs whenever options.disc_iters > 1, no matter which GAN architecture is used.

I tried to debug it but have found no workaround so far. Could you please help with this?

@Marvin182

zengsn commented Feb 20, 2020

After debugging, I found that one more setting is needed when training on a GPU instead of a TPU:

ModularGAN.experimental_force_graph_unroll=True
options.disc_iters = 2  # if > 1

But as the code suggests, make sure your GPU has enough memory.

Further discussion is welcome.
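
For anyone checking their own config against the condition reported in this thread, here is a minimal sketch in plain Python. It does not use gin or compare_gan; `parse_bindings` and `needs_graph_unroll` are hypothetical helpers written for illustration, the parser only handles simple one-line `name = value` bindings, and treating a missing `options.disc_iters` as 1 is an assumption rather than a documented default.

```python
def parse_bindings(gin_text):
    """Collect simple one-line `name = value` bindings from gin-style text."""
    bindings = {}
    for raw in gin_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        bindings[key] = value
    return bindings


def needs_graph_unroll(gin_text):
    """True if disc_iters > 1 is set without the force-unroll binding."""
    bindings = parse_bindings(gin_text)
    # Assumption: a missing disc_iters binding is treated as 1.
    disc_iters = int(bindings.get("options.disc_iters", "1"))
    unroll = bindings.get("ModularGAN.experimental_force_graph_unroll") == "True"
    return disc_iters > 1 and not unroll


# Excerpt of the config from the first post, then the same with the workaround.
original = """
options.gan_class = @SSGAN
options.disc_iters = 5
"""
patched = original + "ModularGAN.experimental_force_graph_unroll=True\n"

print(needs_graph_unroll(original))  # True
print(needs_graph_unroll(patched))   # False
```

This only flags the combination described above (GPU training, `options.disc_iters` greater than 1, no forced unroll); it says nothing about whether the workaround resolves the error for every GAN class.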

@czzerone

@zengsn @hankook Hi, I got the same error. Have you solved this problem?
