This repository has been archived by the owner on Aug 3, 2021. It is now read-only.

Multi GPU training hangs #448

Open
muntasir2000 opened this issue May 24, 2019 · 13 comments

@muntasir2000

muntasir2000 commented May 24, 2019

When I train DeepSpeech2 with the example configs on 3 GPUs, training hangs indefinitely. Single-GPU training works fine with the same config file. I also tried Horovod and hit the same problem.
I'm using the nvcr.io/nvidia/tensorflow:18.12-py3 Docker image.

@borisgin
Contributor

Can you attach the log file, please?

@muntasir2000
Author

muntasir2000 commented May 24, 2019

`
*** Starting training from scratch
*** Training config:
{'batch_size_per_gpu': 20,
'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
'data_layer_params': {'augmentation': {'noise_level_max': -60,
'noise_level_min': -90,
'speed_perturbation_ratio': 0.1},
'dataset_files': ['/hdd/stt-16k-seq2seq-train.csv'],
'input_type': 'spectrogram',
'max_duration': 16.7,
'num_audio_features': 160,
'shuffle': True,
'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
'decoder_params': {'alpha': 2.0,
'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
'beam_width': 512,
'beta': 1.0,
'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
'lm_path': '/lm/lm.binary',
'trie_path': 'language_model/trie.binary',
'use_language_model': False},
'dtype': tf.float32,
'encoder': <class 'open_seq2seq.encoders.ds2_encoder.DeepSpeech2Encoder'>,
'encoder_params': {'activation_fn': <function relu at 0x7f2f9e611bf8>,
'conv_layers': [{'kernel_size': [11, 41],
'num_channels': 32,
'padding': 'SAME',
'stride': [2, 2]},
{'kernel_size': [11, 21],
'num_channels': 64,
'padding': 'SAME',
'stride': [1, 2]},
{'kernel_size': [11, 21],
'num_channels': 96,
'padding': 'SAME',
'stride': [1, 2]}],
'data_format': 'channels_first',
'dropout_keep_prob': 0.5,
'n_hidden': 1600,
'num_rnn_layers': 5,
'rnn_cell_dim': 800,
'rnn_type': 'cudnn_gru',
'rnn_unidirectional': False,
'row_conv': False,
'use_cudnn_rnn': True},
'eval_steps': 500,
'initializer': <function xavier_initializer at 0x7f2f800be9d8>,
'larc_params': {'larc_eta': 0.001},
'load_model': '',
'logdir': 'experiments/2-mfi/logs',
'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
'loss_params': {},
'lr_policy': <function poly_decay at 0x7f2f79918840>,
'lr_policy_params': {'learning_rate': 0.0001, 'power': 0.5},
'num_epochs': 50,
'num_gpus': 3,
'optimizer': 'Adam',
'print_loss_steps': 10,
'print_samples_steps': 500,
'random_seed': 0,
'regularizer': <function l2_regularizer at 0x7f2f80022c80>,
'regularizer_params': {'scale': 0.0005},
'save_checkpoint_steps': 1000,
'save_summaries_steps': 100,
'summaries': ['learning_rate',
'variables',
'gradients',
'larc_summaries',
'variable_norm',
'gradient_norm',
'global_gradient_norm'],
'use_horovod': False,
'use_xla_jit': False}
*** Evaluation config:
{'batch_size_per_gpu': 20,
'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
'data_layer_params': {'dataset_files': ['/hdd/stt-16k-seq2seq-dev.csv'],
'input_type': 'spectrogram',
'num_audio_features': 160,
'shuffle': False,
'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
'decoder_params': {'alpha': 2.0,
'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
'beam_width': 512,
'beta': 1.0,
'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
'lm_path': '/lm/lm.binary',
'trie_path': 'language_model/trie.binary',
'use_language_model': False},
'dtype': tf.float32,
'encoder': <class 'open_seq2seq.encoders.ds2_encoder.DeepSpeech2Encoder'>,
'encoder_params': {'activation_fn': <function relu at 0x7f2f9e611bf8>,
'conv_layers': [{'kernel_size': [11, 41],
'num_channels': 32,
'padding': 'SAME',
'stride': [2, 2]},
{'kernel_size': [11, 21],
'num_channels': 64,
'padding': 'SAME',
'stride': [1, 2]},
{'kernel_size': [11, 21],
'num_channels': 96,
'padding': 'SAME',
'stride': [1, 2]}],
'data_format': 'channels_first',
'dropout_keep_prob': 0.5,
'n_hidden': 1600,
'num_rnn_layers': 5,
'rnn_cell_dim': 800,
'rnn_type': 'cudnn_gru',
'rnn_unidirectional': False,
'row_conv': False,
'use_cudnn_rnn': True},
'eval_steps': 500,
'initializer': <function xavier_initializer at 0x7f2f800be9d8>,
'larc_params': {'larc_eta': 0.001},
'load_model': '',
'logdir': 'experiments/2-mfi/logs',
'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
'loss_params': {},
'lr_policy': <function poly_decay at 0x7f2f79918840>,
'lr_policy_params': {'learning_rate': 0.0001, 'power': 0.5},
'num_epochs': 50,
'num_gpus': 3,
'optimizer': 'Adam',
'print_loss_steps': 10,
'print_samples_steps': 500,
'random_seed': 0,
'regularizer': <function l2_regularizer at 0x7f2f80022c80>,
'regularizer_params': {'scale': 0.0005},
'save_checkpoint_steps': 1000,
'save_summaries_steps': 100,
'summaries': ['learning_rate',
'variables',
'gradients',
'larc_summaries',
'variable_norm',
'gradient_norm',
'global_gradient_norm'],
'use_horovod': False,
'use_xla_jit': False}
*** Building graph on GPU:0
*** Building graph on GPU:1
*** Building graph on GPU:2
*** Trainable variables:
*** ForwardPass/ds2_encoder/conv1/kernel:0
*** shape: (11, 41, 1, 32), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/conv1/bn/gamma:0
*** shape: (32,), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/conv1/bn/beta:0
*** shape: (32,), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/conv2/kernel:0
*** shape: (11, 21, 32, 64), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/conv2/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/conv2/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/conv3/kernel:0
*** shape: (11, 21, 64, 96), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/conv3/bn/gamma:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/conv3/bn/beta:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/cudnn_gru/opaque_kernel:0
*** shape: , <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/fully_connected/kernel:0
*** shape: (1600, 1600), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/fully_connected/bias:0
*** shape: (1600,), <dtype: 'float32_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0
*** shape: (1600, 66), <dtype: 'float32_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0
*** shape: (66,), <dtype: 'float32_ref'>
*** Encountered unknown variable shape, can't compute total number of parameters.
*** Building graph on GPU:0
*** Building graph on GPU:1
*** Building graph on GPU:2
2019-05-25 01:56:26.454436: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-25 01:56:26.644988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:09:00.0
totalMemory: 10.92GiB freeMemory: 10.77GiB
2019-05-25 01:56:26.767056: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6325
pciBusID: 0000:0a:00.0
totalMemory: 10.92GiB freeMemory: 10.77GiB
2019-05-25 01:56:26.855649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:41:00.0
totalMemory: 10.91GiB freeMemory: 10.63GiB
2019-05-25 01:56:26.859528: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2
2019-05-25 01:56:28.743446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-25 01:56:28.743487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2
2019-05-25 01:56:28.743499: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y
2019-05-25 01:56:28.743508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y
2019-05-25 01:56:28.743517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N
2019-05-25 01:56:28.744812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10419 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
2019-05-25 01:56:28.746374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10419 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1)
2019-05-25 01:56:28.746584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10280 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:41:00.0, compute capability: 6.1)

`

This is the log file. Note that this log was generated without Docker, but the problem is the same with Docker. Training just gets stuck at this point, and I can't even kill the process without restarting the PC.

Here is the output of nvidia-smi, if it helps. Thanks
[Screenshot: nvidia-smi output]

@borisgin
Contributor

This may be a mismatch between the CUDA version/driver and the TF container.
Can you try the latest container, tensorflow:19.04-py3 or tensorflow:19.05-py3, please?

@muntasir2000
Author

I also tried without a Docker container. Anyway, I'll try the tensorflow:19.05-py3 image.

@muntasir2000
Author

muntasir2000 commented May 27, 2019

I tried the tensorflow:19.05-py3 Docker image. Same issue: training hangs.
The log file follows:


WARNING: Please update time_stretch_ratio to speed_perturbation_ratio
WARNING: Please update time_stretch_ratio to speed_perturbation_ratio
*** Building graph on GPU:0
WARNING:tensorflow:From /workspace/OpenSeq2Seq/open_seq2seq/data/speech2text/speech2text.py:216: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use
tf.py_function, which takes a python function which manipulates tf eager
tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
an ndarray (just call tensor.numpy()) but having access to eager tensors
means tf.py_functions can use accelerators such as GPUs as well as
being differentiable using a gradient tape.

WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/dataset_ops.py:1419: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /workspace/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:159: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.conv2d instead.
WARNING:tensorflow:From /workspace/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:177: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.batch_normalization instead.
WARNING:tensorflow:From /workspace/OpenSeq2Seq/open_seq2seq/encoders/ds2_encoder.py:387: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
WARNING:tensorflow:From /workspace/OpenSeq2Seq/open_seq2seq/encoders/ds2_encoder.py:389: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob.
*** Building graph on GPU:1
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/learning_rate_decay_v2.py:321: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
*** Trainable variables:
*** ForwardPass/ds2_encoder/conv1/kernel:0
*** shape: (11, 41, 1, 32), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/conv1/bn/gamma:0
*** shape: (32,), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/conv1/bn/beta:0
*** shape: (32,), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/conv2/kernel:0
*** shape: (11, 21, 32, 64), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/conv2/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/conv2/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/conv3/kernel:0
*** shape: (11, 21, 64, 96), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/conv3/bn/gamma:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/conv3/bn/beta:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/cudnn_gru/opaque_kernel:0
*** shape: , <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/fully_connected/kernel:0
*** shape: (1600, 1600), <dtype: 'float32_ref'>
*** ForwardPass/ds2_encoder/fully_connected/bias:0
*** shape: (1600,), <dtype: 'float32_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0
*** shape: (1600, 66), <dtype: 'float32_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0
*** shape: (66,), <dtype: 'float32_ref'>
*** Encountered unknown variable shape, can't compute total number of parameters.
*** Building graph on GPU:0
*** Building graph on GPU:1
2019-05-27 21:05:16.629783: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3792535000 Hz
2019-05-27 21:05:16.631137: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x12990420 executing computations on platform Host. Devices:
2019-05-27 21:05:16.631165: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): ,
2019-05-27 21:05:16.865771: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x12aacfb0 executing computations on platform CUDA. Devices:
2019-05-27 21:05:16.865821: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-05-27 21:05:16.865832: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-05-27 21:05:16.866588: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6325
pciBusID: 0000:0a:00.0
totalMemory: 10.92GiB freeMemory: 10.77GiB
2019-05-27 21:05:16.867113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:41:00.0
totalMemory: 10.91GiB freeMemory: 10.38GiB
2019-05-27 21:05:16.868351: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
2019-05-27 21:05:18.512599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-27 21:05:18.512643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1
2019-05-27 21:05:18.512655: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N Y
2019-05-27 21:05:18.512659: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: Y N
2019-05-27 21:05:18.513611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10413 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1)
2019-05-27 21:05:18.514182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10034 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:41:00.0, compute capability: 6.1)
2019-05-27 21:05:38.831218: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10 locally


Output from nvidia-smi (using GPUs 1 and 2; GPU 0 is being used by another process):
[Screenshot: nvidia-smi output]
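
The thread does not say how GPUs 1 and 2 were selected for this run. One common way, shown here as an assumption rather than as what the author did, is the CUDA_VISIBLE_DEVICES environment variable. CUDA then renumbers the visible cards starting from 0, which would be consistent with the log above listing the card at 0000:0a:00.0 as device 0. The helper name env_with_visible_gpus is hypothetical, and the child process below is only a stand-in for the real training command.

```python
import os
import subprocess
import sys


def env_with_visible_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPU indices.

    Hypothetical helper: it only sets CUDA_VISIBLE_DEVICES and touches nothing else.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


if __name__ == "__main__":
    # Expose physical GPUs 1 and 2; inside the child they appear as devices 0 and 1.
    env = env_with_visible_gpus([1, 2])
    # Stand-in child that just echoes the variable; replace with the training command.
    child = [sys.executable, "-c",
             "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
    result = subprocess.run(child, env=env, stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    print(result.stdout.strip())
```

With two GPUs visible, the config value 'num_gpus' would also need to be 2 instead of 3.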

@borisgin
Contributor

borisgin commented May 28, 2019

Thanks, this looks like a bug. I will check with our TF team for a possible cause and solution.

borisgin added the bug label May 28, 2019
@borisgin
Contributor

borisgin commented Jun 6, 2019

Can you check if you can successfully run these nccl tests on that machine? https://github.com/nvidia/nccl-tests
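
Because a hung run on this machine was reported as hard to kill, it may help to launch the NCCL test under a timeout. The following is a minimal sketch, not part of nccl-tests itself. It assumes the tests were already built in the current directory, so that ./build/all_reduce_perf exists, and uses the argument set from the nccl-tests README example. The function names build_all_reduce_cmd and finishes_within are hypothetical, and the demo at the bottom runs a sleeping Python child instead of the real binary.

```python
import subprocess
import sys


def build_all_reduce_cmd(num_gpus, binary="./build/all_reduce_perf"):
    """Build the all-reduce test command: sizes from 8 bytes to 128 MB, doubling each step."""
    return [binary, "-b", "8", "-e", "128M", "-f", "2", "-g", str(num_gpus)]


def finishes_within(cmd, timeout_s):
    """Return True if cmd exits within timeout_s seconds, False if it had to be stopped."""
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()
        try:
            # Give the process a moment to go away after the kill signal.
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            # A process stuck inside the GPU driver may ignore the kill signal.
            pass
        return False


if __name__ == "__main__":
    print(build_all_reduce_cmd(3))
    # Stand-in for a hanging test; swap in build_all_reduce_cmd(3) on the real machine.
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(finishes_within(sleeper, timeout_s=1))
```

A False result only says that the command did not finish in time. It does not identify the cause of the hang.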

@muntasir2000
Author

I tried to run nccl-tests, but the tests hang the same way OpenSeq2Seq does. All GPUs constantly show 100% usage, but nothing progresses.
I'm now trying to follow this:
NVIDIA/caffe#10

I'll post the result. Thanks.

@lorinczb

I am trying to run tacotron-gst on a single GPU, but it hangs at the same spot and does not get past the line "Successfully opened dynamic library libcublas.so.10.0". Was this issue resolved? I am running it on Colaboratory.

@borisgin
Contributor

Since this is not related to multi-GPU, can you open a new issue, "Tacotron hangs on single GPU", please? Please attach the following:

  1. system information - Ubuntu version, GPU, driver version (nvidia-smi)
  2. TF container information
  3. log file
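
Most of the items in the list above can be gathered with a short script. This is a rough sketch under stated assumptions: the function names run_or_note and collect_report are hypothetical, the script assumes nvidia-smi is on the PATH when a driver is installed, and it cannot detect the container image tag, which still has to be added by hand.

```python
import platform
import shutil
import subprocess


def run_or_note(cmd):
    """Run cmd and return its combined output, or a short note if it cannot be run."""
    if shutil.which(cmd[0]) is None:
        return "{} not found".format(cmd[0])
    try:
        result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                                universal_newlines=True, timeout=30)
    except (subprocess.TimeoutExpired, OSError) as exc:
        return "failed: {}".format(exc)
    return result.stdout.strip()


def collect_report():
    """Collect OS, Python, driver/GPU, and TensorFlow details for an issue report."""
    report = {
        "os": platform.platform(),
        "python": platform.python_version(),
        # nvidia-smi prints the driver version and the GPU list.
        "nvidia_smi": run_or_note(["nvidia-smi"]),
    }
    try:
        import tensorflow as tf
        report["tensorflow"] = tf.__version__
    except Exception as exc:  # missing or broken install; record it instead of crashing
        report["tensorflow"] = "unavailable: {}".format(exc)
    return report


if __name__ == "__main__":
    for key, value in collect_report().items():
        print("== {} ==".format(key))
        print(value)
```

The training log itself is not collected here; it would be attached separately.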

@Shikherneo2

Was this problem ever resolved? I am facing the same issue as @lorinczb

@MinaJf

MinaJf commented Nov 5, 2020

I have the same issue. Any new ideas?

@swarajdalmia

swarajdalmia commented Apr 9, 2021

I'm facing a similar issue with tacotron-GST. Any idea how to resolve it?
