Support multi-node multi-GPU training. #63

Open · wants to merge 4 commits into base: master
Conversation

csukuangfj (Collaborator)

Tested with two machines, each with 3 GPUs.

Machine 1

export CUDA_VISIBLE_DEVICES="0,1,2"
cd egs/librispeech/ASR
./conformer_ctc/run-multi-node-multi-gpu.sh  --master-addr 10.177.74.141 --master-port 12356 --node-rank 0  --num-nodes 2 --full-libri 0
CUDA_VISIBLE_DEVICES: 0,1,2
num_gpus: 3
master_addr: 10.177.74.141
+ python -m torch.distributed.launch --use_env --nproc_per_node 3 --nnodes 2 --node_rank 0 --master_addr 10.177.74.141 --master_port 12356 /ceph-fj/fangjun/open-source/icefall-ctc-decoding/egs/librispeech/ASR/conformer_ctc/train.py --use-multi-node true --master-port 12356 --max-duration 200 --bucketing-sampler 1 --full-libri 0 --start-epoch 0 --num-epochs 50 --exp-dir conformer_ctc/exp2 --lang-dir data/lang_bpe_500
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
rank: 0, world_size: 6
rank: 2, world_size: 6
rank: 0, world_size: 6, local_rank: 0
rank: 2, world_size: 6, local_rank: 2
rank: 1, world_size: 6
rank: 1, world_size: 6, local_rank: 1
2021-09-30 01:04:30,684 INFO [train.py:659] (2/6) Training started
2021-09-30 01:04:30,684 INFO [train.py:660] (2/6) {'exp_dir': PosixPath('conformer_ctc/exp2'), 'lang_dir': PosixPath('data/lang_bpe_500'), 'best_train
_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 10, 'reset_interval': 200,
'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'use_feat_batchnorm': True, 'attention_dim': 512, 'nhead': 8, 'num_decoder_layers'
: 6, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'att_rate': 0.7, 'weight_decay': 1e-06, 'lr_factor': 5.0, 'warm_step': 80000, 'us
e_multi_node': True, 'world_size': 6, 'master_port': 12356, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 0, 'full_libri': False, 'feature_dir
': PosixPath('data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap
': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-09-30 01:04:30,684 INFO [train.py:659] (1/6) Training started
2021-09-30 01:04:30,685 INFO [train.py:660] (1/6) {'exp_dir': PosixPath('conformer_ctc/exp2'), 'lang_dir': PosixPath('data/lang_bpe_500'), 'best_train
_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 10, 'reset_interval': 200,
'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'use_feat_batchnorm': True, 'attention_dim': 512, 'nhead': 8, 'num_decoder_layers'
: 6, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'att_rate': 0.7, 'weight_decay': 1e-06, 'lr_factor': 5.0, 'warm_step': 80000, 'us
e_multi_node': True, 'world_size': 6, 'master_port': 12356, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 0, 'full_libri': False, 'feature_dir
': PosixPath('data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap
': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-09-30 01:04:30,685 INFO [train.py:659] (0/6) Training started
2021-09-30 01:04:30,685 INFO [train.py:660] (0/6) {'exp_dir': PosixPath('conformer_ctc/exp2'), 'lang_dir': PosixPath('data/lang_bpe_500'), 'best_train
_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 10, 'reset_interval': 200,
'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'use_feat_batchnorm': True, 'attention_dim': 512, 'nhead': 8, 'num_decoder_layers'
: 6, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'att_rate': 0.7, 'weight_decay': 1e-06, 'lr_factor': 5.0, 'warm_step': 80000, 'us
e_multi_node': True, 'world_size': 6, 'master_port': 12356, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 0, 'full_libri': False, 'feature_dir
': PosixPath('data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap
': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-09-30 01:04:31,051 INFO [lexicon.py:113] (0/6) Loading pre-compiled data/lang_bpe_500/Linv.pt
2021-09-30 01:04:31,052 INFO [lexicon.py:113] (1/6) Loading pre-compiled data/lang_bpe_500/Linv.pt
2021-09-30 01:04:31,052 INFO [lexicon.py:113] (2/6) Loading pre-compiled data/lang_bpe_500/Linv.pt
2021-09-30 01:04:31,093 INFO [train.py:674] (0/6) Device: cuda:0, rank: 0, local_rank: 0
2021-09-30 01:04:31,101 INFO [train.py:674] (1/6) Device: cuda:1, rank: 1, local_rank: 1
2021-09-30 01:04:31,102 INFO [train.py:674] (2/6) Device: cuda:2, rank: 2, local_rank: 2
2021-09-30 01:04:31,456 INFO [train.py:683] (0/6) About to create model
2021-09-30 01:04:31,459 INFO [train.py:683] (1/6) About to create model
2021-09-30 01:04:31,464 INFO [train.py:683] (2/6) About to create model
2021-09-30 01:05:20,482 INFO [asr_datamodule.py:158] (0/6) About to get train cuts
2021-09-30 01:05:20,482 INFO [asr_datamodule.py:319] (0/6) About to get train cuts
2021-09-30 01:05:21,417 INFO [asr_datamodule.py:158] (1/6) About to get train cuts
2021-09-30 01:05:21,418 INFO [asr_datamodule.py:319] (1/6) About to get train cuts
2021-09-30 01:05:22,326 INFO [asr_datamodule.py:158] (2/6) About to get train cuts
2021-09-30 01:05:22,326 INFO [asr_datamodule.py:319] (2/6) About to get train cuts
2021-09-30 01:05:25,998 INFO [asr_datamodule.py:161] (0/6) About to get Musan cuts
2021-09-30 01:05:27,005 INFO [asr_datamodule.py:161] (1/6) About to get Musan cuts
2021-09-30 01:05:27,847 INFO [asr_datamodule.py:161] (2/6) About to get Musan cuts
2021-09-30 01:05:28,554 INFO [asr_datamodule.py:164] (0/6) About to create train dataset
2021-09-30 01:05:28,555 INFO [asr_datamodule.py:216] (0/6) Using BucketingSampler.
2021-09-30 01:05:28,902 INFO [asr_datamodule.py:232] (0/6) About to create train dataloader
2021-09-30 01:05:28,902 INFO [asr_datamodule.py:245] (0/6) About to get dev cuts
2021-09-30 01:05:28,902 INFO [asr_datamodule.py:337] (0/6) About to get dev cuts
2021-09-30 01:05:29,131 INFO [asr_datamodule.py:256] (0/6) About to create dev dataset
2021-09-30 01:05:29,133 INFO [asr_datamodule.py:275] (0/6) About to create dev dataloader
2021-09-30 01:05:29,133 INFO [train.py:727] (0/6) epoch 0, learning rate 0
2021-09-30 01:05:29,582 INFO [asr_datamodule.py:164] (1/6) About to create train dataset
2021-09-30 01:05:29,583 INFO [asr_datamodule.py:216] (1/6) Using BucketingSampler.
2021-09-30 01:05:29,955 INFO [asr_datamodule.py:232] (1/6) About to create train dataloader
2021-09-30 01:05:29,955 INFO [asr_datamodule.py:245] (1/6) About to get dev cuts
2021-09-30 01:05:29,955 INFO [asr_datamodule.py:337] (1/6) About to get dev cuts
2021-09-30 01:05:30,187 INFO [asr_datamodule.py:256] (1/6) About to create dev dataset
2021-09-30 01:05:30,189 INFO [asr_datamodule.py:275] (1/6) About to create dev dataloader
2021-09-30 01:05:30,400 INFO [asr_datamodule.py:164] (2/6) About to create train dataset
2021-09-30 01:05:30,400 INFO [asr_datamodule.py:216] (2/6) Using BucketingSampler.
2021-09-30 01:05:30,753 INFO [asr_datamodule.py:232] (2/6) About to create train dataloader
2021-09-30 01:05:30,753 INFO [asr_datamodule.py:245] (2/6) About to get dev cuts
2021-09-30 01:05:30,754 INFO [asr_datamodule.py:337] (2/6) About to get dev cuts
2021-09-30 01:05:30,985 INFO [asr_datamodule.py:256] (2/6) About to create dev dataset
2021-09-30 01:05:30,987 INFO [asr_datamodule.py:275] (2/6) About to create dev dataloader
2021-09-30 01:06:45,592 INFO [train.py:542] (1/6) Epoch 0, batch 0, batch avg ctc loss 5.2093, batch avg att loss 1.2137, batch avg loss 2.4124, total
 avg ctc loss: 5.2093, total avg att loss: 1.2137, total avg loss: 2.4124, batch size: 14
2021-09-30 01:06:45,633 INFO [train.py:542] (0/6) Epoch 0, batch 0, batch avg ctc loss 5.1806, batch avg att loss 1.2520, batch avg loss 2.4306, total
 avg ctc loss: 5.1806, total avg att loss: 1.2520, total avg loss: 2.4306, batch size: 14
2021-09-30 01:06:45,641 INFO [distributed.py:607] (0/6) Reducer buckets have been rebuilt in this iteration.
2021-09-30 01:06:45,641 INFO [distributed.py:607] (1/6) Reducer buckets have been rebuilt in this iteration.
2021-09-30 01:06:45,647 INFO [train.py:542] (2/6) Epoch 0, batch 0, batch avg ctc loss 5.2156, batch avg att loss 1.1260, batch avg loss 2.3528, total
 avg ctc loss: 5.2156, total avg att loss: 1.1260, total avg loss: 2.3528, batch size: 14
2021-09-30 01:06:45,655 INFO [distributed.py:607] (2/6) Reducer buckets have been rebuilt in this iteration.

2021-09-30 01:15:43,095 INFO [train.py:542] (2/6) Epoch 0, batch 10, batch avg ctc loss 5.1741, batch avg att loss 1.1457, batch avg loss 2.3542, tota
l avg ctc loss: 5.1941, total avg att loss: 1.1826, total avg loss: 2.3861, batch size: 18
2021-09-30 01:15:43,188 INFO [train.py:542] (1/6) Epoch 0, batch 10, batch avg ctc loss 5.1619, batch avg att loss 1.2220, batch avg loss 2.4040, tota
l avg ctc loss: 5.2039, total avg att loss: 1.2096, total avg loss: 2.4079, batch size: 18
2021-09-30 01:15:43,249 INFO [train.py:542] (0/6) Epoch 0, batch 10, batch avg ctc loss 5.1668, batch avg att loss 1.2987, batch avg loss 2.4591, tota
l avg ctc loss: 5.2095, total avg att loss: 1.1882, total avg loss: 2.3946, batch size: 18
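
The rank / world_size / local_rank values printed above are consistent with the training script reading the environment variables that torch.distributed.launch exports when --use_env is given. A minimal sketch of that pattern (an illustration only, not the actual train.py, which also takes its own options such as --master-port):

import os

import torch
import torch.distributed as dist

def setup_ddp():
    # The launcher exports RANK, WORLD_SIZE and LOCAL_RANK for each worker.
    rank = int(os.environ["RANK"])              # global rank, 0..5 across both nodes
    world_size = int(os.environ["WORLD_SIZE"])  # 2 nodes x 3 GPUs = 6
    local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node, 0..2
    print(f"rank: {rank}, world_size: {world_size}, local_rank: {local_rank}")

    # MASTER_ADDR and MASTER_PORT are exported by the launcher as well,
    # so the default env:// initialization picks them up.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    return rank, world_size, local_rank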

Machine 2

export CUDA_VISIBLE_DEVICES="0,1,2"
cd egs/librispeech/ASR
./conformer_ctc/run-multi-node-multi-gpu.sh  --master-addr 10.177.74.141 --master-port 12356 --node-rank 1  --num-nodes 2 --full-libri 0
CUDA_VISIBLE_DEVICES: 0,1,2
num_gpus: 3
master_addr: 10.177.74.141
+ python -m torch.distributed.launch --use_env --nproc_per_node 3 --nnodes 2 --node_rank 1 --master_addr 10.177.74.141 --master_port 12356 /ceph-fj/fangjun/open-source/icefall-ctc-decoding/egs/librispeech/ASR/conformer_ctc/train.py --use-multi-node true --master-port 12356 --max-duration 200 --bucketing-sampler 1 --full-libri 0 --start-epoch 0 --num-epochs 50 --exp-dir conformer_ctc/exp2 --lang-dir data/lang_bpe_500
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
rank: 3, world_size: 6
rank: 4, world_size: 6
rank: 3, world_size: 6, local_rank: 0
rank: 4, world_size: 6, local_rank: 1
rank: 5, world_size: 6
rank: 5, world_size: 6, local_rank: 2
2021-09-30 01:04:30,713 INFO [train.py:659] (4/6) Training started
2021-09-30 01:04:30,713 INFO [train.py:660] (4/6) {'exp_dir': PosixPath('conformer_ctc/exp2'), 'lang_dir': PosixPath('data/lang_bpe_500'), 'best_train
_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 10, 'reset_interval': 200,
'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'use_feat_batchnorm': True, 'attention_dim': 512, 'nhead': 8, 'num_decoder_layers'
: 6, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'att_rate': 0.7, 'weight_decay': 1e-06, 'lr_factor': 5.0, 'warm_step': 80000, 'us
e_multi_node': True, 'world_size': 6, 'master_port': 12356, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 0, 'full_libri': False, 'feature_dir
': PosixPath('data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap
': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-09-30 01:04:30,726 INFO [train.py:659] (3/6) Training started
2021-09-30 01:04:30,726 INFO [train.py:660] (3/6) {'exp_dir': PosixPath('conformer_ctc/exp2'), 'lang_dir': PosixPath('data/lang_bpe_500'), 'best_train
_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 10, 'reset_interval': 200,
'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'use_feat_batchnorm': True, 'attention_dim': 512, 'nhead': 8, 'num_decoder_layers'
: 6, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'att_rate': 0.7, 'weight_decay': 1e-06, 'lr_factor': 5.0, 'warm_step': 80000, 'us
e_multi_node': True, 'world_size': 6, 'master_port': 12356, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 0, 'full_libri': False, 'feature_dir
': PosixPath('data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap
': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-09-30 01:04:30,739 INFO [train.py:659] (5/6) Training started

2021-09-30 01:04:30,739 INFO [train.py:660] (5/6) {'exp_dir': PosixPath('conformer_ctc/exp2'), 'lang_dir': PosixPath('data/lang_bpe_500'), 'best_train
_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 10, 'reset_interval': 200,
'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'use_feat_batchnorm': True, 'attention_dim': 512, 'nhead': 8, 'num_decoder_layers'
: 6, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'att_rate': 0.7, 'weight_decay': 1e-06, 'lr_factor': 5.0, 'warm_step': 80000, 'us
e_multi_node': True, 'world_size': 6, 'master_port': 12356, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 0, 'full_libri': False, 'feature_dir
': PosixPath('data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap
': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-09-30 01:04:31,132 INFO [lexicon.py:113] (4/6) Loading pre-compiled data/lang_bpe_500/Linv.pt
2021-09-30 01:04:31,142 INFO [lexicon.py:113] (3/6) Loading pre-compiled data/lang_bpe_500/Linv.pt
2021-09-30 01:04:31,166 INFO [lexicon.py:113] (5/6) Loading pre-compiled data/lang_bpe_500/Linv.pt

2021-09-30 01:04:31,184 INFO [train.py:674] (4/6) Device: cuda:1, rank: 4, local_rank: 1
2021-09-30 01:04:31,195 INFO [train.py:674] (3/6) Device: cuda:0, rank: 3, local_rank: 0
2021-09-30 01:04:31,212 INFO [train.py:674] (5/6) Device: cuda:2, rank: 5, local_rank: 2
2021-09-30 01:04:31,532 INFO [train.py:683] (3/6) About to create model
2021-09-30 01:04:31,534 INFO [train.py:683] (4/6) About to create model
2021-09-30 01:04:31,538 INFO [train.py:683] (5/6) About to create model
2021-09-30 01:05:23,157 INFO [asr_datamodule.py:158] (3/6) About to get train cuts
2021-09-30 01:05:23,157 INFO [asr_datamodule.py:158] (4/6) About to get train cuts
2021-09-30 01:05:23,157 INFO [asr_datamodule.py:319] (3/6) About to get train cuts
2021-09-30 01:05:23,157 INFO [asr_datamodule.py:319] (4/6) About to get train cuts
2021-09-30 01:05:23,157 INFO [asr_datamodule.py:158] (5/6) About to get train cuts
2021-09-30 01:05:23,157 INFO [asr_datamodule.py:319] (5/6) About to get train cuts
2021-09-30 01:05:29,252 INFO [asr_datamodule.py:161] (4/6) About to get Musan cuts
2021-09-30 01:05:29,258 INFO [asr_datamodule.py:161] (5/6) About to get Musan cuts
2021-09-30 01:05:29,290 INFO [asr_datamodule.py:161] (3/6) About to get Musan cuts

2021-09-30 01:05:32,202 INFO [asr_datamodule.py:164] (4/6) About to create train dataset
2021-09-30 01:05:32,202 INFO [asr_datamodule.py:216] (4/6) Using BucketingSampler.
2021-09-30 01:05:32,213 INFO [asr_datamodule.py:164] (5/6) About to create train dataset
2021-09-30 01:05:32,214 INFO [asr_datamodule.py:216] (5/6) Using BucketingSampler.
2021-09-30 01:05:32,233 INFO [asr_datamodule.py:164] (3/6) About to create train dataset
2021-09-30 01:05:32,233 INFO [asr_datamodule.py:216] (3/6) Using BucketingSampler.
2021-09-30 01:05:32,608 INFO [asr_datamodule.py:232] (5/6) About to create train dataloader
2021-09-30 01:05:32,608 INFO [asr_datamodule.py:245] (5/6) About to get dev cuts
2021-09-30 01:05:32,608 INFO [asr_datamodule.py:337] (5/6) About to get dev cuts
2021-09-30 01:05:32,610 INFO [asr_datamodule.py:232] (4/6) About to create train dataloader
2021-09-30 01:05:32,610 INFO [asr_datamodule.py:245] (4/6) About to get dev cuts
2021-09-30 01:05:32,610 INFO [asr_datamodule.py:337] (4/6) About to get dev cuts
2021-09-30 01:05:32,638 INFO [asr_datamodule.py:232] (3/6) About to create train dataloader
2021-09-30 01:05:32,638 INFO [asr_datamodule.py:245] (3/6) About to get dev cuts
2021-09-30 01:05:32,638 INFO [asr_datamodule.py:337] (3/6) About to get dev cuts
2021-09-30 01:05:32,896 INFO [asr_datamodule.py:256] (5/6) About to create dev dataset
2021-09-30 01:05:32,898 INFO [asr_datamodule.py:275] (5/6) About to create dev dataloader
2021-09-30 01:05:32,902 INFO [asr_datamodule.py:256] (4/6) About to create dev dataset
2021-09-30 01:05:32,904 INFO [asr_datamodule.py:275] (4/6) About to create dev dataloader
2021-09-30 01:05:32,938 INFO [asr_datamodule.py:256] (3/6) About to create dev dataset
2021-09-30 01:05:32,940 INFO [asr_datamodule.py:275] (3/6) About to create dev dataloader

2021-09-30 01:06:45,761 INFO [train.py:542] (5/6) Epoch 0, batch 0, batch avg ctc loss 5.2337, batch avg att loss 1.3110, batch avg loss 2.4878, total
 avg ctc loss: 5.2337, total avg att loss: 1.3110, total avg loss: 2.4878, batch size: 14
2021-09-30 01:06:45,766 INFO [train.py:542] (3/6) Epoch 0, batch 0, batch avg ctc loss 5.1945, batch avg att loss 1.2462, batch avg loss 2.4307, total
 avg ctc loss: 5.1945, total avg att loss: 1.2462, total avg loss: 2.4307, batch size: 14
2021-09-30 01:06:45,768 INFO [train.py:542] (4/6) Epoch 0, batch 0, batch avg ctc loss 5.1791, batch avg att loss 1.1169, batch avg loss 2.3356, total
 avg ctc loss: 5.1791, total avg att loss: 1.1169, total avg loss: 2.3356, batch size: 14
2021-09-30 01:06:45,775 INFO [distributed.py:607] (3/6) Reducer buckets have been rebuilt in this iteration.
2021-09-30 01:06:45,788 INFO [distributed.py:607] (4/6) Reducer buckets have been rebuilt in this iteration.
2021-09-30 01:06:45,800 INFO [distributed.py:607] (5/6) Reducer buckets have been rebuilt in this iteration.
2021-09-30 01:15:43,324 INFO [train.py:542] (5/6) Epoch 0, batch 10, batch avg ctc loss 5.1695, batch avg att loss 1.2584, batch avg loss 2.4317, tota
l avg ctc loss: 5.2067, total avg att loss: 1.1878, total avg loss: 2.3935, batch size: 18
2021-09-30 01:15:43,408 INFO [train.py:542] (4/6) Epoch 0, batch 10, batch avg ctc loss 5.1715, batch avg att loss 1.0996, batch avg loss 2.3212, tota
l avg ctc loss: 5.1985, total avg att loss: 1.1668, total avg loss: 2.3763, batch size: 18
2021-09-30 01:15:43,481 INFO [train.py:542] (3/6) Epoch 0, batch 10, batch avg ctc loss 5.1843, batch avg att loss 1.2252, batch avg loss 2.4130, tota
l avg ctc loss: 5.2056, total avg att loss: 1.2082, total avg loss: 2.4074, batch size: 18
2021-09-30 01:24:43,732 INFO [train.py:542] (5/6) Epoch 0, batch 20, batch avg ctc loss 5.0552, batch avg att loss 1.1863, batch avg loss 2.3470, tota
l avg ctc loss: 5.1764, total avg att loss: 1.1698, total avg loss: 2.3718, batch size: 46
2021-09-30 01:24:43,871 INFO [train.py:542] (4/6) Epoch 0, batch 20, batch avg ctc loss 5.0398, batch avg att loss 1.3006, batch avg loss 2.4224, tota
l avg ctc loss: 5.1643, total avg att loss: 1.1882, total avg loss: 2.3810, batch size: 47
2021-09-30 01:24:44,208 INFO [train.py:542] (3/6) Epoch 0, batch 20, batch avg ctc loss 5.0762, batch avg att loss 1.2196, batch avg loss 2.3766, tota
l avg ctc loss: 5.1710, total avg att loss: 1.1864, total avg loss: 2.3818, batch size: 47

csukuangfj (Collaborator, Author) commented Sep 30, 2021

Looks like each node MUST provide an equal number of GPUs for training.
That is, you cannot have 1 GPU on node 1 and 2 GPUs on node 2.

[EDITED]: This is not true for PyTorch 1.9.0, which uses elastic APIs for DDP. See https://pytorch.org/docs/stable/elastic/run.html
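
My understanding of the reason (not checked against the launcher source): the static rendezvous used by torch.distributed.launch derives global ranks from a single nproc_per_node value, roughly as below, so the ranks 0..world_size-1 are covered exactly once only if every node contributes the same number of processes.

# Illustration only, not the launcher's actual code.
def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    return node_rank * nproc_per_node + local_rank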

danpovey (Collaborator) commented Sep 30, 2021 via email

csukuangfj (Collaborator, Author) commented Sep 30, 2021

> Cool!
> Is it significantly slower per minibatch, vs. just using one machine?

A little slower than a single machine, as expected due to communication overhead.

The training speed depends on how the machines are connected.


Single machine (baseline)

The following shows part of the training log for single-node multi-GPU (3 GPUs) training.

You can see that it takes 7 to 8 seconds per 10 batches.

2021-09-20 18:58:49,564 INFO [train-5000.py:686] (0/3) epoch 0, learning rate 0
2021-09-20 18:58:59,491 INFO [train-5000.py:506] (0/3) Epoch 0, batch 0, batch avg ctc loss 7.6195, batch avg att loss 1.0745, batch avg loss 3.0380, total avg ctc loss: 7.6195, total avg att loss: 1.0745, total avg loss: 3.0380, batch size: 14
2021-09-20 18:58:59,511 INFO [distributed.py:607] (0/3) Reducer buckets have been rebuilt in this iteration.
2021-09-20 18:59:06,828 INFO [train-5000.py:506] (0/3) Epoch 0, batch 10, batch avg ctc loss 7.7028, batch avg att loss 1.0299, batch avg loss 3.0318, total avg ctc loss: 7.6723, total avg att loss: 1.0413, total avg loss: 3.0306, batch size: 19
2021-09-20 18:59:14,322 INFO [train-5000.py:506] (0/3) Epoch 0, batch 20, batch avg ctc loss 7.6804, batch avg att loss 1.1422, batch avg loss 3.1036, total avg ctc loss: 7.6598, total avg att loss: 1.0417, total avg loss: 3.0271, batch size: 52
2021-09-20 18:59:21,679 INFO [train-5000.py:506] (0/3) Epoch 0, batch 30, batch avg ctc loss 7.6165, batch avg att loss 1.1222, batch avg loss 3.0705, total avg ctc loss: 7.6440, total avg att loss: 1.0160, total avg loss: 3.0044, batch size: 16
2021-09-20 18:59:28,917 INFO [train-5000.py:506] (0/3) Epoch 0, batch 40, batch avg ctc loss 7.4494, batch avg att loss 0.9222, batch avg loss 2.8803, total avg ctc loss: 7.6286, total avg att loss: 1.0163, total avg loss: 3.0000, batch size: 11
2021-09-20 18:59:36,297 INFO [train-5000.py:506] (0/3) Epoch 0, batch 50, batch avg ctc loss 7.5320, batch avg att loss 1.0745, batch avg loss 3.0117, total avg ctc loss: 7.6024, total avg att loss: 1.0134, total avg loss: 2.9901, batch size: 16

Multi-node multi-GPU (first try)

  • node 1, IP: (10.177.74.xxx) with netmask (255.255.255.192), 3 GPUs
  • node 2, IP: (10.25.134.xxx) with netmask (255.255.255.224), 3 GPUs

I was using PyTorch 1.7.1. Unfortunately, there are no console logs, so we can only get the time statistics from the tensorboard log. Please see https://tensorboard.dev/experiment/8SxVEHZnSweuDW8rdxTw4g/#scalars&_smoothingWeight=0

You can see from the timestamps in the tensorboard log that it takes about 4 to 5 minutes per 10 batches,
which is significantly slower than single-node multi-GPU training.

I think one of the reasons is that the two cloud machines are not in the same region, so the communication overhead dominates the training time.

Multi-node multi-GPU (second try)

  • node 1: IP (10.25.130.xxx) with netmask (255.255.255.224), 3 GPUs
  • node 2: IP (10.25.134.xxx) with netmask (255.255.255.224), 3 GPUs

I was using PyTorch 1.7.1, so there are still no console logs. The tensorboard log can be found at https://tensorboard.dev/experiment/ebNLdWt5S96HbxWb39kTdg/

You can see from the tensorboard log that it takes 8 to 9 seconds per 10 batches, which is only a bit slower
than single-node multi-GPU training (7 to 8 seconds per 10 batches).

Multi-node multi-GPU (third try)

I have switched to PyTorch 1.9.0, which uses the elastic launcher (see https://pytorch.org/docs/stable/elastic/run.html).
It allows different nodes to use different numbers of GPUs.

  • node 1: IP (10.25.130.xxx) with netmask (255.255.255.224), 4 GPUs
  • node 2: IP (10.25.134.xxx) with netmask (255.255.255.224), 3 GPUs

There are still no console logs. The tensorboard logs are available at https://tensorboard.dev/experiment/O7NqpWADTJmZTN2h4KOw2w/ and
https://tensorboard.dev/experiment/8cQNaXY6Tq6tNudu6z3BpQ/#scalars&_smoothingWeight=0

You can see it takes about 8 to 9 seconds per 10 batches, comparable to the second try.

(Note: All runs use the same arguments, i.e., --max-duration=200 and --bucketing-sampler=1, so the training times are comparable.)

csukuangfj (Collaborator, Author)

The following are the training logs from two machines (using PyTorch 1.9.0).

machine 1 (the master node)

$ export CUDA_VISIBLE_DEVICES="0,1,2,3"
$ ./conformer_ctc/run-multi-node-multi-gpu.sh --master-addr 10.25.130.9 --master-port 12356 --node-rank 0 --num-nodes 2 --full-libri 1
CUDA_VISIBLE_DEVICES: 0,1,2,3
num_gpus: 4
master_addr: 10.25.130.9
+ python -m torch.distributed.launch --use_env --nproc_per_node 4 --nnodes 2 --node_rank 0 --master_addr 10.25.130.9 --master_port 12356 /ceph-fj/fangjun/open-source/icefall-ctc-decoding/egs/librispeech/ASR/conformer_ctc/train.py --use-multi-node true --master-port 12356 --max-duration 200 --bucketing-sampler 1 --full-libri 1 --start-epoch 0 --num-epochs 2 --exp-dir conformer_ctc/exp3 --lang-dir data/lang_bpe_500
/ceph-fj/fangjun/py38-1.9.0/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : /ceph-fj/fangjun/open-source/icefall-ctc-decoding/egs/librispeech/ASR/conformer_ctc/train.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 4
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 10.25.130.9:12356
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_8wxw7w29/none_bhplv3z2
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/ceph-fj/fangjun/py38-1.9.0/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=10.25.130.9
  master_port=12356
  group_rank=0
  group_world_size=2
  local_ranks=[0, 1, 2, 3]
  role_ranks=[0, 1, 2, 3]
  global_ranks=[0, 1, 2, 3]
  role_world_sizes=[7, 7, 7, 7]
  global_world_sizes=[7, 7, 7, 7]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_8wxw7w29/none_bhplv3z2/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_8wxw7w29/none_bhplv3z2/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_8wxw7w29/none_bhplv3z2/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_8wxw7w29/none_bhplv3z2/attempt_0/3/error.json
rank: 1, world_size: 7
rank: 3, world_size: 7
rank: 0, world_size: 7
rank: 2, world_size: 7
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)

/ceph-fj/fangjun/py38-1.9.0/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
/ceph-fj/fangjun/py38-1.9.0/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
/ceph-fj/fangjun/py38-1.9.0/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
/ceph-fj/fangjun/py38-1.9.0/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
WARNING:root:To perform mix, energy must be non-zero and non-negative (got 0.0).  MonoCut with id "f15a2799-a890-64e2-deac-719041038dc4" will not be mixed in.
2021-09-30 11:23:00,386 WARNING [cut.py:1694] (1/7) To perform mix, energy must be non-zero and non-negative (got 0.0).  MonoCut with id "f15a2799-a890-64e2-deac-719041038dc4" will not be mixed in.

It is strange that it prints the following log message from logging.warn (in lhotse):

2021-09-30 11:23:00,386 WARNING [cut.py:1694] (1/7) To perform mix, energy must be non-zero and non-negative (got 0.0).  MonoCut with id "f15a2799-a890-64e2-deac-719041038dc4" will not be mixed in.

but output from logging.info is not displayed.
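
One way to debug this (just a suggestion, not something verified in this PR) is to print the root logger's effective level and handlers inside a worker process, to see whether the launcher or an imported library has raised the level above INFO or replaced the handlers set up by train.py:

import logging

# Diagnostic snippet: run inside a worker process after all imports.
root = logging.getLogger()
print("effective level:", logging.getLevelName(root.getEffectiveLevel()))
for handler in root.handlers:
    print("handler:", handler, "level:", logging.getLevelName(handler.level))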

machine 2

$ export CUDA_VISIBLE_DEVICES="0,1,2"
$ ./conformer_ctc/run-multi-node-multi-gpu.sh --master-addr 10.25.130.9 --master-port 12356 --node-rank 1 --num-nodes 2 --full-libri 1
CUDA_VISIBLE_DEVICES: 0,1,2
num_gpus: 3
master_addr: 10.25.130.9
+ python -m torch.distributed.launch --use_env --nproc_per_node 3 --nnodes 2 --node_rank 1 --master_addr 10.25.130.9 --master_port 12356 /ceph-fj/fangjun/open-source/icefall-ctc-decoding/egs/librispeech/ASR/conformer_ctc/train.py --use-multi-node true --master-port 12356 --max-duration 200 --bucketing-sampler 1 --full-libri 1 --start-epoch 0 --num-epochs 2 --exp-dir conformer_ctc/exp3 --lang-dir data/lang_bpe_500

/ceph-fj/fangjun/py38-1.9.0/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************

INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : /ceph-fj/fangjun/open-source/icefall-ctc-decoding/egs/librispeech/ASR/conformer_ctc/train.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 3
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 10.25.130.9:12356
  rdzv_configs     : {'rank': 1, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_6dd43y8s/none__vncbcyg
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/ceph-fj/fangjun/py38-1.9.0/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.

  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=10.25.130.9
  master_port=12356
  group_rank=1
  group_world_size=2
  local_ranks=[0, 1, 2]
  role_ranks=[4, 5, 6]
  global_ranks=[4, 5, 6]
  role_world_sizes=[7, 7, 7]
  global_world_sizes=[7, 7, 7]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_6dd43y8s/none__vncbcyg/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_6dd43y8s/none__vncbcyg/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_6dd43y8s/none__vncbcyg/attempt_0/2/error.json
rank: 5, world_size: 7
rank: 4, world_size: 7
rank: 6, world_size: 7
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
/ceph-fj/fangjun/py38-1.9.0/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)

/ceph-fj/fangjun/py38-1.9.0/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
/ceph-fj/fangjun/py38-1.9.0/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)

WARNING:root:To perform mix, energy must be non-zero and non-negative (got 0.0).  MonoCut with id "faa50308-b5c0-9c48-e838-46158e3e7fc5" will not be mixed in.
2021-09-30 11:20:47,665 WARNING [cut.py:1694] (6/7) To perform mix, energy must be non-zero and non-negative (got 0.0).  MonoCut with id "faa50308-b5c0-9c48-e838-46158e3e7fc5" will not be mixed in.
WARNING:root:To perform mix, energy must be non-zero and non-negative (got 0.0).  MonoCut with id "1757138c-78db-05f7-cf99-ca8b0a9c5fa1" will not be mixed in.
2021-09-30 11:49:43,733 WARNING [cut.py:1694] (6/7) To perform mix, energy must be non-zero and non-negative (got 0.0).  MonoCut with id "1757138c-78db-05f7-cf99-ca8b0a9c5fa1" will not be mixed in.

csukuangfj (Collaborator, Author)

I managed to get the console log by changing logging.info to logging.warning.

It turns out the training speed is not stable.

At the beginning of epoch 1, it takes about 40 seconds per 50 batches. See the log below:

2021-09-30 17:34:05,265 WARNING [train.py:546] (0/7) Epoch 1, batch 0, batch avg ctc loss 0.4764, batch avg att loss 0.7291, batch avg loss 0.6533, total avg ctc loss: 0.4764, total avg att loss: 0.7291, total avg loss: 0.6533, batch size: 10
WARNING:root:Epoch 1, batch 50, batch avg ctc loss 0.4471, batch avg att loss 0.7007, batch avg loss 0.6246, total avg ctc loss: 0.4770, total avg att loss: 0.6918, total avg loss: 0.6274, batch size: 10
2021-09-30 17:34:45,024 WARNING [train.py:546] (0/7) Epoch 1, batch 50, batch avg ctc loss 0.4471, batch avg att loss 0.7007, batch avg loss 0.6246, total avg ctc loss: 0.4770, total avg att loss: 0.6918, total avg loss: 0.6274, batch size: 10
WARNING:root:Epoch 1, batch 100, batch avg ctc loss 0.5492, batch avg att loss 0.7107, batch avg loss 0.6623, total avg ctc loss: 0.4816, total avg att loss: 0.6920, total avg loss: 0.6289, batch size: 14
2021-09-30 17:35:25,843 WARNING [train.py:546] (0/7) Epoch 1, batch 100, batch avg ctc loss 0.5492, batch avg att loss 0.7107, batch avg loss 0.6623, total avg ctc loss: 0.4816, total avg att loss: 0.6920, total avg loss: 0.6289, batch size: 14
WARNING:root:Epoch 1, batch 150, batch avg ctc loss 0.4808, batch avg att loss 0.6649, batch avg loss 0.6097, total avg ctc loss: 0.4849, total avg att loss: 0.6941, total avg loss: 0.6314, batch size: 8
2021-09-30 17:36:04,990 WARNING [train.py:546] (0/7) Epoch 1, batch 150, batch avg ctc loss 0.4808, batch avg att loss 0.6649, batch avg loss 0.6097, total avg ctc loss: 0.4849, total avg att loss: 0.6941, total avg loss: 0.6314, batch size: 8
WARNING:root:Epoch 1, batch 200, batch avg ctc loss 0.5185, batch avg att loss 0.7521, batch avg loss 0.6821, total avg ctc loss: 0.4836, total avg att loss: 0.6958, total avg loss: 0.6321, batch size: 9
2021-09-30 17:36:46,014 WARNING [train.py:546] (0/7) Epoch 1, batch 200, batch avg ctc loss 0.5185, batch avg att loss 0.7521, batch avg loss 0.6821, total avg ctc loss: 0.4836, total avg att loss: 0.6958, total avg loss: 0.6321, batch size: 9

In later batches, the time needed more than doubles. See the log below:

2021-09-30 17:58:56,533 WARNING [train.py:546] (0/7) Epoch 1, batch 950, batch avg ctc loss 0.5290, batch avg att loss 0.7002, batch avg loss 0.6488, total avg ctc loss: 0.4581, total avg att loss: 0.6504, total avg loss: 0.5927, batch size: 15
WARNING:root:Epoch 1, batch 1000, batch avg ctc loss 0.3421, batch avg att loss 0.5315, batch avg loss 0.4747, total avg ctc loss: 0.4558, total avg att loss: 0.6478, total avg loss: 0.5902, batch size: 8
2021-09-30 18:01:25,202 WARNING [train.py:546] (0/7) Epoch 1, batch 1000, batch avg ctc loss 0.3421, batch avg att loss 0.5315, batch avg loss 0.4747, total avg ctc loss: 0.4558, total avg att loss: 0.6478, total avg loss: 0.5902, batch size: 8
WARNING:root:Epoch 1, batch 1050, batch avg ctc loss 0.3807, batch avg att loss 0.5738, batch avg loss 0.5159, total avg ctc loss: 0.4326, total avg att loss: 0.6181, total avg loss: 0.5625, batch size: 9
2021-09-30 18:03:47,349 WARNING [train.py:546] (0/7) Epoch 1, batch 1050, batch avg ctc loss 0.3807, batch avg att loss 0.5738, batch avg loss 0.5159, total avg ctc loss: 0.4326, total avg att loss: 0.6181, total avg loss: 0.5625, batch size: 9
WARNING:root:Epoch 1, batch 1100, batch avg ctc loss 0.5841, batch avg att loss 0.6984, batch avg loss 0.6641, total avg ctc loss: 0.4385, total avg att loss: 0.6205, total avg loss: 0.5659, batch size: 10
2021-09-30 18:05:30,333 WARNING [train.py:546] (0/7) Epoch 1, batch 1100, batch avg ctc loss 0.5841, batch avg att loss 0.6984, batch avg loss 0.6641, total avg ctc loss: 0.4385, total avg att loss: 0.6205, total avg loss: 0.5659, batch size: 10
WARNING:root:Epoch 1, batch 1150, batch avg ctc loss 0.3946, batch avg att loss 0.6428, batch avg loss 0.5683, total avg ctc loss: 0.4391, total avg att loss: 0.6164, total avg loss: 0.5632, batch size: 9
2021-09-30 18:07:17,134 WARNING [train.py:546] (0/7) Epoch 1, batch 1150, batch avg ctc loss 0.3946, batch avg att loss 0.6428, batch avg loss 0.5683, total avg ctc loss: 0.4391, total avg att loss: 0.6164, total avg loss: 0.5632, batch size: 9
WARNING:root:Epoch 1, batch 1200, batch avg ctc loss 0.4343, batch avg att loss 0.5765, batch avg loss 0.5339, total avg ctc loss: 0.4409, total avg att loss: 0.6142, total avg loss: 0.5622, batch size: 9
2021-09-30 18:09:12,466 WARNING [train.py:546] (0/7) Epoch 1, batch 1200, batch avg ctc loss 0.4343, batch avg att loss 0.5765, batch avg loss 0.5339, total avg ctc loss: 0.4409, total avg att loss: 0.6142, total avg loss: 0.5622, batch size: 9

danpovey (Collaborator) commented Sep 30, 2021 via email

csukuangfj (Collaborator, Author)

We have to figure out how to set the console log level to INFO instead of WARNING.
There must be an option for it in the logging module.

I found this (https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/utils/logging.py#L37):

def _setup_logger(name: Optional[str] = None):
    log = logging.getLogger(name)
    log.setLevel(os.environ.get("LOGLEVEL", get_log_level()))
    return log

But using

export LOGLEVEL=INFO

does not help.


(Maybe we have to ask them to give us machines on the same sub-network).

Will do this after the holiday.

csukuangfj (Collaborator, Author)

RE the log, how about implementing our own using print? We can print the log both to the console and to a file.
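
An alternative to raw print (a sketch of the general idea, not code from this PR or from icefall) is to configure the root logger explicitly with both a console handler and a per-rank file handler, forcing the level back to INFO:

import logging

def setup_logger(log_filename: str, rank: int) -> None:
    # The filename convention and format string here are made up for
    # illustration; the handler setup is the point.
    formatter = logging.Formatter(
        f"%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] ({rank}) %(message)s"
    )
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    root.handlers.clear()  # drop whatever an earlier basicConfig installed

    file_handler = logging.FileHandler(f"{log_filename}-rank-{rank}.txt")
    file_handler.setFormatter(formatter)
    root.addHandler(file_handler)

    console = logging.StreamHandler()
    console.setFormatter(formatter)
    root.addHandler(console)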

pzelasko (Collaborator)

How will we handle Lhotse’s use of the logging module then (or any third-party lib that also uses logging)? Maybe we can reach out to somebody from the PyTorch team to help us get to the source of this problem?

francoishernandez

Hey there,

> A little slower than a single machine, as expected due to communication overhead.

Not sure if it would be relevant here, but have you thought of gradient accumulation to reduce the effect of communication overhead? (See Fig. 2 in Scaling Neural Machine Translation.)
We found it quite useful for reducing the impact of communication overhead when experimenting with multi-node setups in OpenNMT-py a while back.
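
For reference, with DDP this is usually done by skipping the gradient all-reduce on intermediate micro-batches via no_sync(), so the inter-node communication happens once per optimizer step instead of once per batch. A rough sketch (compute_loss and the surrounding names are placeholders, not code from this PR):

import contextlib

def accumulate_and_step(model, optimizer, micro_batches, accum_steps: int):
    # `model` is assumed to be wrapped in DistributedDataParallel.
    optimizer.zero_grad()
    for i, batch in enumerate(micro_batches):
        last = i == accum_steps - 1
        # no_sync() suppresses the all-reduce; only the last micro-batch syncs.
        ctx = contextlib.nullcontext() if last else model.no_sync()
        with ctx:
            loss = compute_loss(model, batch) / accum_steps  # placeholder loss fn
            loss.backward()
        if last:
            break
    optimizer.step()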

danpovey (Collaborator) commented Oct 2, 2021

@francoishernandez That's interesting, but I think for our purposes the communication does not really dominate (since the models are not so huge), and we probably need to focus on core modeling stuff, and things like real-time decoding, for now.

csukuangfj changed the title from "WIP: Support multi-node multi-GPU training." to "Support multi-node multi-GPU training." on Oct 13, 2021
csukuangfj (Collaborator, Author)

Multi-node multi-GPU training is working in general, though it takes about 9 days to finish 50 epochs with 7 GPUs, a bit longer than single-node multi-GPU training (4 GPUs), which takes about 5 days.

The WER for multi-node multi-GPU training is comparable with that of single-node multi-GPU training.

The WERs for this pull request are:

  • test-clean: 2.57
  • test-other: 5.77
  • using --epoch 49, --avg 34.

Decoding log

Screen Shot 2021-10-13 at 12 47 08 PM

training log of the node with rank 0

Screen Shot 2021-10-13 at 12 48 11 PM
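
For context, --epoch 49 --avg 34 is understood here to mean that decoding uses the average of the checkpoints from epoch 16 through epoch 49. Roughly (a sketch, not the actual icefall helper; the "model" key is an assumption about the checkpoint layout):

import torch

def average_checkpoints(filenames):
    # Average the "model" state_dicts stored in the given checkpoint files.
    n = len(filenames)
    avg = torch.load(filenames[0], map_location="cpu")["model"]
    for f in filenames[1:]:
        state = torch.load(f, map_location="cpu")["model"]
        for k in avg:
            avg[k] += state[k]
    for k in avg:
        if avg[k].is_floating_point():
            avg[k] /= n
        else:
            # Integer buffers, e.g. BatchNorm's num_batches_tracked.
            avg[k] = torch.div(avg[k], n, rounding_mode="floor")
    return avg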

danpovey (Collaborator)

Cool!
It's possible that the nodes you were using were not on the same sub-network or something like that, which would explain the longer time. In future we can perhaps look for ways to make this faster, but I'd say for now let's focus on other priorities.

csukuangfj (Collaborator, Author)

Yes, I am writing decoding scripts for #54.
