Support multi-node multi-GPU training. #63
base: master
Conversation
Looks like each node MUST provide an equal number of GPUs for training. [EDITED]: This is not true for PyTorch 1.9.0, which uses elastic APIs for DDP. See https://pytorch.org/docs/stable/elastic/run.html
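For reference, here is a minimal sketch (not this PR's actual script) of how a training entry point can pick up the rank information that the elastic launcher exports, which is what allows nodes to contribute different numbers of GPUs. The host address, port, and file names in the comments are placeholders, not the ones used in these experiments.

```python
import os

import torch
import torch.distributed as dist


def init_ddp():
    # torch.distributed.run (PyTorch >= 1.9) sets these variables for every
    # worker it spawns, even when nodes contribute different GPU counts.
    local_rank = int(os.environ["LOCAL_RANK"])
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    torch.cuda.set_device(local_rank)
    # MASTER_ADDR and MASTER_PORT are also exported, so env:// initialization works.
    dist.init_process_group(backend="nccl", init_method="env://")
    return rank, world_size, local_rank


# Hypothetical launch commands, one per node (note the different --nproc_per_node):
#   node A: python -m torch.distributed.run --nnodes=2 --nproc_per_node=1 \
#             --rdzv_backend=c10d --rdzv_endpoint=10.0.0.1:29400 train.py
#   node B: python -m torch.distributed.run --nnodes=2 --nproc_per_node=2 \
#             --rdzv_backend=c10d --rdzv_endpoint=10.0.0.1:29400 train.py
```
|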
Cool!
Is it significantly slower per minibatch, vs. just using one machine?
…On Thu, Sep 30, 2021 at 8:55 AM Fangjun Kuang ***@***.***> wrote:
Looks like each node *MUST* provide equal number of GPUs for training.
That is, you cannot have 1 GPU on node 1 and 2 GPUs on node 2.
|
A little slower than a single machine, as expected, due to communication overhead. The training speed depends on how the machines are connected.

Single machine (baseline)

The following shows part of the training log for single-node multi-GPU (3 GPUs) training. You can see that it takes 7 to 8 seconds per 10 batches.

Multi-node multi-GPU (first try)

I was using PyTorch 1.7.1. Unfortunately, there are no console logs, so we can only get the time statistics from the tensorboard log. Please see https://tensorboard.dev/experiment/8SxVEHZnSweuDW8rdxTw4g/#scalars&_smoothingWeight=0 for the timestamps. It takes about 4 to 5 minutes per 10 batches. I think one of the reasons is that the two cloud machines are not in the same region, so the communication overhead dominates the training time.

Multi-node multi-GPU (second try)

I was using PyTorch 1.7.1, so there are still no console logs. The tensorboard log can be found at https://tensorboard.dev/experiment/ebNLdWt5S96HbxWb39kTdg/ and shows that it takes 8 to 9 seconds per 10 batches, which is only a bit slower than the single-machine baseline.

Multi-node multi-GPU (third try)

I have switched to PyTorch 1.9.0, which uses elastic launches (see https://pytorch.org/docs/stable/elastic/run.html). There are still no console logs. The tensorboard log shows that it takes about 8 to 9 seconds per 10 batches, comparable to the second try.

(Note: all runs use the same arguments, i.e., --max-duration=200, --bucketing-samplers=1, so the training times are comparable.)
|
The following are the training logs using two machines.

Machine 1 (the master node)

$ export CUDA_VISIBLE_DEVICES="0,1,2,3"
$ ./conformer_ctc/run-multi-node-multi-gpu.sh --master-addr 10.25.130.9 --master-port 12356 --node-rank 0 --num-nodes 2 --full-libri 1

Click to view the detailed log
It is strange that it prints the following log from …, but …

Machine 2

$ export CUDA_VISIBLE_DEVICES="0,1,2"
$ ./conformer_ctc/run-multi-node-multi-gpu.sh --master-addr 10.25.130.9 --master-port 12356 --node-rank 1 --num-nodes 2 --full-libri 1

Click to view the detailed log
|
I managed to get the console log by changing logging.info to logging.warning.

It turns out the training speed is not stable. At the beginning of epoch 1, it takes about 40 seconds per 50 batches; see the log below.

In later batches, the time needed doubles. See the log below.
|
Interesting.
We have to figure out how to set the console log-level to info, not warn.
It must be an option for the logging module.
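For reference, a minimal sketch of one way to do that with the standard logging module (not necessarily the right place to put it in this script):

```python
import logging

# The root logger defaults to WARNING, so logging.info() is silent on the
# console until the level is lowered explicitly, e.g. once per worker process:
logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s",
    level=logging.INFO,
)
```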
Regarding the slowdown: my suspicion is that this is some kind of throttling that comes from the cluster, intended to ensure QoS or something like that. (Maybe we have to ask them to give us machines on the same sub-network.)
…On Thu, Sep 30, 2021 at 7:22 PM Fangjun Kuang ***@***.***> wrote:
I manage to get the console log by changing logging.info to
logging.warning.
It turns out the training speed is not stable.
At the beginning of epoch 1, it takes about 40 seconds per 50 batches. See
the log below
2021-09-30 17:34:05,265 WARNING [train.py:546] (0/7) Epoch 1, batch 0, batch avg ctc loss 0.4764, batch avg att loss 0.7291, batch avg loss 0.6533, total avg ctc loss: 0.4764, total avg att loss: 0.7291, total avg loss: 0.6533, batch size: 10
WARNING:root:Epoch 1, batch 50, batch avg ctc loss 0.4471, batch avg att loss 0.7007, batch avg loss 0.6246, total avg ctc loss: 0.4770, total avg att loss: 0.6918, total avg loss: 0.6274, batch size: 10
2021-09-30 17:34:45,024 WARNING [train.py:546] (0/7) Epoch 1, batch 50, batch avg ctc loss 0.4471, batch avg att loss 0.7007, batch avg loss 0.6246, total avg ctc loss: 0.4770, total avg att loss: 0.6918, total avg loss: 0.6274, batch size: 10
WARNING:root:Epoch 1, batch 100, batch avg ctc loss 0.5492, batch avg att loss 0.7107, batch avg loss 0.6623, total avg ctc loss: 0.4816, total avg att loss: 0.6920, total avg loss: 0.6289, batch size: 14
2021-09-30 17:35:25,843 WARNING [train.py:546] (0/7) Epoch 1, batch 100, batch avg ctc loss 0.5492, batch avg att loss 0.7107, batch avg loss 0.6623, total avg ctc loss: 0.4816, total avg att loss: 0.6920, total avg loss: 0.6289, batch size: 14
WARNING:root:Epoch 1, batch 150, batch avg ctc loss 0.4808, batch avg att loss 0.6649, batch avg loss 0.6097, total avg ctc loss: 0.4849, total avg att loss: 0.6941, total avg loss: 0.6314, batch size: 8
2021-09-30 17:36:04,990 WARNING [train.py:546] (0/7) Epoch 1, batch 150, batch avg ctc loss 0.4808, batch avg att loss 0.6649, batch avg loss 0.6097, total avg ctc loss: 0.4849, total avg att loss: 0.6941, total avg loss: 0.6314, batch size: 8
WARNING:root:Epoch 1, batch 200, batch avg ctc loss 0.5185, batch avg att loss 0.7521, batch avg loss 0.6821, total avg ctc loss: 0.4836, total avg att loss: 0.6958, total avg loss: 0.6321, batch size: 9
2021-09-30 17:36:46,014 WARNING [train.py:546] (0/7) Epoch 1, batch 200, batch avg ctc loss 0.5185, batch avg att loss 0.7521, batch avg loss 0.6821, total avg ctc loss: 0.4836, total avg att loss: 0.6958, total avg loss: 0.6321, batch size: 9
In later batches, the time needed *doubles*. See the log below.
2021-09-30 17:58:56,533 WARNING [train.py:546] (0/7) Epoch 1, batch 950, batch avg ctc loss 0.5290, batch avg att loss 0.7002, batch avg loss 0.6488, total avg ctc loss: 0.4581, total avg att loss: 0.6504, total avg loss: 0.5927, batch size: 15
WARNING:root:Epoch 1, batch 1000, batch avg ctc loss 0.3421, batch avg att loss 0.5315, batch avg loss 0.4747, total avg ctc loss: 0.4558, total avg att loss: 0.6478, total avg loss: 0.5902, batch size: 8
2021-09-30 18:01:25,202 WARNING [train.py:546] (0/7) Epoch 1, batch 1000, batch avg ctc loss 0.3421, batch avg att loss 0.5315, batch avg loss 0.4747, total avg ctc loss: 0.4558, total avg att loss: 0.6478, total avg loss: 0.5902, batch size: 8
WARNING:root:Epoch 1, batch 1050, batch avg ctc loss 0.3807, batch avg att loss 0.5738, batch avg loss 0.5159, total avg ctc loss: 0.4326, total avg att loss: 0.6181, total avg loss: 0.5625, batch size: 9
2021-09-30 18:03:47,349 WARNING [train.py:546] (0/7) Epoch 1, batch 1050, batch avg ctc loss 0.3807, batch avg att loss 0.5738, batch avg loss 0.5159, total avg ctc loss: 0.4326, total avg att loss: 0.6181, total avg loss: 0.5625, batch size: 9
WARNING:root:Epoch 1, batch 1100, batch avg ctc loss 0.5841, batch avg att loss 0.6984, batch avg loss 0.6641, total avg ctc loss: 0.4385, total avg att loss: 0.6205, total avg loss: 0.5659, batch size: 10
2021-09-30 18:05:30,333 WARNING [train.py:546] (0/7) Epoch 1, batch 1100, batch avg ctc loss 0.5841, batch avg att loss 0.6984, batch avg loss 0.6641, total avg ctc loss: 0.4385, total avg att loss: 0.6205, total avg loss: 0.5659, batch size: 10
WARNING:root:Epoch 1, batch 1150, batch avg ctc loss 0.3946, batch avg att loss 0.6428, batch avg loss 0.5683, total avg ctc loss: 0.4391, total avg att loss: 0.6164, total avg loss: 0.5632, batch size: 9
2021-09-30 18:07:17,134 WARNING [train.py:546] (0/7) Epoch 1, batch 1150, batch avg ctc loss 0.3946, batch avg att loss 0.6428, batch avg loss 0.5683, total avg ctc loss: 0.4391, total avg att loss: 0.6164, total avg loss: 0.5632, batch size: 9
WARNING:root:Epoch 1, batch 1200, batch avg ctc loss 0.4343, batch avg att loss 0.5765, batch avg loss 0.5339, total avg ctc loss: 0.4409, total avg att loss: 0.6142, total avg loss: 0.5622, batch size: 9
2021-09-30 18:09:12,466 WARNING [train.py:546] (0/7) Epoch 1, batch 1200, batch avg ctc loss 0.4343, batch avg att loss 0.5765, batch avg loss 0.5339, total avg ctc loss: 0.4409, total avg att loss: 0.6142, total avg loss: 0.5622, batch size: 9
|
I found this (https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/utils/logging.py#L37):

    def _setup_logger(name: Optional[str] = None):
        log = logging.getLogger(name)
        log.setLevel(os.environ.get("LOGLEVEL", get_log_level()))
        return log

But using export LOGLEVEL=INFO does not help.
Will do this after the holiday. |
RE the log, how about implementing our own using …
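For reference, a minimal sketch of what "our own" per-rank logging could look like, using only the standard library; the file-name pattern and format string are made up for illustration and are not this repo's actual layout.

```python
import logging


def setup_logger(log_prefix: str, rank: int, world_size: int,
                 level: int = logging.INFO) -> None:
    # One log file per rank so that workers never interleave their writes.
    fmt = (
        "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] "
        f"({rank}/{world_size}) %(message)s"
    )
    logging.basicConfig(
        filename=f"{log_prefix}-rank-{rank}.log",
        format=fmt,
        level=level,
    )
    # Mirror everything to the console as well.
    console = logging.StreamHandler()
    console.setLevel(level)
    console.setFormatter(logging.Formatter(fmt))
    logging.getLogger().addHandler(console)
```
|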
How will we handle Lhotse’s use of the |
Hey there,
Not sure if it would be relevant here, but have you thought of gradient accumulation to reduce the effect of communication overhead? (See Fig. 2 in "Scaling Neural Machine Translation".)
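For context, a hedged sketch of what gradient accumulation looks like under DDP; compute_loss is a hypothetical helper, and the point is that no_sync() skips the gradient all-reduce on all but the last micro-batch of each accumulation window, cutting the number of communication rounds by roughly that factor.

```python
import contextlib


def train_one_epoch(model, loader, optimizer, compute_loss, accum_steps: int = 4):
    # `model` is assumed to be wrapped in torch.nn.parallel.DistributedDataParallel.
    model.train()
    optimizer.zero_grad()
    for i, batch in enumerate(loader):
        is_update_step = (i + 1) % accum_steps == 0
        # Skip DDP's gradient all-reduce on the intermediate micro-batches.
        ctx = contextlib.nullcontext() if is_update_step else model.no_sync()
        with ctx:
            loss = compute_loss(model, batch) / accum_steps
            loss.backward()
        if is_update_step:
            optimizer.step()
            optimizer.zero_grad()
```
|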
@francoishernandez That's interesting, but I think for our purposes the communication does not really dominate (since the models are not so huge), and we probably need to focus on core modeling stuff, and things like real-time decoding, for now. |
Multi-node multi-GPU training is in general working, though it takes about 9 days to finish 50 epochs with 7 GPUs, a bit longer than single-node multi-GPU training (4 GPUs), which takes about 5 days. The WER for multi-node multi-GPU training is comparable with that of single-node multi-GPU training. The WER for this pull request is …

Decoding log

Training log of the node with rank 0
|
Cool! |
Yes, I am writing decoding scripts for #54. |
Test with two machines, each with 3 GPUs.
Machine 1
Click to view the log
Machine 2
Click to view the log