Multi-GPU training fails, at least with the xlnet model. With a high max_len, it crashes without printing any error. With a low max_len, I get the error below. Training on 1 GPU works well. I'm using 4 Nvidia V100 GPUs.
Traceback (most recent call last):
  File "src/train.py", line 830, in <module>
    main()
  File "src/train.py", line 690, in main
    train_step(dummy_batch)
  File "src/train.py", line 566, in train_step
    loss, acc, ppl = forward_step(batch)
  File "src/train.py", line 556, in forward_step
    acc = reduce_tensor(acc)
  File "src/train.py", line 530, in reduce_tensor
    reduced = tensor.clone()
AttributeError: 'float' object has no attribute 'clone'
(The same traceback is printed by each of the 4 processes; the output of the last two is interleaved.)
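The traceback suggests that forward_step passes acc to reduce_tensor as a plain Python float, which has no .clone() method. The body of reduce_tensor in src/train.py is not shown in this thread, so the following is only a sketch of one possible guard, assuming the function follows the usual clone / all_reduce / divide-by-world-size pattern. The device parameter and the fallback for a non-initialized process group are assumptions added for illustration, not part of the original code.

```python
import torch
import torch.distributed as dist


def reduce_tensor(value, device=None):
    """Average a metric across processes; accepts a tensor or a plain number.

    Sketch only: `device` is a hypothetical parameter. With the NCCL backend
    the tensor has to live on the GPU owned by the current process.
    """
    if not torch.is_tensor(value):
        # Wrap Python floats/ints so that .clone() and all_reduce are available.
        value = torch.tensor(float(value), device=device)
    reduced = value.clone()
    if dist.is_available() and dist.is_initialized():
        # Sum over all processes, then divide to obtain the mean.
        dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
        reduced = reduced / dist.get_world_size()
    return reduced
```

When no process group is initialized (for example on 1 GPU), the function simply returns a copy of the input as a tensor. This only addresses the AttributeError above; it does not explain the silent crash seen with a high max_len.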
Still not working... :(
Reproducible with python -m torch.distributed.launch --nproc_per_node=8 src/train.py --config src/configs/gpt2-dailydialog.json on an AWS p3dn.24xlarge (8 Volta V100 GPUs). The program just crashes. It works on 1 GPU, though.