This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Gluon Trainer fails with Horovod DistributedTrainer #15260

Closed
chandana1332 opened this issue Jun 17, 2019 · 5 comments

@chandana1332
Contributor

chandana1332 commented Jun 17, 2019


Description

I'm training with a Gluon Trainer rather than the MXNet optimizer used in the MNIST example.

When I pass the Gluon Trainer to hvd.DistributedTrainer, I get the following error:

Traceback (most recent call last):
  File "hvd_ner.py", line 334, in <module>
    test(args)
  File "hvd_ner.py", line 313, in test
    ner.run()
  File "hvd_ner.py", line 295, in run
    self._train_horovod(train_data_loader)
  File "hvd_ner.py", line 204, in _train_horovod
    self.hvd_trainer = hvd.DistributedTrainer(params, self.trainer)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/__init__.py", line 91, in __init__
    params, optimizer, optimizer_params=optimizer_params, kvstore=None)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/trainer.py", line 91, in __init__
    self._init_optimizer(optimizer, optimizer_params)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/trainer.py", line 121, in _init_optimizer
    **optimizer_params)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/optimizer/optimizer.py", line 176, in create_optimizer
    if name.lower() in Optimizer.opt_registry:
AttributeError: 'Trainer' object has no attribute 'lower'

I debugged it a little further by printing the variable 'name' at line 176 of mxnet/optimizer/optimizer.py.
When I use a Gluon Trainer, the value of the variable is the Gluon Trainer object.
When I use an MXNet optimizer, the value of the variable is 'sgd'.
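
The failure mode can be reproduced without MXNet at all. The sketch below is not the library's code: `FakeTrainer`, `OPT_REGISTRY`, `create_optimizer`, and `try_create` are hypothetical stand-ins that only mirror the `name.lower()` lookup shown in the traceback, to illustrate why a non-string argument raises `AttributeError`.

```python
class FakeTrainer:
    """Hypothetical stand-in for a gluon.Trainer instance (not the real class)."""


# Hypothetical stand-in for Optimizer.opt_registry: lowercase name -> optimizer.
OPT_REGISTRY = {"sgd": "SGD optimizer class"}


def create_optimizer(name):
    # Mirrors the failing line from the traceback: `name.lower()` assumes
    # that `name` is a string such as 'sgd'.
    if name.lower() in OPT_REGISTRY:
        return OPT_REGISTRY[name.lower()]
    raise ValueError("Cannot find optimizer %s" % name)


def try_create(name):
    """Return the looked-up optimizer, or the error message if lookup fails."""
    try:
        return create_optimizer(name)
    except AttributeError as err:
        return str(err)


print(try_create("sgd"))          # a string name resolves through the registry
print(try_create(FakeTrainer()))  # an object without .lower() fails the lookup
```

Passing anything that is not a string reaches the `.lower()` call and fails in the same way as the traceback above.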

Environment info

Framework: MXNet

Framework version:
----------MXNet Info-----------
Version : 1.5.0
Directory : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
Commit Hash : 62a85f3

Horovod version: 0.16.4

MPI version: 3.1.0

CUDA version: 10.0

NCCL version:

Python version:
----------Python Info----------
Version : 3.6.5
Compiler : GCC 7.2.0
Build : ('default', 'Apr 29 2018 16:14:56')
Arch : ('64bit', '')

OS and version:
--------System Info----------
Platform : Linux-4.4.0-1081-aws-x86_64-with-debian-stretch-sid
system : Linux
node : ip-172-31-63-186
release : 4.4.0-1081-aws
version : #91-Ubuntu SMP Tue Apr 16 08:21:03 UTC 2019

GCC version: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609


@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Gluon, Bug

@yuxihu
Member

yuxihu commented Jun 17, 2019

This is not the correct way to use DistributedTrainer. DistributedTrainer is itself a subclass of Gluon Trainer and takes an MXNet optimizer as an argument. You cannot pass a Gluon Trainer to DistributedTrainer.
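
A minimal sketch of the usage described above, assuming MXNet and Horovod are installed and the script is launched through Horovod/MPI. The helper name `build_distributed_trainer`, the SGD choice, and the learning rate are illustrative assumptions, not taken from the reporter's `hvd_ner.py`.

```python
def build_distributed_trainer(net, learning_rate=0.01):
    """Return a Horovod DistributedTrainer for an already-initialized Gluon block.

    Hypothetical helper for illustration; `net` is assumed to be a Gluon
    block whose parameters have already been initialized.
    """
    # Imported inside the function so this sketch can be loaded on a machine
    # without MXNet or Horovod installed.
    import mxnet as mx
    import horovod.mxnet as hvd

    # Assumed to be called once per worker process.
    hvd.init()

    # Build a plain MXNet optimizer. This object (not a gluon.Trainer)
    # is what DistributedTrainer expects as its second argument.
    optimizer = mx.optimizer.create("sgd", learning_rate=learning_rate)

    params = net.collect_params()

    # Make every worker start from the weights held by rank 0.
    hvd.broadcast_parameters(params, root_rank=0)

    # DistributedTrainer takes the place of gluon.Trainer; do not wrap
    # an existing gluon.Trainer in it.
    return hvd.DistributedTrainer(params, optimizer)
```

Because DistributedTrainer subclasses Gluon Trainer, the returned object is used the same way in the training loop, for example by calling `trainer.step(batch_size)` after `loss.backward()`.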

@leleamol
Contributor

@mxnet-label-bot add [Gluon]

@yuxihu
Member

yuxihu commented Jun 18, 2019

@chandana1332 Please reopen the issue if you have any additional questions.

@chandana1332
Contributor Author

@yuxihu thank you! That worked.
