Gluon Trainer fails with Horovod DistributedTrainer #15260

chandana1332 · 2019-06-17T21:24:22Z

Description

I'm training with a Gluon Trainer instead of the MXNet optimizer used in the MNIST example.

When I pass the Gluon Trainer to hvd.DistributedTrainer, I get the following error:

Traceback (most recent call last):
  File "hvd_ner.py", line 334, in <module>
    test(args)
  File "hvd_ner.py", line 313, in test
    ner.run()
  File "hvd_ner.py", line 295, in run
    self._train_horovod(train_data_loader)
  File "hvd_ner.py", line 204, in _train_horovod
    self.hvd_trainer = hvd.DistributedTrainer(params, self.trainer)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/__init__.py", line 91, in __init__
    params, optimizer, optimizer_params=optimizer_params, kvstore=None)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/trainer.py", line 91, in __init__
    self._init_optimizer(optimizer, optimizer_params)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/trainer.py", line 121, in _init_optimizer
    **optimizer_params)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/optimizer/optimizer.py", line 176, in create_optimizer
    if name.lower() in Optimizer.opt_registry:
AttributeError: 'Trainer' object has no attribute 'lower'

I debugged it a little more by printing the variable 'name' at line 176 of mxnet/optimizer/optimizer.py.
When I use a Gluon Trainer, the value of the variable is the Gluon Trainer object.
When I use an MXNet optimizer, the value of the variable is 'sgd'.
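
The failure can be reproduced without MXNet installed. The sketch below is not MXNet's code: `SGD`, `FakeTrainer`, and `describe` are hypothetical stand-ins, and only the `name.lower() in Optimizer.opt_registry` check is taken from the traceback. It shows why a string such as 'sgd' passes the lookup while any non-string object fails before the registry is even consulted.

```python
class SGD:
    """Stand-in for a registered optimizer class (hypothetical)."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class Optimizer:
    # Stand-in registry keyed by lowercase optimizer name.
    opt_registry = {"sgd": SGD}


def create_optimizer(name, **kwargs):
    # Same check as the line in the traceback: `name` is assumed to be a string.
    if name.lower() in Optimizer.opt_registry:
        return Optimizer.opt_registry[name.lower()](**kwargs)
    raise ValueError("Cannot find optimizer %s" % name)


class FakeTrainer:
    """Stand-in for a gluon.Trainer instance; it has no .lower() method."""


def describe(name):
    """Return 'ok' if an optimizer is created, else the AttributeError text."""
    try:
        create_optimizer(name)
        return "ok"
    except AttributeError as err:
        return str(err)


print(describe("sgd"))          # ok
print(describe(FakeTrainer()))  # 'FakeTrainer' object has no attribute 'lower'
```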

Environment info

Framework: MXNet

Framework version:
----------MXNet Info-----------
Version : 1.5.0
Directory : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
Commit Hash : 62a85f3

Horovod version: 0.16.4

MPI version: 3.1.0

CUDA version: 10.0

NCCL version:

Python version:
----------Python Info----------
Version : 3.6.5
Compiler : GCC 7.2.0
Build : ('default', 'Apr 29 2018 16:14:56')
Arch : ('64bit', '')

OS and version:
--------System Info----------
Platform : Linux-4.4.0-1081-aws-x86_64-with-debian-stretch-sid
system : Linux
node : ip-172-31-63-186
release : 4.4.0-1081-aws
version : #91-Ubuntu SMP Tue Apr 16 08:21:03 UTC 2019

GCC version: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609

mxnet-label-bot · 2019-06-17T21:24:26Z

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Gluon, Bug

yuxihu · 2019-06-17T21:34:02Z

This is not the correct way to use DistributedTrainer. DistributedTrainer is a subclass of Gluon Trainer that takes an MXNet optimizer as an argument. You cannot pass a Gluon Trainer to DistributedTrainer.
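
A minimal sketch of the usage described above, assuming `net` is an already initialized Gluon block. The helper name `make_distributed_trainer`, the `base_lr` parameter, and the choice of 'sgd' are illustrative and not from the original script. Scaling the learning rate by `hvd.size()` is a common convention rather than a requirement. Horovod is imported inside the function only so the helper can be defined on a machine without it.

```python
def make_distributed_trainer(net, base_lr=0.01):
    """Build a Horovod trainer for a Gluon block (needs mxnet and horovod)."""
    import horovod.mxnet as hvd  # lazy import; requires Horovod built with MXNet

    hvd.init()
    params = net.collect_params()

    # Start every worker from the same weights as rank 0.
    hvd.broadcast_parameters(params, root_rank=0)

    # Pass the optimizer by name plus its hyperparameters,
    # not an existing gluon.Trainer object.
    optimizer_params = {"learning_rate": base_lr * hvd.size()}
    return hvd.DistributedTrainer(params, "sgd", optimizer_params)
```

The returned object is used like a regular Gluon Trainer: after computing gradients under `autograd.record()` and calling `backward()`, call `trainer.step(batch_size)`.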

leleamol · 2019-06-18T00:22:32Z

@mxnet-label-bot add [Gluon]

yuxihu · 2019-06-18T16:49:23Z

@chandana1332 Please reopen the issue if you have any additional questions.

chandana1332 · 2019-06-18T16:51:54Z

@yuxihu thank you! That worked.

marcoabreu added the Gluon label Jun 18, 2019

yuxihu closed this as completed Jun 18, 2019
