
Confusion between the --gpu flag and solverproto #925

Closed
Yangqing opened this issue Aug 14, 2014 · 10 comments

@Yangqing
Member

Currently in tools/caffe we can use --gpu to specify GPUs. However, during training this flag is ignored and the GPU ID in the solver protobuf is used instead. This can be quite confusing to the user; we should probably consider moving the training mode and training GPU ID options outside the solver prototxt, or overriding them with the command-line option.
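A minimal sketch of the override behavior under discussion (not the actual Caffe code; `SolverParam`, `ApplyGpuFlag`, and the field names are hypothetical stand-ins for the real `caffe::SolverParameter` plumbing): when --gpu is passed, the command line rewrites the solver's mode and device, otherwise the prototxt values stand.

```cpp
#include <cassert>

// Simplified stand-in for the solver-mode/device fields of
// caffe::SolverParameter (hypothetical, for illustration only).
struct SolverParam {
  bool use_gpu = false;  // corresponds to solver_mode: CPU/GPU
  int device_id = 0;     // corresponds to device_id in the prototxt
};

// If gpu_flag >= 0 (i.e. --gpu was given on the command line), rewrite
// the solver settings so the flag wins over the prototxt; otherwise
// leave the prototxt configuration untouched.
void ApplyGpuFlag(SolverParam& param, int gpu_flag) {
  if (gpu_flag >= 0) {
    param.use_gpu = true;
    param.device_id = gpu_flag;
  }
}
```

With this scheme, `caffe train --gpu 1` would run on GPU 1 even if the solver prototxt says `device_id: 2`.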

@Yangqing
Member Author

(Thanks @yosinski for raising this in caffe-users!)

@shelhamer
Member

Oh, my intention was to override the solver configuration when the --gpu arg is given. I must have overlooked it.

@Yangqing
Member Author

Yeah, the code is not there yet, although it is trivial to do - we just need to change the train() function so that the solver mode and GPU ID are rewritten.

In fact, I am thinking of marking the two solver flags as deprecated:

https://developers.google.com/protocol-buffers/docs/proto#options

so that we solely rely on the runtime to determine the mode rather than specifying things in the solver.
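Per the protobuf docs linked above, marking a field deprecated is a field option. A hypothetical excerpt of what this could look like in caffe.proto (the field numbers and the exact fields shown here are assumptions, not the real file):

```protobuf
message SolverParameter {
  // ... training hyperparameters ...

  // Deprecated: use the runtime --gpu flag instead of configuring the
  // mode and device in the solver prototxt.
  optional SolverMode solver_mode = 17 [deprecated = true];
  optional int32 device_id = 18 [deprecated = true];
}
```

Deprecated fields still parse, so existing prototxts would keep working while generated code warns on use.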

@longjon
Contributor

longjon commented Aug 14, 2014

Yeah, I'm in favor of mode and device being specified on the command line only -- I was just irritated by this, and it's nice to have the solver prototxt be simply "the parameters needed for training", which are independent of device.

@yosinski
Contributor

@longjon +1

@sguada
Contributor

sguada commented Aug 14, 2014

I would prefer to keep device_id in the prototxt and overwrite it if needed, since in the future one could define a different device_id per layer for parallelism.


@shelhamer
Member

@sguada I agree we need a device_id field for the layers, but it isn't needed for the solver. I think it's alright to deprecate the solver message device_id. When we do multi-device parallelism I don't think we should hardcode an actual device ID into our model prototxt, but instead want to have the layers' device IDs refer to the order of the runtime args.

device: 0

in a prototxt would mean the first --gpu arg device ID.
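A sketch of the indirection being proposed (hypothetical code, not part of Caffe; `ResolveDevice` and its signature are assumptions): the layer's `device` field indexes into the ordered list of --gpu args rather than naming a physical device, so the same prototxt can run on whatever GPUs are supplied at launch.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Map a layer's `device` index to a physical GPU ID via the ordered
// --gpu argument list. E.g. with `--gpu 3,7`, a layer with `device: 0`
// runs on physical GPU 3 and `device: 1` runs on GPU 7.
int ResolveDevice(const std::vector<int>& gpu_flags, int layer_device_index) {
  if (layer_device_index < 0 ||
      layer_device_index >= static_cast<int>(gpu_flags.size())) {
    throw std::out_of_range("layer device index exceeds the --gpu list");
  }
  return gpu_flags[layer_device_index];
}
```

This keeps physical device IDs out of the model prototxt entirely; only the runtime invocation binds indices to hardware.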

@Yangqing
Member Author

@sguada @shelhamer

If I may be so bold - I would actually vote against having device_id in the layer definition. The reason is probably the same as that for removing the device_id in the solver proto: device_id is really a scheduling thing and not a mathematical thing, so we probably should have the layers proto defining the network architecture, the solver proto defining the solver math, and a separate way (either in the form of commandline arguments, or a third (oh my) proto) defining how training should be carried out on device (including multi GPUs). In this way we hopefully won't mix things up at multiple places :)

@shelhamer
Member

@Yangqing sure, I think we agree in spirit that the architecture and parallelism should be defined distinctly. Whether they are both in the model proto or a third "execution" proto doesn't inspire any strong feelings for me.

I like your proposal for keeping them distinct, but it may grow complicated since the architecture and execution protos will have to refer to each other to say where to do data / model parallelism.

Details, details...


@shelhamer
Member

Fixed by command line override in #961.
