This repository has been archived by the owner on May 24, 2018. It is now read-only.

Readable model dump. #43

Open
BaiGang opened this issue Nov 18, 2015 · 17 comments

Comments

@BaiGang

BaiGang commented Nov 18, 2015

Hi,

Currently all learning methods in wormhole save the resulting models in binary format. This works well for machine learning competitions, i.e. when both training and prediction use wormhole components. However, in more general cases, where we train the models offline and want to apply them in an online component (in our case a server running on the JVM), the binary format is inconvenient. So a readable model output in text format (or another exchangeable format such as protobuf) would be very welcome.

Thanks,
Gang

@BaiGang
Author

BaiGang commented Nov 20, 2015

I address the readable dump of DiFacto model by parsing the binary file saved via SaveModel, i.e. Save in KVStore and IVal AdaGradEntry in DiFacto.

Ideally we can abstract the Entry data and the internal storage in KVStore using protobuf. This would keep the IO implementations neat and make our model results exchangeable across various languages and platforms.
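
As a rough illustration of the "parse the binary file and print text" approach, here is a minimal sketch. The record layout is an assumption made purely for illustration (one 64-bit feature id followed by one 32-bit float weight, in host byte order); it is not the actual wormhole/DiFacto format, whose entries carry more state. `AppendRecord`, `DumpToText` and `DumpToString` are hypothetical names, not functions from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// HYPOTHETICAL record layout: [uint64 feature id][float weight], host byte order.
// The real DiFacto model file stores richer entries; adapt the reads accordingly.

// Writes one record in the assumed binary layout (handy for building test input).
void AppendRecord(std::ostream& out, uint64_t key, float weight) {
  assert(out.good() && "output stream must be writable");
  out.write(reinterpret_cast<const char*>(&key), sizeof(key));
  out.write(reinterpret_cast<const char*>(&weight), sizeof(weight));
}

// Reads records until the stream ends and prints one "id<TAB>weight" line each.
// Returns the number of complete records converted.
std::size_t DumpToText(std::istream& in, std::ostream& out) {
  std::size_t n = 0;
  uint64_t key = 0;
  float weight = 0.0f;
  while (in.read(reinterpret_cast<char*>(&key), sizeof(key)) &&
         in.read(reinterpret_cast<char*>(&weight), sizeof(weight))) {
    out << key << '\t' << weight << '\n';
    ++n;
  }
  return n;
}

// Convenience wrapper: binary bytes in, text dump out.
std::string DumpToString(const std::string& bytes) {
  std::istringstream in(bytes);
  std::ostringstream out;
  DumpToText(in, out);
  return out.str();
}
```

A text dump of this kind is easy to load from a JVM service line by line, which is the use case described above; a protobuf schema would serve the same purpose with stronger typing.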

@BaiGang
Author

BaiGang commented Nov 20, 2015

So my proposal above is mainly related to ps-lite. I'll try it out and make a WIP pull request there.

@mli
Member

mli commented Nov 28, 2015

yeah, that's a good suggestion.

i'll add a tool to convert the binary model into an ascii format.

at the same time, i'm trying to refactor fm into a separate repo called dmlc/difacto, with two major changes

  1. having a single machine multiple threads implementation, which should process data <100GB easily on a single machine. and also will be easy to have python/R bindings
  2. switch to the dev branch of ps-lite, which is a simplified version of the master branch. mxnet is using it now and it works well

i hope to get it done in a week.

@CNevd
Contributor

CNevd commented Nov 29, 2015

Very nice, looking forward to the changes :)

@BaiGang
Author

BaiGang commented Nov 30, 2015

Thanks and looking forward to the changes. : )

@BaiGang
Author

BaiGang commented Dec 30, 2015

Any update on this?

I'm also interested in the refactor of ps-lite. It has had no updates for two months. Is it finalized?

@formath

formath commented Jul 1, 2016

@BaiGang "I address the readable dump of DiFacto model by parsing the binary file saved via SaveModel". Can you share the parsing method with me? Thanks.

@CNevd
Contributor

CNevd commented Jul 1, 2016

see dump.cc

@toughJack

@BaiGang @mli
When I dump the model to text format, I find that the original feature ids are converted into new ids (large numbers). How can I keep the original feature ids in the model?
Thanks!

@mli
Member

mli commented Aug 24, 2016

there is a revert key id function, I guess it is called in the data reader

@formath

formath commented Aug 25, 2016

@toughJack Maybe you should change code in localizer.h like this.

else if (sizeof(I) == 8) {
#pragma omp parallel for num_threads(nt_)
    for (size_t i = 0; i < idx_size; ++i) {
      // Keep the original feature id instead of byte-reversing it:
      // pair_[i].k = ReverseBytes(blk.index[i]);
      pair_[i].k = blk.index[i];
      pair_[i].i = i;
    }
}

@CNevd
Contributor

CNevd commented Aug 25, 2016

@formath @toughJack see issues/8 (CNevd/Difacto_DMLC#8)
Just commenting out pair_[i].k = ReverseBytes(blk.index[i]); will make the servers' key ranges imbalanced if your max key is small.
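
To make the imbalance concrete, here is a small sketch. It assumes the servers split the full uint64 key space into equal contiguous ranges; that partitioning scheme, and the names `ReverseBytes64`, `ServerOf` and `CountPerServer`, are illustrative assumptions rather than the actual ps-lite or wormhole code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for ReverseBytes: swaps the byte order of a 64-bit key,
// so the low byte of a small id ends up in the most significant position.
inline uint64_t ReverseBytes64(uint64_t x) {
  x = ((x & 0x00FF00FF00FF00FFULL) << 8) | ((x >> 8) & 0x00FF00FF00FF00FFULL);
  x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
  return (x << 32) | (x >> 32);
}

// ASSUMED partitioning: the whole uint64 range is cut into num_servers
// equal contiguous slices, and a key belongs to the slice it falls into.
inline uint32_t ServerOf(uint64_t key, uint32_t num_servers) {
  assert(num_servers > 0);
  if (num_servers == 1) return 0;
  const uint64_t width = UINT64_MAX / num_servers + 1;
  return static_cast<uint32_t>(key / width);
}

// Counts how many keys land on each server, with or without byte reversal.
std::vector<std::size_t> CountPerServer(const std::vector<uint64_t>& keys,
                                        uint32_t num_servers, bool reverse) {
  std::vector<std::size_t> counts(num_servers, 0);
  for (uint64_t k : keys) {
    ++counts[ServerOf(reverse ? ReverseBytes64(k) : k, num_servers)];
  }
  return counts;
}
```

Under these assumptions, the raw ids 0..255 all fall into the first server's slice, while their byte-reversed forms spread evenly over four servers. Byte reversal is its own inverse, so applying it again to a dumped key should recover the original feature id.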

@mli
Member

mli commented Aug 25, 2016

you can manually set the max_key, so that the servers only partition that key range
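
A minimal sketch of that idea, assuming the servers divide [0, max_key) into equal contiguous slices. `ServerOfKey` and `CountFirstKeys` are hypothetical helpers written for this illustration; how max_key is actually configured in wormhole is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ASSUMED scheme: servers own equal contiguous slices of [0, max_key).
inline uint32_t ServerOfKey(uint64_t key, uint64_t max_key, uint32_t num_servers) {
  assert(num_servers > 0 && key < max_key);
  // Slice width is ceil(max_key / num_servers), computed without overflow.
  const uint64_t width = max_key / num_servers + (max_key % num_servers != 0 ? 1 : 0);
  return static_cast<uint32_t>(key / width);
}

// Counts how the ids 0..num_keys-1 are distributed over the servers.
std::vector<std::size_t> CountFirstKeys(uint64_t num_keys, uint64_t max_key,
                                        uint32_t num_servers) {
  std::vector<std::size_t> counts(num_servers, 0);
  for (uint64_t k = 0; k < num_keys; ++k) {
    ++counts[ServerOfKey(k, max_key, num_servers)];
  }
  return counts;
}
```

With max_key set to 256, the unreversed ids 0..255 split evenly across four servers; with max_key left at the full uint64 range, they all land on the first one.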

@CNevd
Contributor

CNevd commented Aug 25, 2016

@mli yes:)

@formath

formath commented Aug 25, 2016

@CNevd Good suggestion. I always generate balanced uint64 feature ids offline, so I missed that. If the max key is small, setting max_key is indeed the right approach.

@toughJack

@mli
I noticed that you mentioned a single-machine, multi-threaded implementation of FM:
"1. having a single machine multiple threads implementation, which should process data <100GB easily on a single machine. and also will be easy to have python/R bindings"
I did not find any manual for the single-machine, multi-threaded version.
Does it work? If so, how do I set the relevant parameters and run it?
Thanks

@mli
Member

mli commented Aug 25, 2016

  1. just run multiple workers on the same machine
  2. try the lbfgs implementation in dmlc/difacto

