
Is it possible to use some user or item embeddings with this library? #23

Open
hodaraad opened this issue Jun 29, 2018 · 3 comments

@hodaraad

The first paper says that you also tried adding an additional embedding layer, but one-hot encoding resulted in better performance.

I wonder what kind of embedding you used for that experiment? I'm interested in two types of embeddings:
1. Something similar to what LightFM uses: representing items by their content information rather than only by their IDs, to address the item cold-start problem.
2. Something similar to what the TensorRec framework allows: transforming the original high-dimensional item vectors into linear or non-linear representations of much lower dimension, mapping similar items to nearby points in the embedded space.

In particular, I'm wondering whether you ran into any memory or performance problems with those very big high-dimensional matrices; for the video dataset in the paper you had 330 thousand videos, which would make for a huge matrix in one-hot encoding.

Thanks

@gds123

gds123 commented Jun 30, 2018

I used the second type of embedding and got recall 0.44 and MRR 0.16.
What performance did you get in your experiments?

@hidasib
Owner

hidasib commented Jul 2, 2018

@hodaraad There is an option for using an embedding before the GRU layer. You can either (1) use the embedding=X parameter in the constructor to define an item embedding of size X; or (2) set constrained_embedding=True to tie the input embedding to the output representation. The latter method is described in the second paper about GRU4Rec (https://arxiv.org/abs/1706.03847). The constrained embedding can improve results over the embedding-less setup, but it depends on the dataset. On the datasets I used, the standard embedding always performed slightly worse than either the embedding-less or the constrained embedding setup.
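To make the two options concrete, here is a minimal NumPy sketch of the difference between a separate input embedding and the constrained (tied) embedding. The matrix names and sizes below are illustrative assumptions, not the actual GRU4Rec internals:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, layer_size = 1000, 100

# Illustrative weights only -- not the real GRU4Rec variables:
E = rng.normal(size=(n_items, layer_size))    # separate input embedding (embedding=X)
Wy = rng.normal(size=(n_items, layer_size))   # output weight matrix

item = 42
h_standard = E[item]       # embedding=X: a dedicated matrix feeds the GRU
h_constrained = Wy[item]   # constrained_embedding=True: reuse the rows of Wy

# The constrained setup drops E entirely, so the input and output
# representations of an item are the same vector.
assert h_standard.shape == h_constrained.shape == (layer_size,)
```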

Both of these settings make the network learn the embedding during training, along with the session dynamics. You cannot use pretrained embeddings with this code without some modifications. I experimented with the pretrained approach (using both content-based and CF pretrained embeddings), but it didn't improve the model.

Regarding one-hot encoding and memory consumption: if you use one-hot encoding, you basically do an indexing; you NEVER store your data as a big matrix with one value per row and a bunch of zeros. So there is no additional memory requirement for the one-hot encoding approach besides keeping a map from the original item IDs of your data to 1...N, which is nothing compared to the network itself (less than 1 MB for 330K items). Moreover, even if you use an embedding, you still need the indexing to feed the network with inputs.
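The equivalence between one-hot multiplication and plain indexing can be sketched in a few lines of NumPy (toy sizes, for illustration only):

```python
import numpy as np

n_items, dim = 5, 3
W = np.arange(n_items * dim, dtype=float).reshape(n_items, dim)

item = 2
one_hot = np.zeros(n_items)
one_hot[item] = 1.0

# Multiplying by a one-hot vector just selects one row of W...
via_matmul = one_hot @ W
# ...which is exactly what indexing does, without ever materializing
# a huge, mostly-zero one-hot matrix:
via_index = W[item]

assert np.array_equal(via_matmul, via_index)

# The only extra memory is a small map from original IDs to indices, e.g.:
id_map = {"video_abc": 0, "video_def": 1}  # hypothetical original ID -> 0..N-1
```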

The memory bottleneck is always the weight matrices that are indexed by the items: the Wx of the GRU if there is no embedding; the embedding matrix E if there is an embedding; and the output weight matrix Wy in all cases. The size of Wx is n_items x (3*first_layer); of E, n_items x embedding; and of Wy, n_items x last_layer. Since constrained embedding only uses Wy, it requires the least amount of memory. The standard (embedding-less) setting requires 4 times more (because of Wx, assuming the first and last layers are of equal size). The basic embedding version requires 2 times more (assuming the embedding size equals the size of the first layer).
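The 1x / 4x / 2x relationship between the three setups follows directly from those matrix shapes; a small sketch of the parameter counts, under the same equal-size assumption:

```python
# Parameter counts of the item-indexed matrices, assuming
# first_layer == last_layer == embedding size (call it `layer`).
def item_indexed_params(n_items, layer):
    Wy = n_items * layer      # output weights, present in every setup
    Wx = n_items * 3 * layer  # GRU input weights (x3 for the GRU's gates)
    E = n_items * layer       # input embedding matrix
    return {
        "constrained_embedding": Wy,   # only Wy
        "no_embedding": Wx + Wy,       # 4x the constrained setup
        "standard_embedding": E + Wy,  # 2x the constrained setup
    }

sizes = item_indexed_params(500_000, 100)
```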

Assuming 500K items, 100 as the first/last layer and embedding size, and 32-bit floats, Wy uses up ~191 MB of memory. During training you also need a matrix of the same size for the accumulated gradients (if you use adaptive learning rate methods, such as Adagrad or Adam) and another one for the velocity (if you use momentum). So the constrained embedding setup uses ~572+X, the standard setting ~2288+X, and the embedding setup ~1144+X megabytes of memory, where X covers the rest of the network (negligible compared to the size of Wy), the internal variables of Theano, and the sample store used to speed up training on the GPU (defined in the train function). The resulting model will be around 200 / 800 / 400 MB respectively.
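These figures can be reproduced with simple arithmetic, under the stated assumptions (500K items, layer/embedding size 100, 4-byte floats, and three item-indexed copies held during training: weights, accumulated gradients, velocity):

```python
n_items, layer, bytes_per_float = 500_000, 100, 4

# One n_items x layer matrix (e.g. Wy), in MB:
wy_mb = n_items * layer * bytes_per_float / 2**20   # ~191 MB

# Training keeps 3 copies of each item-indexed matrix
# (weights + Adagrad/Adam accumulator + momentum velocity):
constrained_mb = 3 * 1 * wy_mb    # only Wy           -> ~572 MB
no_embedding_mb = 3 * 4 * wy_mb   # Wx + Wy (4x)      -> ~2288 MB
embedding_mb = 3 * 2 * wy_mb      # E + Wy (2x)       -> ~1144 MB
```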

Eliminating every item-indexed matrix is possible in theory if the output of the network is not the predicted item, but the embedding of the predicted item (with an L2 or cosine similarity loss on the output). However, this approach is very inaccurate when it comes to top-N recommendation accuracy.

@KunlinY

KunlinY commented Apr 14, 2019

Both of these settings make the network learn the embedding during training, along with the session dynamics. You cannot use pretrained embeddings with this code without some modifications. I experimented with the pretrained approach (using both content-based and CF pretrained embeddings), but it didn't improve the model.

Hi, can you share the version with the pretrained approach?
