
Seems Word2VecKeyedVectors.get_keras_embedding should take care of 'mask_zero' #1900

Closed

uZeroJ opened this issue Feb 14, 2018 · 9 comments

Labels: bug (Issue described a bug), difficulty medium (Medium issue: required good gensim understanding & python skills)

Comments

uZeroJ commented Feb 14, 2018

Description

As described on Google Groups, I'm creating this issue for tracking; thanks to Ivan for the quick look.
Google Groups tracking

This is a question about combining gensim with Keras.
Since KeyedVectors.vectors starts at index '0', an input index of '0' should map to a valid word when looking up a vector in the embedding layer (code line).
The Keras Embedding layer, on the other hand, provides a mask_zero option that treats '0' entries as padding (Keras Embedding).
And as shown in the Keras blog tutorial, they pass an embedding matrix to the Embedding layer with word indices starting from '1' and input_dim=len(vocab) + 1.
Thus, if we call get_keras_embedding on a Word2VecKeyedVectors, every '0' padding entry is treated as the first word, or the first word is masked out entirely if we set mask_zero=True.
So I smell a bug here.
Changing the behavior of get_keras_embedding may also require changing model.index2word or KeyedVectors.vocab.

I know there are many ways to work around this issue, such as manually padding with a value other than '0' or manually building a Keras Embedding layer (a minimal sketch of such a workaround follows below). But what could we do to take advantage of both get_keras_embedding and Keras's pad_sequences?
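For reference, a minimal workaround sketch (my own code, not an existing gensim API): build the Embedding layer by hand, stack a zero row on top of the gensim weight matrix so that index 0 is reserved for padding, and shift every gensim word index up by one when turning text into index sequences.

import numpy as np
from gensim.models.keyedvectors import KeyedVectors
from keras.layers import Embedding
from keras.preprocessing.sequence import pad_sequences

kvecs = KeyedVectors.load_word2vec_format('gensim.txt')

# Reserve row 0 for padding by stacking a zero vector on top of the gensim matrix.
weights = np.vstack([np.zeros((1, kvecs.vector_size)), kvecs.vectors])
emb_layer = Embedding(input_dim=weights.shape[0], output_dim=weights.shape[1],
                      weights=[weights], mask_zero=True, trainable=False)

# Word indices are shifted by +1 relative to gensim's own indices.
word2index = {word: kvecs.vocab[word].index + 1 for word in kvecs.vocab}
seqs = [[word2index[w] for w in ('the', 'and', 'and')]]
print(pad_sequences(seqs, maxlen=10))  # the padding 0s no longer collide with 'the'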

Thanks!

Steps/Code/Corpus to Reproduce

import os

import numpy as np
from gensim.models.keyedvectors import KeyedVectors
from keras.layers import Embedding, Input
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
import keras.backend as K

# Tiny word2vec file used for the reproduction.
emb_file = 'gensim.txt'
""" Content of gensim.txt
2 3
the 0 1 2
and 3 4 5
"""
kvecs = KeyedVectors.load_word2vec_format(emb_file)
#print(kvecs.vectors)
#print(kvecs.syn0)
emb_layer = kvecs.get_keras_embedding()

# 'the' has gensim index 0 and 'and' has index 1.
emb_weights = np.arange(2 * 3).reshape((2, 3))
inputs = np.array([0, 1, 1]).reshape((1, 3))
# pad_sequences pads with 0, which collides with the index of 'the'.
inputs_pad = pad_sequences(inputs, maxlen=10)

inp = Input((10,))
#out = Embedding(input_dim=2, output_dim=3, weights=[emb_weights])(inp)
out = emb_layer(inp)
model = Model(inp, out)
print(model.predict(inputs_pad))

Actual Results

[[[0. 1. 2.]
  [0. 1. 2.]
  [0. 1. 2.]
  [0. 1. 2.]
  [0. 1. 2.]
  [0. 1. 2.]
  [0. 1. 2.]
  [0. 1. 2.]
  [3. 4. 5.]
  [3. 4. 5.]]]

Expected Results

The expected result would be something like:

[[[0. 0. 0.]
  [0. 0. 0.]
  [0. 0. 0.]
  [0. 0. 0.]
  [0. 0. 0.]
  [0. 0. 0.]
  [0. 0. 0.]
  [0. 1. 2.]
  [3. 4. 5.]
  [3. 4. 5.]]]

Versions

Darwin-16.7.0-x86_64-i386-64bit
Python 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:04:09)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.14.0
SciPy 1.0.0
gensim 3.3.0
FAST_VERSION 0
(This has nothing to do with the environment as such, but it does require gensim >= 3.3.0.)

uZeroJ commented Feb 14, 2018

Besides that, I wonder whether there is a direct connection between words and indices when we only have the KeyedVectors, with no Word2Vec model involved. As far as I know, we have to get the index of a word via self.vectors.vocab['word'].index, while getting the word from an index is as easy as self.vectors.index2word.
If my understanding is correct, then the connection between words and indices is split into two separate structures, since vocab wraps what is effectively the word2index mapping. As a result, we cannot update the two directions atomically, even though looking up words from indices is the more frequent operation when training a model.

Above is just my own understanding; please correct me if I'm wrong, and I'd really appreciate it.

Thanks!
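For concreteness, the asymmetry in the gensim 3.x API looks like this, using the kvecs object from the reproduction script above (an illustration only):

# index -> word is a plain list lookup ...
print(kvecs.index2word[0])         # 'the'
# ... while word -> index has to go through the Vocab object inside .vocab.
print(kvecs.vocab['the'].index)    # 0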

menshikh-iv (Contributor) commented:

Thanks for the report, @uZeroJ!

CC: @chinmayapancholi13

menshikh-iv added the bug and difficulty medium labels on Feb 14, 2018
aneesh-joshi (Contributor) commented:

I would like to try this.

aneesh-joshi (Contributor) commented:

As I understand it, gensim KeyedVectors indexes words from 0.

When using Keras, we can set mask_zero=True; as a result, 0 no longer represents a word, but gensim KeyedVectors continues to treat it as one.

@menshikh-iv Do you think it would be a good idea to add a mask_zero option so that KeyedVectors starts its indices from 1?


uZeroJ commented Feb 16, 2018

Hi @aneesh-joshi ,

As a result, 0 doesn't represent words anymore but gensim KeyedVectors continues to treat it like that.

As far as I know, that solution would mask out every token actually represented by 0, such as 'the', in the Keras embedding layer. Say 'the' is at index '0' in gensim KeyedVectors.vectors and we build the Embedding layer by passing KeyedVectors.vectors as the embedding matrix; then whenever we look up 'the' by its index in the input, it will be masked out because of mask_zero=True.
So maybe we need to pad KeyedVectors.vectors (with an extra zero row) when passing it to get_keras_layer, and provide another function for looking up indices from words.
That's my opinion, thanks!
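A small illustration of that collision (my own sketch using plain Keras, not gensim code): with the unshifted indexing, mask_zero=True silently masks every occurrence of the word at index 0.

import numpy as np
from keras.layers import Embedding
import keras.backend as K

# In the unshifted gensim indexing, index 0 stands for 'the'.
emb = Embedding(input_dim=2, output_dim=3, mask_zero=True)
seq = K.constant(np.array([[0, 1, 1]]))
# The layer's mask treats index 0 as padding, so 'the' is dropped downstream.
print(K.eval(emb.compute_mask(seq)))   # [[False  True  True]]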


uZeroJ commented Feb 16, 2018

Maybe something like the snippet below explains what I mean, while keeping the original structure of gensim's KeyedVectors.

def get_keras_layer(self, train_embeddings=False):
    # Prepend a zero row so that index 0 is reserved for padding.
    weights = np.vstack([np.zeros((1, self.vectors.shape[1])), self.vectors])
    layer = Embedding(
        input_dim=weights.shape[0], output_dim=weights.shape[1],
        weights=[weights], trainable=train_embeddings
    )
    return layer

And

def get_keras_indices_lookup(self):
    # An extra lookup from word to (shifted) index, matching the padded weights above.
    lookup = {word: self.vocab[word].index + 1 for word in self.vocab}
    return lookup
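If I read the proposal right, usage would look roughly like this (a sketch only; get_keras_layer and get_keras_indices_lookup are the hypothetical methods above, not an existing gensim API, and kvecs and pad_sequences come from the reproduction script):

emb_layer = kvecs.get_keras_layer(train_embeddings=False)
word2index = kvecs.get_keras_indices_lookup()

# 'the' now maps to 1 and 'and' to 2, so the 0s added by pad_sequences
# no longer collide with any real word.
seqs = [[word2index['the'], word2index['and'], word2index['and']]]
print(pad_sequences(seqs, maxlen=10))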


eromoe commented Mar 29, 2018

I think it would be enough to simply add:

if mask_zero:
    model.index2word = [None] + model.index2word


kuraga commented Jun 15, 2018

Any news?


piskvorky commented Oct 6, 2020

get_keras_embedding was moved to the wiki in #2937. See https://github.com/RaRe-Technologies/gensim/wiki/Using-Gensim-Embeddings-with-Keras-and-Tensorflow for how the new Gensim 4.0 code handles the word-to-index mapping.
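For readers landing here later, a rough sketch of that approach with the Gensim 4.0 KeyedVectors API (my paraphrase, not the wiki's exact code; index_to_key, vector_size and vectors are the 4.0 attribute names):

import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras.layers import Embedding

wv = KeyedVectors.load_word2vec_format('gensim.txt')

# Shift every word index up by one so that 0 is used purely for padding.
word2index = {word: idx + 1 for idx, word in enumerate(wv.index_to_key)}

# Row 0 of the weight matrix is the all-zero padding vector.
weights = np.vstack([np.zeros((1, wv.vector_size)), wv.vectors])
emb_layer = Embedding(input_dim=weights.shape[0], output_dim=weights.shape[1],
                      weights=[weights], mask_zero=True, trainable=False)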
