
Seems Word2VecKeyedVectors.get_keras_embedding should take care of 'mask_zero' #1900

Closed

uZeroJ opened this issue Feb 14, 2018 · 9 comments

Labels: bug (Issue described a bug), difficulty medium (Medium issue: required good gensim understanding & python skills)

Comments

uZeroJ commented Feb 14, 2018

Description

As described on Google Groups, I'm creating this issue for tracking; thanks to Ivan for the quick look.
Google Groups tracking

This is a question about combining gensim with Keras.
Since KeyedVectors.vectors starts at index '0', an input index of '0' should map to a valid word when looking up a vector in the embedding layer (code line).
The Keras Embedding layer, on the other hand, provides a mask_zero option that treats '0' entries as padding (Keras Embedding).
And as shown in the Keras blog tutorial, they pass an embedding matrix to the Embedding layer with word indices starting from '1' and input_dim=len(vocab) + 1.
Thus, if we call get_keras_embedding on a Word2VecKeyedVectors, every '0' padding entry is treated as the first word, or the first word is masked out entirely if we set mask_zero=True.
So I smell a bug here.
Changing the behavior of get_keras_embedding may also require changing model.index2word or KeyedVectors.vocab.

I know there are many ways to work around this issue, such as manually padding with a value other than '0' or manually building a Keras Embedding layer (a minimal sketch of such a workaround follows below). But what could we do to take advantage of both get_keras_embedding and Keras's pad_sequences?
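For reference, a minimal workaround sketch (my own code, not an existing gensim API): build the Embedding layer by hand, stack a zero row on top of the gensim weight matrix so that index 0 is reserved for padding, and shift every gensim word index up by one when turning text into index sequences.

import numpy as np
from gensim.models.keyedvectors import KeyedVectors
from keras.layers import Embedding
from keras.preprocessing.sequence import pad_sequences

kvecs = KeyedVectors.load_word2vec_format('gensim.txt')

# Reserve row 0 for padding by stacking a zero vector on top of the gensim matrix.
weights = np.vstack([np.zeros((1, kvecs.vector_size)), kvecs.vectors])
emb_layer = Embedding(input_dim=weights.shape[0], output_dim=weights.shape[1],
                      weights=[weights], mask_zero=True, trainable=False)

# Word indices are shifted by +1 relative to gensim's own indices.
word2index = {word: kvecs.vocab[word].index + 1 for word in kvecs.vocab}
seqs = [[word2index[w] for w in ('the', 'and', 'and')]]
print(pad_sequences(seqs, maxlen=10))  # the padding 0s no longer collide with 'the'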

Thanks!

Steps/Code/Corpus to Reproduce

import os

import numpy as np
from gensim.models.keyedvectors import KeyedVectors
from keras.layers import Embedding, Input
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
import keras.backend as K

# Tiny word2vec file used for the reproduction.
emb_file = 'gensim.txt'
""" Content of gensim.txt
2 3
the 0 1 2
and 3 4 5
"""
kvecs = KeyedVectors.load_word2vec_format(emb_file)
#print(kvecs.vectors)
#print(kvecs.syn0)
emb_layer = kvecs.get_keras_embedding()

# 'the' has gensim index 0 and 'and' has index 1.
emb_weights = np.arange(2 * 3).reshape((2, 3))
inputs = np.array([0, 1, 1]).reshape((1, 3))
# pad_sequences pads with 0, which collides with the index of 'the'.
inputs_pad = pad_sequences(inputs, maxlen=10)

inp = Input((10,))
#out = Embedding(input_dim=2, output_dim=3, weights=[emb_weights])(inp)
out = emb_layer(inp)
model = Model(inp, out)
print(model.predict(inputs_pad))

Actual Results

[[[0. 1. 2.]
  [0. 1. 2.]
  [0. 1. 2.]
  [0. 1. 2.]
  [0. 1. 2.]
  [0. 1. 2.]
  [0. 1. 2.]
  [0. 1. 2.]
  [3. 4. 5.]
  [3. 4. 5.]]]

Expected Results

The expected result would be something like:

[[[0. 0. 0.]
  [0. 0. 0.]
  [0. 0. 0.]
  [0. 0. 0.]
  [0. 0. 0.]
  [0. 0. 0.]
  [0. 0. 0.]
  [0. 1. 2.]
  [3. 4. 5.]
  [3. 4. 5.]]]

Versions

Darwin-16.7.0-x86_64-i386-64bit
Python 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:04:09)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.14.0
SciPy 1.0.0
gensim 3.3.0
FAST_VERSION 0
(This has nothing to do with the environment as such, but it does require gensim >= 3.3.0.)

uZeroJ commented Feb 14, 2018

Besides that, I wonder whether there is a direct connection between words and indices when we only have the KeyedVectors, with no Word2Vec model involved. As far as I know, we have to get the index of a word via self.vectors.vocab['word'].index, while getting the word from an index is as easy as self.vectors.index2word.
If my understanding is correct, then the connection between words and indices is split into two separate structures, since vocab wraps what is effectively the word2index mapping. As a result, we cannot update the two directions atomically, even though looking up words from indices is the more frequent operation when training a model.

Above is just my own understanding; please correct me if I'm wrong, and I'd really appreciate it.

Thanks!
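For concreteness, the asymmetry in the gensim 3.x API looks like this, using the kvecs object from the reproduction script above (an illustration only):

# index -> word is a plain list lookup ...
print(kvecs.index2word[0])         # 'the'
# ... while word -> index has to go through the Vocab object inside .vocab.
print(kvecs.vocab['the'].index)    # 0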

menshikh-iv (Contributor) commented:

Thanks for the report, @uZeroJ!

CC: @chinmayapancholi13

menshikh-iv added the bug and difficulty medium labels on Feb 14, 2018
aneesh-joshi (Contributor) commented:

I would like to try this.

aneesh-joshi (Contributor) commented:

As I understand it, gensim KeyedVectors indexes words from 0.

When using Keras, we can set mask_zero=True; as a result, 0 no longer represents a word, but gensim KeyedVectors continues to treat it as one.

@menshikh-iv Do you think it would be a good idea to add a mask_zero option so that KeyedVectors starts its indices from 1?


uZeroJ commented Feb 16, 2018

Hi @aneesh-joshi ,

As a result, 0 doesn't represent words anymore but gensim KeyedVectors continues to treat it like that.

As far as I know, that solution would mask out every token actually represented by 0, such as 'the', in the Keras embedding layer. Say 'the' is at index '0' in gensim KeyedVectors.vectors and we build the Embedding layer by passing KeyedVectors.vectors as the embedding matrix; then whenever we look up 'the' by its index in the input, it will be masked out because of mask_zero=True.
So maybe we need to pad KeyedVectors.vectors (with an extra zero row) when passing it to get_keras_layer, and provide another function for looking up indices from words.
That's my opinion, thanks!
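A small illustration of that collision (my own sketch using plain Keras, not gensim code): with the unshifted indexing, mask_zero=True silently masks every occurrence of the word at index 0.

import numpy as np
from keras.layers import Embedding
import keras.backend as K

# In the unshifted gensim indexing, index 0 stands for 'the'.
emb = Embedding(input_dim=2, output_dim=3, mask_zero=True)
seq = K.constant(np.array([[0, 1, 1]]))
# The layer's mask treats index 0 as padding, so 'the' is dropped downstream.
print(K.eval(emb.compute_mask(seq)))   # [[False  True  True]]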


uZeroJ commented Feb 16, 2018

Maybe something like the snippet below explains what I mean, while keeping the original structure of gensim's KeyedVectors.

def get_keras_layer(self, train_embeddings=False):
    # Prepend a zero row so that index 0 is reserved for padding.
    weights = np.vstack([np.zeros((1, self.vectors.shape[1])), self.vectors])
    layer = Embedding(
        input_dim=weights.shape[0], output_dim=weights.shape[1],
        weights=[weights], trainable=train_embeddings
    )
    return layer

And

def get_keras_indices_lookup(self):
    # An extra lookup from word to (shifted) index, matching the padded weights above.
    lookup = {word: self.vocab[word].index + 1 for word in self.vocab}
    return lookup
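If I read the proposal right, usage would look roughly like this (a sketch only; get_keras_layer and get_keras_indices_lookup are the hypothetical methods above, not an existing gensim API, and kvecs and pad_sequences come from the reproduction script):

emb_layer = kvecs.get_keras_layer(train_embeddings=False)
word2index = kvecs.get_keras_indices_lookup()

# 'the' now maps to 1 and 'and' to 2, so the 0s added by pad_sequences
# no longer collide with any real word.
seqs = [[word2index['the'], word2index['and'], word2index['and']]]
print(pad_sequences(seqs, maxlen=10))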


eromoe commented Mar 29, 2018

I think it would be enough to simply add:

if mask_zero:
    model.index2word = [None] + model.index2word


kuraga commented Jun 15, 2018

Any news?


piskvorky commented Oct 6, 2020

get_keras_embedding was moved to the wiki in #2937. See https://github.com/RaRe-Technologies/gensim/wiki/Using-Gensim-Embeddings-with-Keras-and-Tensorflow for how the new Gensim 4.0 code handles the word-to-index mapping.
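For readers landing here later, a rough sketch of that approach with the Gensim 4.0 KeyedVectors API (my paraphrase, not the wiki's exact code; index_to_key, vector_size and vectors are the 4.0 attribute names):

import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras.layers import Embedding

wv = KeyedVectors.load_word2vec_format('gensim.txt')

# Shift every word index up by one so that 0 is used purely for padding.
word2index = {word: idx + 1 for idx, word in enumerate(wv.index_to_key)}

# Row 0 of the weight matrix is the all-zero padding vector.
weights = np.vstack([np.zeros((1, wv.vector_size)), wv.vectors])
emb_layer = Embedding(input_dim=weights.shape[0], output_dim=weights.shape[1],
                      weights=[weights], mask_zero=True, trainable=False)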
