Seems Word2VecKeyedVectors.get_keras_embedding should take care of 'mask_zero' #1900
Comments
And besides, I wonder whether we have a direct connection between words and indexes when only KeyedVectors are available and no Word2Vec model is involved. As far as I know, we have to look the index up word by word. The above is just my understanding, so please correct me and I'd really appreciate it. Thanks!
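To illustrate what I mean by a word-by-word lookup, a minimal sketch assuming the gensim 3.x `KeyedVectors` API (the file name is just a placeholder):

```python
from gensim.models import KeyedVectors

# Only the vectors are available here, no full Word2Vec model involved.
kv = KeyedVectors.load("vectors.kv")  # placeholder path

# The word -> index mapping has to be looked up one word at a time
# through the vocab dict.
sentence = ["the", "quick", "brown", "fox"]
indices = [kv.vocab[w].index for w in sentence if w in kv.vocab]
```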
Thanks for the report @uZeroJ!
I would like to try this.
As I understand it, while using Keras we can set mask_zero ourselves. @menshikh-iv, do you think it's a good idea to use mask_zero here?
Hi @aneesh-joshi ,
As far as I know, this solution will mask all actual tokens that happen to be represented by 0, such as 'the', in the Keras embedding layer. Say 'the' is indexed by '0' in gensim; then with mask_zero=True every genuine occurrence of 'the' would be masked as well.
Maybe something like the sketch below will explain what I mean, while keeping the original framework of gensim.
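For illustration only, here is a rough sketch of that kind of shift, assuming the gensim 3.x and Keras 2.x APIs (the helper name `get_keras_embedding_with_padding` is made up, not an actual gensim function): reserve row 0 of the weight matrix for padding, store the vector of `index2word[i]` at row `i + 1`, and turn `mask_zero` on.

```python
import numpy as np
from keras.layers import Embedding

def get_keras_embedding_with_padding(kv, train_embeddings=False):
    """Keras Embedding layer that reserves index 0 for padding.

    Row 0 is all zeros; the vector for kv.index2word[i] sits at row i + 1,
    so word indices fed to this layer must also be shifted by +1.
    """
    weights = np.vstack([np.zeros((1, kv.vector_size)), kv.vectors])
    return Embedding(
        input_dim=weights.shape[0],   # len(vocab) + 1
        output_dim=kv.vector_size,
        weights=[weights],
        trainable=train_embeddings,
        mask_zero=True,               # index 0 now really means padding
    )
```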
I think simply adding ...
Any news?
Description
As described in the Google Groups thread, I'm creating this issue for tracking. Thanks Ivan for the quick look.
Google Groups tracking
This is a question about combining gensim with Keras.
As KeyedVectors.vectors starts at index '0', the index '0' refers to a valid word and will fetch that word's vector from the embedding layer (code line).
And the Keras Embedding layer does provide mask_zero to treat '0' as padding (Keras Embedding).
And as shown in the tutorial on the Keras blog, they provide an embedding matrix to the Embedding layer with indices starting from '1' and input_dim=len(vocab) + 1.
Thus, if we use get_keras_embedding from Word2VecKeyedVectors, we either treat every '0' padding position as the first word, or lose every occurrence of the first word if we set mask_zero=True.
So I smell a bug here.
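To make the collision concrete, a minimal sketch assuming gensim 3.x and Keras 2.x (the toy corpus is only there to produce a KeyedVectors instance):

```python
from gensim.models import Word2Vec
from keras.preprocessing.sequence import pad_sequences

# Toy corpus, just to get a Word2VecKeyedVectors instance.
kv = Word2Vec([["the", "quick", "brown", "fox", "the"]], min_count=1).wv

first_word = kv.index2word[0]   # on a real corpus this is usually a very frequent word like 'the'

seqs = [[kv.vocab[w].index for w in ["quick", "brown", "fox"]]]
padded = pad_sequences(seqs, maxlen=6)   # pads with 0 by default

layer = kv.get_keras_embedding(train_embeddings=False)
# The layer is built without mask_zero, so every padded 0 is embedded as
# the vector of first_word; if mask_zero were simply switched on instead,
# every genuine occurrence of first_word would be masked away.
```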
And changing the behavior of get_keras_embedding may also require changes to model.index2word or KeyedVectors.vocab.
I know there are many ways to work around this issue, such as manually padding with a value other than '0' or manually building the Keras Embedding layer. But what could we do to take advantage of both get_keras_embedding and pad_sequences in Keras?
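One possible combination, sketched purely as an illustration (the helper name is made up): shift every gensim index up by one when building input sequences, so pad_sequences can keep using 0 for padding, and pair it with a +1-shifted embedding layer like the one sketched in the comments above.

```python
from keras.preprocessing.sequence import pad_sequences

def words_to_padded_indices(kv, sentences, maxlen):
    """Map words to gensim indices shifted by +1, leaving 0 free for padding."""
    seqs = [
        [kv.vocab[w].index + 1 for w in sent if w in kv.vocab]
        for sent in sentences
    ]
    return pad_sequences(seqs, maxlen=maxlen)  # 0 is now unambiguously padding
```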
Thanks!
Steps/Code/Corpus to Reproduce
Actual Results
Expected Results
And the expected result may be ...
Versions