Gensim error while loading Hebrew #1301
Looks a bit like #1236. The error message might seem to imply a problem with how fastText produces the data. While the description above used gensim 1.0.1, the error also reproduces with gensim 2.0.0.
Notably, this doesn't happen with all pretrained (Hebrew) embeddings created by fastText.
Has this issue been resolved?
Hi @kewlcoder This is not an issue with the gensim wrapper but with the trained FastText model: there is a mismatch in vocab sizes between the .bin and .vec files. @prakhar2b, could you please investigate and raise an issue in the FastText repo?
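For anyone who wants to check their own files, here is a quick sanity-check sketch (mine, not gensim code), assuming the standard fastText .vec header where the first line is "vocab_size dim"; the file name follows the report below:

```python
# Sketch: read the vocab size declared in the .vec header so it can be
# compared against the vocab_size stored in the .bin dictionary.
# 'wiki.he' is the pre-trained Hebrew pair from this report.
with open('wiki.he.vec', encoding='utf-8') as f:
    vec_vocab_size = int(f.readline().split()[0])

print('vocab size declared in .vec:', vec_vocab_size)
# If this number differs from the vocab_size read out of wiki.he.bin,
# gensim's load_fasttext_format fails with the AssertionError shown below.
```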
@tmylk @jayantj The mismatch is not only in the pretrained models released by Facebook; we are getting this error for all models trained by fastText, which was not the case earlier. Something might have changed in fastText. I'm looking into the code to see whether this is intentional, and then I'll raise an issue in the fastText repo.
@prakhar2b Sounds good.
About 8 days ago, fastText added two additional int32_t's to the .bin model header: "magic" and "version". As of right now, fasttext.py does not account for these integers when reading the model parameters, so every subsequent read is off by two integers. Take a look at the checkModel function in https://github.com/facebookresearch/fastText/blob/master/src/fasttext.cc to see what I mean. EDIT: I have made some quick edits to fasttext.py that make it compatible with the latest fastText models. All I did was add additional reads, so a plain model without quantization/dictionary pruning can be read. Please note that I didn't do any extensive testing, so use at your own risk.
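A minimal sketch of the kind of compatibility read being described (the helper name and structure are mine, not the actual fasttext.py patch): peek at the first int32 and, if it equals fastText's magic number, consume the magic and version ints before the usual parameter reads.

```python
import struct

# Magic constant from fastText's src/fasttext.cc (FASTTEXT_FILEFORMAT_MAGIC_INT32).
FASTTEXT_FILEFORMAT_MAGIC = 793712314

def skip_new_header_if_present(f):
    """If f starts with the new (magic, version) header, consume it.

    Returns the format version for new-style .bin files, or None for
    old-style files, in which case the file position is rewound so the
    existing parameter reads still line up.
    """
    start = f.tell()
    magic, = struct.unpack('@i', f.read(4))
    if magic == FASTTEXT_FILEFORMAT_MAGIC:
        version, = struct.unpack('@i', f.read(4))
        return version
    f.seek(start)  # old-format file: no header to skip
    return None
```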
@dgg5503 @tmylk @jayantj struct_unpack output for models trained by fastText on text8 data -
I'm not sure why we get a negative value for vocab_size. @jayantj (please comment on this). If this is undesirable, we should report it to the fastText repo. Also, if this mismatch is intentional (which it seems to be), then there is no point in adding an assert or warning statement. I think we should use the .vec and .bin files separately for different purposes, assuming that Facebook's fastText code is working fine. This was also discussed in issue #1261 (improving fastText loading time) while comparing with salestock's fastText loading mechanism, which uses only the .bin file.
@prakhar2b Are you sure about this? Looking at those values, it seems very likely to me that we are reading the wrong bytes for the values of vocab_size and nwords. Also, on a different note: the issue raised about the model trained on French wiki is quite old, from before the fastText binary format change.
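A simplified illustration of that hypothesis (all values below are made up for the demo; in the real file the misalignment actually begins earlier, at the parameter reads): if the file now starts with two extra int32s and the reader doesn't skip them, the '@3i' unpack returns the magic and version instead of the real dictionary fields.

```python
import struct

# Hypothetical new-style header: magic and version int32s (values illustrative).
header = struct.pack('@2i', 793712314, 11)
# Dictionary fields as load_dict expects them: vocab_size, nwords, nlabels.
dict_fields = struct.pack('@3i', 71290, 71290, 0)
buf = header + dict_fields

# An old reader that is two int32s behind unpacks the wrong bytes:
vocab_size, nwords, _ = struct.unpack_from('@3i', buf, 0)
print(vocab_size, nwords)  # 793712314 11 -- garbage; depending on the bytes
# that land at the misread offset, the result can just as easily be negative.
```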
UPDATE: I've solved this. I'll submit a final PR ASAP. Thanks @dgg5503 for the suggestions.
Description
Gensim error while loading Hebrew
Steps/Code/Corpus to Reproduce
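No reproduction code was captured here; the following is reconstructed from the traceback under Actual Results ('wiki.he' is the pre-trained Hebrew .bin/.vec pair released by fastText):

```python
from gensim.models.wrappers.fasttext import FastText

# Assumes wiki.he.bin and wiki.he.vec are in the working directory.
model = FastText.load_fasttext_format('wiki.he')  # raises AssertionError (see below)
```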
Expected Results
Actual Results
AssertionError Traceback (most recent call last)
&lt;ipython-input&gt; in &lt;module&gt;()
2
3 #num_dims = 300
----> 4 model = FastText.load_fasttext_format('wiki.he')
/home/vmadmin/anaconda3/lib/python3.5/site-packages/gensim/models/wrappers/fasttext.py in load_fasttext_format(cls, model_file)
236 model = cls()
237 model.wv = cls.load_word2vec_format('%s.vec' % model_file)
--> 238 model.load_binary_data('%s.bin' % model_file)
239 return model
240
/home/vmadmin/anaconda3/lib/python3.5/site-packages/gensim/models/wrappers/fasttext.py in load_binary_data(self, model_binary_file)
253 with utils.smart_open(model_binary_file, 'rb') as f:
254 self.load_model_params(f)
--> 255 self.load_dict(f)
256 self.load_vectors(f)
257
/home/vmadmin/anaconda3/lib/python3.5/site-packages/gensim/models/wrappers/fasttext.py in load_dict(self, file_handle)
274 (vocab_size, nwords, _) = self.struct_unpack(file_handle, '@3i')
275 # Vocab stored by Dictionary::save
--> 276 assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
277 assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
278 ntokens, = self.struct_unpack(file_handle, '@q')
AssertionError: mismatch between vocab sizes
Versions
Linux-4.4.0-75-generic-x86_64-with-debian-stretch-sid
Python 3.5.2 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.11.1
SciPy 0.18.1
gensim 1.0.1
FAST_VERSION 2