
Gensim error while loading Hebrew #1301

Closed
guybartal opened this issue May 3, 2017 · 9 comments

guybartal commented May 3, 2017

Description

Gensim error while loading Hebrew

Steps/Code/Corpus to Reproduce

from gensim.models.wrappers import FastText
model = FastText.load_fasttext_format('wiki.he')

Expected Results

Actual Results


AssertionError Traceback (most recent call last)
<ipython-input-...> in <module>()
2
3 #num_dims = 300
----> 4 model = FastText.load_fasttext_format('wiki.he')

/home/vmadmin/anaconda3/lib/python3.5/site-packages/gensim/models/wrappers/fasttext.py in load_fasttext_format(cls, model_file)
236 model = cls()
237 model.wv = cls.load_word2vec_format('%s.vec' % model_file)
--> 238 model.load_binary_data('%s.bin' % model_file)
239 return model
240

/home/vmadmin/anaconda3/lib/python3.5/site-packages/gensim/models/wrappers/fasttext.py in load_binary_data(self, model_binary_file)
253 with utils.smart_open(model_binary_file, 'rb') as f:
254 self.load_model_params(f)
--> 255 self.load_dict(f)
256 self.load_vectors(f)
257

/home/vmadmin/anaconda3/lib/python3.5/site-packages/gensim/models/wrappers/fasttext.py in load_dict(self, file_handle)
274 (vocab_size, nwords, _) = self.struct_unpack(file_handle, '@3i')
275 # Vocab stored by Dictionary::save
--> 276 assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
277 assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
278 ntokens, = self.struct_unpack(file_handle, '@q')

AssertionError: mismatch between vocab sizes

Versions

Linux-4.4.0-75-generic-x86_64-with-debian-stretch-sid
Python 3.5.2 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.11.1
SciPy 0.18.1
gensim 1.0.1
FAST_VERSION 2


matanox commented May 7, 2017

Looks a bit like #1236. The error message might seem to imply a problem with how fastText produces the data. While the description above used gensim 1.0.1, the error also reproduces with gensim 2.0.0.

File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/gensim/models/wrappers/fasttext.py", line 239, in load_fasttext_format
model.load_binary_data('%s.bin' % model_file, encoding=encoding)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/wrappers/fasttext.py", line 256, in load_binary_data
self.load_dict(f, encoding=encoding)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/wrappers/fasttext.py", line 277, in load_dict
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'

Notably, this doesn't happen with all pretrained (Hebrew) embeddings created by fastText.

@kewlcoder

Has this issue been resolved?
If yes, can you please share the reference?


tmylk commented May 10, 2017

Hi @kewlcoder, this is not an issue with the gensim wrapper but with the trained FastText model, as there is a vocabulary mismatch between the .bin and .vec files. @prakhar2b, could you please investigate and raise an issue in the FastText repo?

Please use FastText.load_word2vec_format('FILENAME.vec') as a workaround for now.
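For example, a minimal sketch of that workaround (the file name is illustrative):

from gensim.models.wrappers import FastText

# Workaround sketch: load only the .vec file, which skips the
# problematic .bin parsing entirely.
word_vectors = FastText.load_word2vec_format('wiki.he.vec')

# Only in-vocabulary lookups are possible this way; the .vec file carries
# no subword (character n-gram) information for out-of-vocabulary words.
print(word_vectors['שלום'])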


prakhar2b commented May 10, 2017

@tmylk @jayantj The mismatch is not only in the pretrained models released by Facebook; we are getting this error for all models trained by fastText, which was not the case earlier. Something might have changed in fastText. I'm looking into the code to see whether the change is intentional, and then I'll raise an issue in the fastText repo.


jayantj commented May 10, 2017

@prakhar2b Sounds good.
Also, the assertion checking for a mismatch between the vocab sizes in the .vec and .bin files was written as part of a defensive approach, to make sure there weren't any "silent" bugs.
If the mismatch doesn't make an actual difference and it is possible to proceed with loading the model, changing the assert to a warning log would be a decent solution, IMO.
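A minimal sketch of that approach (a hypothetical standalone helper, not gensim's actual code):

import logging

logger = logging.getLogger(__name__)

def check_vocab_sizes(vec_vocab_len, nwords, vocab_size):
    # Hypothetical helper: log a warning instead of raising AssertionError,
    # so model loading can proceed despite the mismatch.
    if vec_vocab_len != nwords:
        logger.warning("mismatch between vocab sizes: %d in .vec vs nwords=%d in .bin",
                       vec_vocab_len, nwords)
    if vec_vocab_len != vocab_size:
        logger.warning("mismatch between vocab sizes: %d in .vec vs vocab_size=%d in .bin",
                       vec_vocab_len, vocab_size)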


dgg5503 commented May 10, 2017

About 8 days ago, fastText added two additional int32_t fields to the .bin model header: "magic" and "version". As of right now, fasttext.py does not account for these integers when reading in the model parameters, causing every subsequent read to be off by two integers. Take a look at the checkModel function in https://github.com/facebookresearch/fastText/blob/master/src/fasttext.cc to see what I'm talking about.

EDIT:
Also important to note is the addition of dictionary pruning, which adds an int64_t (pruneidx_size) immediately after the existing int64_t (ntokens) when reading the dictionary from the .bin file. This is not accounted for in the current version of fasttext.py either.

I have made some quick edits to fasttext.py that make it compatible with the latest version of fastText models (attached below). All I did was add the additional reads, so a plain model without quantization/dictionary pruning can be read. Please note that I didn't do any extensive testing, so use at your own risk.

fasttext.zip
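For reference, a rough sketch of the extra reads (not the attached patch itself; struct_unpack mirrors the helper of the same name in gensim's fasttext.py, and the '@2i' format for the two new header fields is an assumption based on the description above):

import struct

def struct_unpack(file_handle, fmt):
    # Stand-in for the helper of the same name in gensim's fasttext.py.
    num_bytes = struct.calcsize(fmt)
    return struct.unpack(fmt, file_handle.read(num_bytes))

def load_model_params(file_handle):
    # New-format .bin files begin with two extra int32s: a magic number
    # and a version (see checkModel in fasttext.cc).
    magic, version = struct_unpack(file_handle, '@2i')
    # ... the original model parameters (dim, ws, epoch, ...) are read here.

def load_dict(file_handle):
    vocab_size, nwords, nlabels = struct_unpack(file_handle, '@3i')
    ntokens, = struct_unpack(file_handle, '@q')
    # New format: dictionary pruning adds an int64 (pruneidx_size)
    # immediately after ntokens.
    pruneidx_size, = struct_unpack(file_handle, '@q')
    # ... vocabulary entries are read here.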


prakhar2b commented May 12, 2017

@dgg5503 @tmylk @jayantj Here are the values from struct_unpack for models trained by fastText on the text8 data (len(self.wv.vocab) comes from the .vec file; nwords and vocab_size come from the .bin file via struct.unpack):

parameter            fastText (old)   fastText (new)
len(self.wv.vocab)   71290            71290
nwords               71290            1058682594
vocab_size           71290            -350469331

I'm not sure why we get a negative value for vocab_size (@jayantj, please comment on this). If this is unintended, we should report it to the fastText repo.

Also, if this is an intentional mismatch (which it seems to be), then there is no point in having any assert or warning statement. I think we should use the .vec and .bin files separately for different purposes, assuming that Facebook's fastText code is working fine. This was also discussed in issue #1261 (improving fastText loading time) when comparing with salestock's fastText loading mechanism, which used only the .bin file for loading.


jayantj commented May 12, 2017

@prakhar2b Are you sure about this? Looking at those values, it seems very likely to me that we are reading the wrong bytes for the values of nwords and vocab_size.

Also, on a different note: the issue raised about the model trained on the French wiki is quite old, from before the fastText magic and version variables were added. I believe these are probably two different issues.
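For what it's worth, the garbage values in the table above are consistent with a misaligned reader: fastText's default sampling threshold t = 1e-4, stored as a double in the .bin header, reinterpreted as two little-endian int32s, gives exactly those numbers. A small self-contained demo:

import struct

# fastText stores the sampling threshold t (default 1e-4) as a double in
# the .bin header. A reader that is off by a few fields can land on those
# 8 bytes and interpret them as int32 counts:
t_bytes = struct.pack('<d', 1e-4)
vocab_size, nwords = struct.unpack('<2i', t_bytes)
print(vocab_size, nwords)  # -350469331 1058682594, matching the table above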


prakhar2b commented May 13, 2017

UPDATE: I've solved this. I'll submit a final PR ASAP. Thanks @dgg5503 for the suggestions.
