
Gensim error while loading Hebrew #1301

Closed
guybartal opened this issue May 3, 2017 · 9 comments

guybartal commented May 3, 2017

Description

Gensim error while loading Hebrew

Steps/Code/Corpus to Reproduce

from gensim.models.wrappers import FastText
model = FastText.load_fasttext_format('wiki.he')

Expected Results

Actual Results


AssertionError Traceback (most recent call last)
<ipython-input-...> in <module>()
2
3 #num_dims = 300
----> 4 model = FastText.load_fasttext_format('wiki.he')

/home/vmadmin/anaconda3/lib/python3.5/site-packages/gensim/models/wrappers/fasttext.py in load_fasttext_format(cls, model_file)
236 model = cls()
237 model.wv = cls.load_word2vec_format('%s.vec' % model_file)
--> 238 model.load_binary_data('%s.bin' % model_file)
239 return model
240

/home/vmadmin/anaconda3/lib/python3.5/site-packages/gensim/models/wrappers/fasttext.py in load_binary_data(self, model_binary_file)
253 with utils.smart_open(model_binary_file, 'rb') as f:
254 self.load_model_params(f)
--> 255 self.load_dict(f)
256 self.load_vectors(f)
257

/home/vmadmin/anaconda3/lib/python3.5/site-packages/gensim/models/wrappers/fasttext.py in load_dict(self, file_handle)
274 (vocab_size, nwords, _) = self.struct_unpack(file_handle, '@3i')
275 # Vocab stored by Dictionary::save
--> 276 assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
277 assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
278 ntokens, = self.struct_unpack(file_handle, '@q')

AssertionError: mismatch between vocab sizes

Versions

Linux-4.4.0-75-generic-x86_64-with-debian-stretch-sid
Python 3.5.2 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.11.1
SciPy 0.18.1
gensim 1.0.1
FAST_VERSION 2


matanox commented May 7, 2017

Looks a bit like #1236. The error message might seem to imply a problem with how fastText produces the data. While the description above used gensim 1.0.1, the error also reproduces with gensim 2.0.0.

File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/gensim/models/wrappers/fasttext.py", line 239, in load_fasttext_format
model.load_binary_data('%s.bin' % model_file, encoding=encoding)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/wrappers/fasttext.py", line 256, in load_binary_data
self.load_dict(f, encoding=encoding)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/wrappers/fasttext.py", line 277, in load_dict
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'

Notably, this doesn't happen with all pretrained (Hebrew) embeddings created by fastText.

@kewlcoder

Has this issue been resolved?
If yes, can you please share the reference?


tmylk commented May 10, 2017

Hi @kewlcoder, this is not an issue with the gensim wrapper but with the trained FastText model, as there is a vocabulary mismatch between the .bin and .vec files. @prakhar2b, could you please investigate and raise an issue in the FastText repo?

Please use FastText.load_word2vec_format('FILENAME.vec') as a workaround for now.
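For example, a minimal sketch of that workaround (the file name is illustrative):

from gensim.models.wrappers import FastText

# Workaround sketch: load only the .vec file, which skips the
# problematic .bin parsing entirely.
word_vectors = FastText.load_word2vec_format('wiki.he.vec')

# Only in-vocabulary lookups are possible this way; the .vec file carries
# no subword (character n-gram) information for out-of-vocabulary words.
print(word_vectors['שלום'])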


prakhar2b commented May 10, 2017

@tmylk @jayantj The mismatch is not only in the pretrained models released by Facebook; we are getting this error for all models trained by fastText, which was not the case earlier. Something might have changed in fastText. I'm looking into the code to see whether the change is intentional, and then I'll raise an issue in the fastText repo.


jayantj commented May 10, 2017

@prakhar2b Sounds good.
Also, the assertion checking for a mismatch between the vocab sizes in the .vec and .bin files was written as part of a defensive approach, to make sure there weren't any "silent" bugs.
If the mismatch doesn't make an actual difference and it is possible to proceed with loading the model, changing the assert to a warning log would be a decent solution, IMO.
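A minimal sketch of that approach (a hypothetical standalone helper, not gensim's actual code):

import logging

logger = logging.getLogger(__name__)

def check_vocab_sizes(vec_vocab_len, nwords, vocab_size):
    # Hypothetical helper: log a warning instead of raising AssertionError,
    # so model loading can proceed despite the mismatch.
    if vec_vocab_len != nwords:
        logger.warning("mismatch between vocab sizes: %d in .vec vs nwords=%d in .bin",
                       vec_vocab_len, nwords)
    if vec_vocab_len != vocab_size:
        logger.warning("mismatch between vocab sizes: %d in .vec vs vocab_size=%d in .bin",
                       vec_vocab_len, vocab_size)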


dgg5503 commented May 10, 2017

About 8 days ago, fastText added two additional int32_t fields to the .bin model header: "magic" and "version". As of right now, fasttext.py does not account for these integers when reading in the model parameters, causing every subsequent read to be off by two integers. Take a look at the checkModel function in https://github.com/facebookresearch/fastText/blob/master/src/fasttext.cc to see what I'm talking about.

EDIT:
Also important to note is the addition of dictionary pruning, which adds an int64_t (pruneidx_size) immediately after the existing int64_t (ntokens) when reading the dictionary from the .bin file. This is not accounted for in the current version of fasttext.py either.

I have made some quick edits to fasttext.py that make it compatible with the latest version of fastText models (attached below). All I did was add the additional reads, so a plain model without quantization/dictionary pruning can be read. Please note that I didn't do any extensive testing, so use at your own risk.

fasttext.zip
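For reference, a rough sketch of the extra reads (not the attached patch itself; struct_unpack mirrors the helper of the same name in gensim's fasttext.py, and the '@2i' format for the two new header fields is an assumption based on the description above):

import struct

def struct_unpack(file_handle, fmt):
    # Stand-in for the helper of the same name in gensim's fasttext.py.
    num_bytes = struct.calcsize(fmt)
    return struct.unpack(fmt, file_handle.read(num_bytes))

def load_model_params(file_handle):
    # New-format .bin files begin with two extra int32s: a magic number
    # and a version (see checkModel in fasttext.cc).
    magic, version = struct_unpack(file_handle, '@2i')
    # ... the original model parameters (dim, ws, epoch, ...) are read here.

def load_dict(file_handle):
    vocab_size, nwords, nlabels = struct_unpack(file_handle, '@3i')
    ntokens, = struct_unpack(file_handle, '@q')
    # New format: dictionary pruning adds an int64 (pruneidx_size)
    # immediately after ntokens.
    pruneidx_size, = struct_unpack(file_handle, '@q')
    # ... vocabulary entries are read here.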


prakhar2b commented May 12, 2017

@dgg5503 @tmylk @jayantj Here are the values from struct_unpack for models trained by fastText on the text8 data (len(self.wv.vocab) comes from the .vec file; nwords and vocab_size come from the .bin file via struct.unpack):

parameter            fastText (old)   fastText (new)
len(self.wv.vocab)   71290            71290
nwords               71290            1058682594
vocab_size           71290            -350469331

I'm not sure why we get a negative value for vocab_size (@jayantj, please comment on this). If this is unintended, we should report it to the fastText repo.

Also, if this is an intentional mismatch (which it seems to be), then there is no point in having any assert or warning statement. I think we should use the .vec and .bin files separately for different purposes, assuming that Facebook's fastText code is working fine. This was also discussed in issue #1261 (improving fastText loading time) when comparing with salestock's fastText loading mechanism, which used only the .bin file for loading.


jayantj commented May 12, 2017

@prakhar2b Are you sure about this? Looking at those values, it seems very likely to me that we are reading the wrong bytes for the values of nwords and vocab_size.

Also, on a different note: the issue raised about the model trained on the French wiki is quite old, from before the fastText magic and version variables were added. I believe these are probably two different issues.
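For what it's worth, the garbage values in the table above are consistent with a misaligned reader: fastText's default sampling threshold t = 1e-4, stored as a double in the .bin header, reinterpreted as two little-endian int32s, gives exactly those numbers. A small self-contained demo:

import struct

# fastText stores the sampling threshold t (default 1e-4) as a double in
# the .bin header. A reader that is off by a few fields can land on those
# 8 bytes and interpret them as int32 counts:
t_bytes = struct.pack('<d', 1e-4)
vocab_size, nwords = struct.unpack('<2i', t_bytes)
print(vocab_size, nwords)  # -350469331 1058682594, matching the table above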


prakhar2b commented May 13, 2017

UPDATE: I've solved this. I'll submit a final PR ASAP. Thanks @dgg5503 for the suggestions.
