Gensim error when loading French FastText #1236

jplu · 2017-03-23T11:24:37Z

Hello,

I'm trying to use the fasttext wrapper in order to load the French model that one can find here. Unfortunately I get the following error:

Traceback (most recent call last):
  File "app.py", line 18, in <module>
    model = FastText.load_fasttext_format(os.path.abspath('./wiki.fr'))
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.py", line 238, in load_fasttext_format
    model.load_binary_data('%s.bin' % model_file)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.py", line 255, in load_binary_data
    self.load_dict(f)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.py", line 277, in load_dict
    assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
AssertionError: mismatch between vocab sizes

I'm using the following environment:

>>> import platform; print(platform.platform())
Darwin-16.4.0-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
('Python', '2.7.13 (default, Dec 28 2016, 14:29:07) \n[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]')
>>> import numpy; print("NumPy", numpy.__version__)
('NumPy', '1.12.0')
>>> import scipy; print("SciPy", scipy.__version__)
('SciPy', '0.19.0')
>>> import gensim; print("gensim", gensim.__version__)
('gensim', '1.0.1')
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
('FAST_VERSION', 0)

Steps to reproduce the error:

wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.fr.zip
unzip wiki.fr.zip
python -c "import os;from gensim.models.wrappers import FastText;FastText.load_fasttext_format(os.path.abspath('./wiki.fr'))"

I don't know whether this is a bug in gensim or a problem with the model itself. Any help would be appreciated.

Thanks in advance.


akutuzov · 2017-03-23T19:12:27Z

Try the develop branch of Gensim, I think #1189 has something to do with your problem.

jplu · 2017-03-23T19:33:52Z

Unfortunately I get the exact same error. Here are the steps I followed:

cd /tmp
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.fr.zip
unzip wiki.fr.zip
pip uninstall gensim
git clone https://github.com/RaRe-Technologies/gensim
PYTHONPATH="/tmp/gensim" python -c "import os;from gensim.models.wrappers import FastText;FastText.load_fasttext_format(os.path.abspath('./wiki.fr'))"

tmylk · 2017-03-23T19:49:06Z

Thanks for reporting. The error is different from ValueError: invalid vector on line 12898 fixed by @jayantj in #1189.

It might be accidentally fixed in #1214 branch - you are welcome to clone that code.

It would be easier to fix if there was some smaller model to reproduce... Unfortunately the download takes many hours.

jplu · 2017-03-23T20:01:37Z

I tried the same steps as before, but cloned "https://github.com/jaksmid/gensim" instead. I still get the exact same error :(

tmylk · 2017-03-23T20:03:47Z

Can you partially load the model with model = FastText.load_word2vec_format('FILENAME.vec')?

The failing part is model.load_binary_data('FILENAME.bin') but you might not need that, depending on your use case.

jayantj · 2017-03-23T20:51:22Z

I've managed to download the model, looking into the bug.

tmylk · 2017-03-24T00:55:53Z

Thanks for looking into this @jayantj . I will make a new release after this is fixed.

jplu · 2017-03-24T08:42:19Z

@tmylk your suggestion to use FastText.load_word2vec_format('FILENAME.vec') works.

tmylk · 2017-05-02T23:58:50Z

There is a mismatch in vocab between the .bin and .vec files. We should raise it with the FastText project that created the model. CC @prakhar2b
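To check a downloaded .vec file for this kind of inconsistency, one can compare the vocabulary size declared in its header with the words actually present in the file. The following is a minimal sketch, assuming the usual word2vec text layout (a header line `<vocab_size> <dim>`, then one word per line followed by its vector). The function name `vec_vocab_stats` and the sample data are made up for illustration and are not part of gensim or fastText. It only inspects the .vec side, so it cannot by itself confirm what the .bin file contains.

```python
def vec_vocab_stats(lines):
    """Summarise a word2vec-style text (.vec) file given as an iterable of lines.

    Returns (declared_size, row_count, unique_word_count).
    """
    lines = iter(lines)
    # Header line: "<vocab_size> <vector_dim>"
    declared_size = int(next(lines).split()[0])
    row_count = 0
    unique_words = set()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        # Assumption: the word is everything before the first space
        unique_words.add(line.split(" ", 1)[0])
        row_count += 1
    return declared_size, row_count, len(unique_words)


# Tiny made-up sample: the header declares 3 words, but "le" appears twice
sample = [
    "3 2\n",
    "le 0.1 0.2\n",
    "la 0.3 0.4\n",
    "le 0.5 0.6\n",
]
declared, rows, unique = vec_vocab_stats(sample)
print(declared, rows, unique)
```

For a real file, the same function can be fed an open file handle, for example `vec_vocab_stats(io.open('wiki.fr.vec', encoding='utf-8'))`. If the three numbers disagree, the file contains duplicate or missing entries relative to its header.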

jplu · 2017-05-03T08:06:43Z

Thanks for the update!

kewlcoder · 2017-05-10T08:04:04Z

Has this issue been resolved?
If yes, can you please share the reference?

tmylk · 2017-05-10T10:04:48Z

@kewlcoder replied to the same question in #1301

jayantj · 2017-05-16T17:48:18Z

The issue for loading the French wiki is most likely due to a FastText bug - reported here - facebookresearch/fastText#218

The issue with loading the latest FastText models (including the Hebrew model) is due to a change in the way the new models are stored, and will be fixed in #1319
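A quick way to tell which storage layout a .bin file uses is to look at its first bytes. The sketch below rests on an assumption that is not stated in this thread: that newer fastText .bin files begin with a 32-bit magic number (793712314) followed by a version number, while older files start directly with the model parameters. The helper name `looks_like_new_format` is hypothetical, and little-endian byte order is assumed.

```python
import struct

# Assumption: magic number written at the start of newer fastText .bin files
FASTTEXT_MAGIC = 793712314


def looks_like_new_format(first_bytes):
    """Return True if the given leading bytes start with the assumed magic number."""
    if len(first_bytes) < 4:
        return False
    # Read one little-endian signed 32-bit integer
    (magic,) = struct.unpack("<i", first_bytes[:4])
    return magic == FASTTEXT_MAGIC


# Made-up headers for illustration: magic + version, and an old-style first field
new_header = struct.pack("<ii", FASTTEXT_MAGIC, 12)
old_header = struct.pack("<i", 300)
print(looks_like_new_format(new_header), looks_like_new_format(old_header))
```

With a real model, the bytes would come from something like `open('wiki.he.bin', 'rb').read(8)`, where the filename is only an example.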

menshikh-iv · 2017-06-28T09:28:21Z

Fixed in #1341 & #1319

tmylk added the "bug" and "difficulty medium" labels Mar 23, 2017

matanox mentioned this issue May 7, 2017

Gensim error while loading Hebrew #1301

Closed

prakhar2b mentioned this issue May 14, 2017

support both old and new fastText model #1319

Merged

jayantj mentioned this issue May 16, 2017

mismatch in vec and bin files in french pretrained vector facebookresearch/fastText#218

Closed

prakhar2b mentioned this issue May 22, 2017

Loading fastText models using only bin file #1341

Merged

beeva-enriqueotero mentioned this issue May 25, 2017

FastText model gets error with typical methods #1343

Closed

menshikh-iv closed this as completed Jun 28, 2017
