Gensim error when loading French FastText #1236

Closed
jplu opened this issue Mar 23, 2017 · 14 comments
Labels: bug (Issue described a bug), difficulty medium (Medium issue: required good gensim understanding & python skills)

Comments

@jplu

jplu commented Mar 23, 2017

Hello,

I'm trying to use the fasttext wrapper in order to load the French model that one can find here. Unfortunately I get the following error:

Traceback (most recent call last):
  File "app.py", line 18, in <module>
    model = FastText.load_fasttext_format(os.path.abspath('./wiki.fr'))
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.py", line 238, in load_fasttext_format
    model.load_binary_data('%s.bin' % model_file)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.py", line 255, in load_binary_data
    self.load_dict(f)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.py", line 277, in load_dict
    assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
AssertionError: mismatch between vocab sizes

I'm using the following environment:

>>> import platform; print(platform.platform())
Darwin-16.4.0-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
('Python', '2.7.13 (default, Dec 28 2016, 14:29:07) \n[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]')
>>> import numpy; print("NumPy", numpy.__version__)
('NumPy', '1.12.0')
>>> import scipy; print("SciPy", scipy.__version__)
('SciPy', '0.19.0')
>>> import gensim; print("gensim", gensim.__version__)
('gensim', '1.0.1')
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
('FAST_VERSION', 0)

Steps to reproduce the error:

wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.fr.zip
unzip wiki.fr.zip
python -c "import os;from gensim.models.wrappers import FastText;FastText.load_fasttext_format(os.path.abspath('./wiki.fr'))"

I don't know whether this is a bug in gensim or a problem with the model itself. Any help would be appreciated.

Thanks in advance.

@akutuzov
Contributor

Try the develop branch of Gensim; I think #1189 has something to do with your problem.

@jplu
Author

jplu commented Mar 23, 2017

Unfortunately I get the exact same error. Here are the steps I followed:

cd /tmp
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.fr.zip
unzip wiki.fr.zip
pip uninstall gensim
git clone https://github.com/RaRe-Technologies/gensim
PYTHONPATH="/tmp/gensim" python -c "import os;from gensim.models.wrappers import FastText;FastText.load_fasttext_format(os.path.abspath('./wiki.fr'))"

@tmylk
Contributor

tmylk commented Mar 23, 2017

Thanks for reporting. The error is different from ValueError: invalid vector on line 12898 fixed by @jayantj in #1189.

It might have been fixed incidentally in the #1214 branch; you are welcome to clone that code.

It would be easier to fix if there was some smaller model to reproduce... Unfortunately the download takes many hours.

@jplu
Author

jplu commented Mar 23, 2017

I tried the same steps as before, but cloned "https://github.com/jaksmid/gensim" instead. I still get the exact same error :(

@tmylk
Contributor

tmylk commented Mar 23, 2017

Can you partially load the model with model = FastText.load_word2vec_format('FILENAME.vec')?

The failing part is model.load_binary_data('FILENAME.bin') but you might not need that, depending on your use case.
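
For context on what a .vec-only load provides: the .vec file is plain text in the word2vec format, with a header line giving the word count and vector dimension, followed by one word and its components per line. It holds whole-word vectors only; the subword n-gram vectors live in the .bin file. The sketch below is a standard-library stand-in for illustration, not gensim's loader. The function name `read_vec` and the sample data are made up.

```python
import io


def read_vec(stream, limit=None):
    """Parse word2vec-style text: a '<count> <dim>' header, then one word per line."""
    header = stream.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        if limit is not None and len(vectors) >= limit:
            break
        if not line.strip():
            continue  # ignore blank lines
        # Split from the right so that exactly `dim` floats are peeled off the end.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            raise ValueError("unexpected row: %r" % line)
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return count, dim, vectors


# Tiny made-up sample standing in for wiki.fr.vec (Python 3).
sample = io.StringIO("2 3\nle 0.1 0.2 0.3 \nde 0.4 0.5 0.6 \n")
count, dim, vectors = read_vec(sample)
print(count, dim, sorted(vectors))
```

On a real file, something like `read_vec(io.open('wiki.fr.vec', encoding='utf-8'), limit=1000)` would read the first thousand rows. Words missing from the file get no vector this way, since nothing is inferred from n-grams.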

tmylk added the bug and difficulty medium labels Mar 23, 2017
@jayantj
Contributor

jayantj commented Mar 23, 2017

I've managed to download the model, looking into the bug.

@tmylk
Contributor

tmylk commented Mar 24, 2017

Thanks for looking into this, @jayantj. I will make a new release after this is fixed.

@jplu
Author

jplu commented Mar 24, 2017

@tmylk your suggestion to use FastText.load_word2vec_format('FILENAME.vec') works.

@tmylk
Contributor

tmylk commented May 2, 2017

There is a mismatch in vocab between the .bin and .vec files. We should raise it with the FastText project that created the model. CC @prakhar2b
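
A rough way to inspect the .vec side of such a mismatch is to compare the word count declared in the header with what the rows actually contain. This is only a sketch using the standard library. The name `vec_report` and the sample data are hypothetical, and duplicate or malformed rows are just two ways a count could drift, not a confirmed diagnosis for wiki.fr.

```python
import io
from collections import Counter


def vec_report(stream):
    """Compare the header's declared word count with the rows that follow."""
    declared, dim = (int(x) for x in stream.readline().split()[:2])
    seen = Counter()
    malformed = 0
    for line in stream:
        fields = line.rstrip("\n").rstrip(" ").split(" ")
        if len(fields) != dim + 1:
            # e.g. a token that itself contains a space
            malformed += 1
            continue
        seen[fields[0]] += 1
    return {
        "declared": declared,
        "rows": sum(seen.values()) + malformed,
        "distinct": len(seen),
        "duplicates": sorted(w for w, n in seen.items() if n > 1),
        "malformed": malformed,
    }


# Made-up sample: 'le' appears twice and one row has an extra field.
sample_vec = io.StringIO(
    "4 2\nle 0.1 0.2\nde 0.3 0.4\nle 0.5 0.6\nun mot 0.7 0.8\n"
)
report = vec_report(sample_vec)
print(report)
```

If `distinct` is smaller than `declared`, a loader that stores words in a dictionary would end up with fewer entries than the header promises.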

@jplu
Author

jplu commented May 3, 2017

Thanks for the update!

@kewlcoder

Has this issue been resolved?
If yes, can you please share the reference?

@tmylk
Contributor

tmylk commented May 10, 2017

@kewlcoder I replied to the same question in #1301

@jayantj
Contributor

jayantj commented May 16, 2017

The issue with loading the French wiki model is most likely due to a FastText bug, reported in facebookresearch/fastText#218.

The issue with loading the latest FastText models (including the Hebrew model) is due to a change in the way the new models are stored, and will be fixed in #1319

@menshikh-iv
Contributor

Fixed in #1341 & #1319
