load_vectors should allow whitespace tokens for Gensim compatibility #737

Closed
danielhers opened this issue Jan 12, 2017 · 3 comments
Labels
enhancement Feature requests and improvements

Comments

Contributor

danielhers commented Jan 12, 2017

I am calling load_vectors on the text file produced from the following:

I am getting:

spacy.vocab.VectorReadError: Error reading word vectors from <_io.TextIOWrapper name='word_vectors/GoogleNews-vectors-negative300.txt' mode='r' encoding='UTF-8'> on line 431879.
All vectors must be the same size.
Prev size: 300
Curr size: 299

I looked at line 431880 (the line number in the error message above is zero-based), and it indeed contains only 300 elements instead of 301 (the token plus 300 numbers), because the "token" there is just a space.
You could say the file is faulty for that reason, but since it's widely used I think spaCy should be able to handle it.
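
To illustrate the failure described above, here is a minimal sketch of how a whitespace split loses a token that is itself a space, and one possible way to keep it. The function names `parse_naive` and `parse_keep_whitespace` are hypothetical and this is not necessarily how spaCy or the referenced commit handles it; it also assumes the vector dimension is known in advance and that fields are separated by single spaces.

```python
def parse_naive(line):
    # Plain whitespace split: a token made only of whitespace disappears,
    # so the first number is mistaken for the token.
    pieces = line.split()
    return pieces[0], [float(x) for x in pieces[1:]]


def parse_keep_whitespace(line, dim):
    # Split from the right exactly `dim` times on a single space.
    # Whatever remains on the left is the token, even if it is " ".
    pieces = line.rstrip("\n").rsplit(" ", dim)
    if len(pieces) != dim + 1:
        raise ValueError("expected a token followed by %d numbers" % dim)
    return pieces[0], [float(x) for x in pieces[1:]]


# A line whose token is a single space, followed by a 3-dimensional vector.
line = "  0.1 0.2 0.3\n"

naive_token, naive_vec = parse_naive(line)
token, vec = parse_keep_whitespace(line, dim=3)

print(repr(naive_token), len(naive_vec))
print(repr(token), len(vec))
```

With the naive split the vector comes out one element short, which matches the "Prev size: 300 / Curr size: 299" pattern in the error above.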

  • Python Version Used: 3.5
  • spaCy Version Used: 1.5.0

honnibal added the enhancement (Feature requests and improvements) label Jan 12, 2017
Member

honnibal commented Jan 12, 2017

That function's vector format should be compatible with Gensim, including all edge cases.

Want to make the pull request? Should be quite easy to make the test.

Btw, I highly recommend saving to binary format after you've loaded once; it'll be much faster to load.
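
As a rough illustration of why a binary cache loads faster than text, the sketch below writes vectors as fixed-width float32 records and reads them back without any string-to-float parsing. The file layout and the `save_binary` / `load_binary` helpers are made up for this example; this is not spaCy's binary format or API.

```python
import os
import struct
import tempfile


def save_binary(path, vectors):
    # vectors: list of (token, list_of_floats), all with the same length.
    dim = len(vectors[0][1])
    with open(path, "wb") as f:
        # Header: number of vectors and their dimension (little-endian).
        f.write(struct.pack("<II", len(vectors), dim))
        for token, values in vectors:
            raw = token.encode("utf8")
            f.write(struct.pack("<I", len(raw)))  # token length in bytes
            f.write(raw)
            f.write(struct.pack("<%df" % dim, *values))  # float32 values


def load_binary(path):
    vectors = []
    with open(path, "rb") as f:
        count, dim = struct.unpack("<II", f.read(8))
        for _ in range(count):
            (n,) = struct.unpack("<I", f.read(4))
            token = f.read(n).decode("utf8")
            values = list(struct.unpack("<%df" % dim, f.read(4 * dim)))
            vectors.append((token, values))
    return vectors


# Values chosen to be exactly representable as float32.
sample = [("cat", [0.5, -1.25, 2.0]), (" ", [1.0, 0.0, -0.5])]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.bin")
    save_binary(path, sample)
    restored = load_binary(path)

print(restored == sample)
```

Because the token length is stored explicitly, a whitespace-only token such as " " round-trips without any special handling.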

honnibal changed the title from "Cannot load GoogleNews-vectors-negative300" to "load_vectors should allow whitespace tokens for Gensim compatibility" Jan 12, 2017
honnibal added a commit that referenced this issue Jan 12, 2017
Fix #737: support loading word vectors with " " as a word

irfan-zoefit commented Jun 21, 2017

I'm getting the same issue after compiling spaCy from source.
I created the vectors with spacy-dev-resources/training, which successfully produces them in both text and binary format. Here is the error output:

File "spacy/vocab.pyx", line 553, in spacy.vocab.Vocab.load_vectors (spacy/vocab.cpp:10950)
spacy.vocab.VectorReadError: Error reading word vectors from <_io.TextIOWrapper name=u'zoe_models/zoe-ner/vocab/vectors.txt' encoding='UTF-8'> on line 1.
All vectors must be the same size.
Prev size: 1
Curr size: 0

Info about spaCy

Python version      2.7.12
Platform            Linux-4.4.0-79-generic-x86_64-with-Ubuntu-16.04-xenial
spaCy version       1.8.2
Installed models    en, en_default
Location            XX/spaCy/spacy


lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 8, 2018