load_vectors should allow whitespace tokens for Gensim compatibility #737
Comments
That function's vector format should be compatible with Gensim, including all edge cases. Want to make the pull request? It should be quite easy to write the test. Btw, I highly recommend saving to binary format after you've loaded once; it'll be much faster to load.
Fix #737: support loading word vectors with " " as a word
I'm getting the same issue after compiling spaCy from source:
File "spacy/vocab.pyx", line 553, in spacy.vocab.Vocab.load_vectors (spacy/vocab.cpp:10950)
spacy.vocab.VectorReadError: Error reading word vectors from <_io.TextIOWrapper name=u'zoe_models/zoe-ner/vocab/vectors.txt' encoding='UTF-8'> on line 1.
All vectors must be the same size.
Prev size: 1
Curr size: 0
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I am calling load_vectors on the text file produced from the following:
I am getting:
I looked at line 431880 (the error message above is zero-based), and indeed there are just 300 elements in it (instead of 301 for the token + numbers) because the "token" there is just a space...
You could say the file is faulty for that reason, but since it's widely used I think spaCy should be able to handle it.
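To make the failure mode concrete, here is a minimal sketch of one way a text-format line could be parsed so that a whitespace token survives. It assumes the vector width is known up front and that fields are separated by single spaces. The helper name parse_vector_line is hypothetical; this is not spaCy's or Gensim's actual parser.

```python
def parse_vector_line(line, width):
    """Split one text-format vector line into (token, values).

    Hypothetical helper for illustration only. Assumes `width` numbers
    per line, separated from each other and from the token by single spaces.
    """
    line = line.rstrip("\r\n")
    # Split from the right, at most `width` times, so everything left of
    # the numbers is kept verbatim as the token, even if it is whitespace.
    pieces = line.rsplit(" ", width)
    if len(pieces) != width + 1:
        raise ValueError(
            "expected a token plus %d values, got %d fields"
            % (width, len(pieces))
        )
    token = pieces[0]
    values = [float(p) for p in pieces[1:]]
    return token, values


if __name__ == "__main__":
    # An ordinary line: the token is "cat".
    print(parse_vector_line("cat 0.5 1.0 -2.0\n", 3))
    # A line whose token is a single space, followed by the separator.
    print(parse_vector_line("  0.5 1.0 -2.0\n", 3))
    # A plain whitespace split drops the space token entirely,
    # leaving 3 fields instead of 4.
    print(len("  0.5 1.0 -2.0".split()))
```

Splitting from the right relies on the numbers never containing spaces, while the token might. A line with no token and no separator at all still raises ValueError, so genuinely malformed files are not silently accepted.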