load_vectors should allow whitespace tokens for Gensim compatibility #737

danielhers · 2017-01-12T08:34:49Z

I am calling load_vectors on the text file produced from the following:

1. Download GoogleNews-vectors-negative300.bin.gz
2. Use Gensim to convert it to text (this needs about 8 GB of RAM and disk space): http://stackoverflow.com/a/33183634/223267

I am getting:

spacy.vocab.VectorReadError: Error reading word vectors from <_io.TextIOWrapper name='word_vectors/GoogleNews-vectors-negative300.txt' mode='r' encoding='UTF-8'> on line 431879.
All vectors must be the same size.
Prev size: 300
Curr size: 299

I looked at line 431880 (the error message above is zero-based), and indeed there are just 300 elements in it (instead of 301 for the token + numbers) because the "token" there is just a space...
You could say the file is faulty for that reason, but since it's widely used I think spaCy should be able to handle it.
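
To make the failure concrete, here is a minimal sketch of why a whitespace token breaks a naive reader and how splitting from the right avoids it. It is not spaCy's actual fix: `parse_vector_line` is a hypothetical helper, and it assumes fields are separated by single spaces and that the vector width is already known (for example from the file header or a previous line).

```python
def parse_vector_line(line, width):
    """Split one text-format line into (token, vector), given the vector width.

    Hypothetical helper for illustration; assumes single-space separators.
    """
    # Strip only the line terminator, so a token made of spaces survives.
    line = line.rstrip("\r\n")
    # Split from the right: the last `width` fields are the numbers, and
    # whatever remains on the left is the token, even if it is whitespace.
    pieces = line.rsplit(" ", width)
    if len(pieces) != width + 1:
        raise ValueError(
            "expected %d values, found %d" % (width, len(pieces) - 1)
        )
    token = pieces[0]
    vector = [float(x) for x in pieces[1:]]
    return token, vector


# A line whose token is a single space, followed by a 3-dimensional vector.
space_line = "  0.1 0.2 0.3\n"

# Naive splitting drops the token, so the first number is read as the word
# and the vector comes out one element short.
naive_fields = space_line.split()

# Splitting from the right keeps the space as the token.
token, vector = parse_vector_line(space_line, 3)
```

With the naive split, `naive_fields` has 3 entries instead of 4, which mirrors the "Prev size: 300, Curr size: 299" report above.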

Python Version Used: 3.5
spaCy Version Used: 1.5.0


honnibal · 2017-01-12T09:47:27Z

That function should be compatible with the Gensim vector format, including all edge cases.

Want to make the pull request? Should be quite easy to make the test.

Btw I highly recommend saving to binary format after you've loaded once --- it'll be much faster to load.


irfan-zoefit · 2017-06-21T08:27:09Z

I'm getting the same issue after compiling spaCy from source.
I created the vectors with spacy-dev-resources/training, which successfully writes them in both text and binary format. Here is the error output:

File "spacy/vocab.pyx", line 553, in spacy.vocab.Vocab.load_vectors (spacy/vocab.cpp:10950)
spacy.vocab.VectorReadError: Error reading word vectors from <_io.TextIOWrapper name=u'zoe_models/zoe-ner/vocab/vectors.txt' encoding='UTF-8'> on line 1.
All vectors must be the same size.
Prev size: 1
Curr size: 0
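
The thread does not say what is on the offending line in this second report. One way to find out is to scan the text file for lines whose field count differs from the first line. The sketch below is a diagnostic aid only: `find_odd_lines` is a hypothetical helper, it uses the same naive whitespace split that trips on space tokens, and it reports zero-based line indices to match the error message.

```python
import io


def find_odd_lines(handle, limit=10):
    """Return (zero_based_index, field_count) pairs for lines whose
    whitespace-split field count differs from that of the first line.

    Hypothetical helper for illustration; stops after `limit` findings.
    """
    expected = None
    odd = []
    for index, line in enumerate(handle):
        count = len(line.split())
        if expected is None:
            # The first line sets the reference field count.
            expected = count
        elif count != expected:
            odd.append((index, count))
            if len(odd) >= limit:
                break
    return odd


# Small in-memory stand-in for a vectors file; line 1 has a space as its token.
sample = io.StringIO("a 1 2\n  1 2\nb 3 4\n")
odd_lines = find_odd_lines(sample)
```

On a real file, pass an open file handle instead of the `io.StringIO` sample. If the file starts with a word2vec-style header line, every data line will be flagged, which is itself a useful hint about the format.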

Info about spaCy

Python version      2.7.12
Platform            Linux-4.4.0-79-generic-x86_64-with-Ubuntu-16.04-xenial
spaCy version       1.8.2
Installed models    en, en_default
Location            XX/spaCy/spacy

lock · 2018-05-08T20:38:17Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

honnibal added the enhancement Feature requests and improvements label Jan 12, 2017

honnibal changed the title ~~Cannot load GoogleNews-vectors-negative300~~ load_vectors should allow whitespace tokens for Gensim compatibility Jan 12, 2017

honnibal closed this as completed in 99eb494 Jan 12, 2017

honnibal added a commit that referenced this issue Jan 12, 2017

Merge pull request #738 from danielhers/master

a6d7147

Fix #737: support loading word vectors with " " as a word

raphael0202 mentioned this issue Feb 16, 2017

load_vectors should accept arbitrary space characters as word tokens #834

Closed

lock bot locked as resolved and limited conversation to collaborators May 8, 2018
