
Custom glove vectors throw tuple index out of range error #1831

Closed
samrensenhouse opened this issue Jan 12, 2018 · 6 comments
Labels
bug Bugs and behaviour differing from documentation

Comments

samrensenhouse commented Jan 12, 2018

I tried loading some custom GloVe vectors trained with the demo provided here:
https://github.com/stanfordnlp/GloVe/blob/master/demo.sh

I then made a directory called vectors containing the resulting vectors.50.d.bin as well as vectors.txt.

However, when I run the code below I get IndexError: tuple index out of range:

import spacy

parser = spacy.load('en_core_web_sm')
# Use a raw string so the backslashes in the Windows path aren't treated as escapes
parser.vocab.vectors.from_glove(r'C:\dev\glovepy\vectors')
spacy_doc = parser('I am happy.')
for word in spacy_doc:
    print(word.vector)

Info about spaCy

  • spaCy version: 2.0.5
  • Platform: Windows-10-10.0.16299-SP0
  • Python version: 3.6.3
  • Models: en
@honnibal honnibal added the bug Bugs and behaviour differing from documentation label Jan 12, 2018

fabiocapsouza commented Jan 17, 2018

I'm experiencing the same issue.
I downloaded a trained GloVe model for Portuguese from this repository. It comes as a single .txt file, so I loaded it with gensim's KeyedVectors and converted it to binary format, together with a vocab.txt file, using this command:

word_vectors.save_word2vec_format('vectors.50.f.bin', fvocab='vocab.txt', binary=True)

Then I loaded it into spaCy:

nlp = spacy.load('pt')
nlp.vocab.vectors.from_glove('/path/to/vectors')

The error happens if I try to read has_vector or vector properties.

Info about spaCy

  • spaCy version: 2.0.5
  • Platform: Ubuntu 16.04
  • Python version: 3.6.4
  • Model: pt
  • GloVe model: GloVe 50 dimensions


ZackKorman commented Jan 18, 2018

I think the problem is in self.data.shape[0] * self.data.shape[1]: the loaded GloVe array has shape (some_num,), so accessing self.data.shape[1] raises the tuple index out of range error. I don't have a fix for this, though.
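The diagnosis above can be reproduced with plain NumPy (a minimal sketch; the array sizes are made up, not taken from any actual GloVe file):

```python
import numpy as np

# A flat array as np.fromfile would return it: shape is (6,), one dimension.
flat = np.arange(6, dtype='float32')
assert flat.shape == (6,)

# Accessing shape[1] on a 1D array raises IndexError: tuple index out of range.
try:
    flat.shape[0] * flat.shape[1]
except IndexError as exc:
    print(exc)  # tuple index out of range

# Reshaping to (rows, width) restores a 2D table of vectors.
width = 3
table = flat.reshape((flat.size // width, width))
assert table.shape == (2, 3)
```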

@imranarshad

I'm having the same issue.
@honnibal, is there any workaround until it gets fixed?

@honnibal
Member

Thanks for the report, especially @Lankey22 for the suggestion.

Perhaps we need this in from_glove()?

if self.data.ndim == 1:
    self.data = self.data.reshape((self.data.size // width, width))

If so, the following mitigation should work until the next version:

nlp = spacy.load('pt')
nlp.vocab.vectors.from_glove('/path/to/vectors')
width = 50  # the dimensionality of your vectors
if nlp.vocab.vectors.data.ndim == 1:
    nlp.vocab.vectors.data = nlp.vocab.vectors.data.reshape(
        (nlp.vocab.vectors.data.size // width, width))

You'll need to know the width of the vectors you're loading.
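The reshape step can be illustrated on a stand-alone array (a hypothetical stand-in for nlp.vocab.vectors.data, since running the real thing requires a spaCy model and GloVe files on disk):

```python
import numpy as np

# Stand-in for nlp.vocab.vectors.data after from_glove():
# 3 vectors of width 4, flattened into shape (12,).
width = 4
data = np.random.rand(12).astype('float32')

# The workaround: fold the flat buffer back into a (rows, width) table.
if data.ndim == 1:
    data = data.reshape((data.size // width, width))

# Row i is now the vector for the i-th key, as spaCy expects.
assert data.shape == (3, 4)
```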


fako commented Feb 5, 2018

I also ran into this issue and I'm using the same workaround. I find it odd that from_glove uses numpy.fromfile: the documentation states that tofile and fromfile are not suitable for data storage: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.fromfile.html

If you used np.load instead, it would load a 2D array if the data were stored as one; np.fromfile always returns a 1D array. I'm not 100% sure how GloVe's binary format is laid out, but I would expect a 2D array. I'm loading word2vec embeddings myself and saved the conversion as a 2D array.
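The difference can be demonstrated with NumPy alone: tofile/fromfile write raw bytes and drop the shape and dtype metadata, while np.save/np.load preserve them via the .npy header (a minimal sketch with a tiny made-up array):

```python
import os
import tempfile

import numpy as np

arr = np.arange(6, dtype='float32').reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    # tofile() writes raw bytes only: no shape or dtype header survives,
    # so fromfile() can only return a flat 1D array.
    raw_path = os.path.join(tmp, 'vectors.bin')
    arr.tofile(raw_path)
    loaded_raw = np.fromfile(raw_path, dtype='float32')
    print(loaded_raw.shape)  # (6,) -- the 2D shape is lost

    # np.save() writes an .npy header, so np.load() restores the shape.
    npy_path = os.path.join(tmp, 'vectors.npy')
    np.save(npy_path, arr)
    loaded_npy = np.load(npy_path)
    print(loaded_npy.shape)  # (2, 3)
```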

Another thing that strikes me is that the documentation says the dtype in the file format should be either 'f' or 'd'. That means any file read this way gets flattened by np.ascontiguousarray, because neither string equals 'float32'. After flattening, it would be reshaped back into a 2D array. The relevant line is here:

if dtype != 'float32':

I might have made some wrong assumptions, but it seems to me that this code is not running as efficiently as it could. It would be great to hear why certain choices were made. I love working with spaCy and hope it gets even better in the future :)


lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018