
mismatch in vec and bin files in french pretrained vector #218

Closed
prakhar2b opened this issue May 16, 2017 · 2 comments

prakhar2b commented May 16, 2017

For all other pretrained vector models, the vocab_size obtained from the .vec file is equal to the size and nwords obtained from the .bin file (this line).

But for wiki.fr, vocab_size is 1152449, size is 1152450, and nwords is 1152449. On further analysis, the additional vocabulary word turns out to be u'__label__', which is not present in the .vec file or in any other pretrained vector model.

This doesn't cause any bug in the fastText code, but I find it a little unusual. It would be really helpful if somebody could provide an insight or explanation behind this.

Note: this matters because it is often more convenient to load the vectors from the .vec file and the additional parameters from the .bin file. This sort of mismatch causes unnecessary complexity in that code.
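For reference, the check amounts to something like the following minimal sketch (the file name is a placeholder, and only the plain-text .vec format is read here, not the .bin):

```python
# Minimal sketch of the consistency check involved (assumes the standard
# fastText .vec text format, whose first line is "<vocab_size> <dim>";
# 'wiki.fr.vec' is a placeholder path).
with open('wiki.fr.vec', encoding='utf-8') as f:
    vocab_size, dim = map(int, f.readline().split())
    words = [line.split(' ', 1)[0] for line in f]

print(vocab_size, len(words))        # equal for a self-consistent .vec file
print('__label__' in set(words))     # False here -- the extra .bin entry never appears
```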


jayantj commented May 16, 2017

I looked into this further - the fact that the additional word was u'__label__' seemed a little suspicious, since that is also used in the input data to FastText supervised models to denote a label.

So I'm assuming the French wiki has the term __label__ somewhere. The threshold method of Dictionary checks whether a word begins with the term __label__, and if it does, the word is marked as entry_type::label (as opposed to entry_type::word). When a "word" is added, nwords in the dictionary is incremented; when a "label" is added, nlabels is incremented; size is incremented either way.

The actual term __label__ is therefore marked as a label, and ignored while serializing the vectors to the .vec file. This is also the reason for a mismatch between size and nwords even in unsupervised models.
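To make the bookkeeping concrete, here is a simplified Python sketch of the behaviour described above (not the actual fastText C++, just the counting logic):

```python
# Simplified sketch of the counting behaviour described above (not the real
# fastText code): entries whose word starts with '__label__' are counted as
# labels, everything else as words; 'size' grows either way, and only
# word-type entries are written to the .vec file.
def add_entry(dictionary, word):
    entry_type = 'label' if word.startswith('__label__') else 'word'
    dictionary['entries'].append((word, entry_type))
    dictionary['size'] += 1
    if entry_type == 'word':
        dictionary['nwords'] += 1
    else:
        dictionary['nlabels'] += 1

d = {'entries': [], 'size': 0, 'nwords': 0, 'nlabels': 0}
for w in ['le', 'monde', '__label__']:   # suppose '__label__' literally appears in the corpus
    add_entry(d, w)

# Only 'word' entries get serialized to the .vec file:
vec_words = [w for w, t in d['entries'] if t == 'word']
print(d['size'], d['nwords'], len(vec_words))   # 3 2 2 -- size exceeds nwords by one
```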

I'm surprised the term __label__ actually exists in the input training data for wiki.fr though.

We've written a Python wrapper as part of Gensim to allow users to load FastText models and use word vector functionality already present in gensim, and this bug was affecting some of our users - piskvorky/gensim#1236.

Is this likely to be fixed in the near future, or is it too niche? If so, we don't mind adding a workaround in our wrapper.
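If not, the workaround on our side could be as simple as dropping label-type entries before reconciling the two vocabularies - a hypothetical sketch, where bin_vocab and vec_vocab stand for the word lists already parsed from the two files:

```python
# Hypothetical workaround sketch: when the .bin vocabulary has extra entries,
# drop anything starting with '__label__' before comparing against the .vec
# vocabulary. 'bin_vocab' and 'vec_vocab' are placeholder lists of words.
LABEL_PREFIX = '__label__'

def reconcile(bin_vocab, vec_vocab):
    filtered = [w for w in bin_vocab if not w.startswith(LABEL_PREFIX)]
    assert len(filtered) == len(vec_vocab), "vocab mismatch even after dropping label entries"
    return filtered
```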


cpuhrsch commented Jul 4, 2017

Hello @prakhar2b,

This issue has recently been resolved. We updated the models and .vec files to resolve this mismatch. Now all the vectors in both the bin+text and text versions should match. Please feel free to reopen at any time if that is not the case.

Thanks,
Christian
