
mismatch in vec and bin files in french pretrained vector #218

Closed
prakhar2b opened this issue May 16, 2017 · 2 comments

prakhar2b commented May 16, 2017

For all other pretrained vector models, the vocab_size obtained from the .vec file is equal to the size and nwords obtained from the .bin file (this line).

But for wiki.fr, vocab_size is 1152449, size is 1152450, and nwords is 1152449. On further analysis, the additional vocabulary word turns out to be u'__label__', which is not present in the .vec file or in any other pretrained vector model.

This doesn't cause any bug in the fastText code, but I find it a little unusual. It would be really helpful if somebody could provide an insight or explanation behind this.

Note: this matters because it is often more convenient to load the vectors from the .vec file and the additional parameters from the .bin file. This sort of mismatch causes unnecessary complexity in that code.
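For reference, the check amounts to something like the following minimal sketch (the file name is a placeholder, and only the plain-text .vec format is read here, not the .bin):

```python
# Minimal sketch of the consistency check involved (assumes the standard
# fastText .vec text format, whose first line is "<vocab_size> <dim>";
# 'wiki.fr.vec' is a placeholder path).
with open('wiki.fr.vec', encoding='utf-8') as f:
    vocab_size, dim = map(int, f.readline().split())
    words = [line.split(' ', 1)[0] for line in f]

print(vocab_size, len(words))        # equal for a self-consistent .vec file
print('__label__' in set(words))     # False here -- the extra .bin entry never appears
```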


jayantj commented May 16, 2017

I looked into this further - the fact that the additional word was u'__label__' seemed a little suspicious, since that is also used in the input data to FastText supervised models to denote a label.

So I'm assuming the French wiki has the term __label__ somewhere. The threshold method of Dictionary checks whether a word begins with the term __label__, and if it does, the word is marked as entry_type::label (as opposed to entry_type::word). When a "word" is added, nwords in the dictionary is incremented; when a "label" is added, nlabels is incremented; size is incremented either way.

The actual term __label__ is therefore marked as a label, and ignored while serializing the vectors to the .vec file. This is also the reason for a mismatch between size and nwords even in unsupervised models.
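To make the bookkeeping concrete, here is a simplified Python sketch of the behaviour described above (not the actual fastText C++, just the counting logic):

```python
# Simplified sketch of the counting behaviour described above (not the real
# fastText code): entries whose word starts with '__label__' are counted as
# labels, everything else as words; 'size' grows either way, and only
# word-type entries are written to the .vec file.
def add_entry(dictionary, word):
    entry_type = 'label' if word.startswith('__label__') else 'word'
    dictionary['entries'].append((word, entry_type))
    dictionary['size'] += 1
    if entry_type == 'word':
        dictionary['nwords'] += 1
    else:
        dictionary['nlabels'] += 1

d = {'entries': [], 'size': 0, 'nwords': 0, 'nlabels': 0}
for w in ['le', 'monde', '__label__']:   # suppose '__label__' literally appears in the corpus
    add_entry(d, w)

# Only 'word' entries get serialized to the .vec file:
vec_words = [w for w, t in d['entries'] if t == 'word']
print(d['size'], d['nwords'], len(vec_words))   # 3 2 2 -- size exceeds nwords by one
```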

I'm surprised the term __label__ actually exists in the input training data for wiki.fr though.

We've written a Python wrapper as part of Gensim to allow users to load FastText models and use word vector functionality already present in gensim, and this bug was affecting some of our users - piskvorky/gensim#1236.

Is this likely to be fixed in the near future, or is it too niche? If so, we don't mind adding a workaround in our wrapper.
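If not, the workaround on our side could be as simple as dropping label-type entries before reconciling the two vocabularies - a hypothetical sketch, where bin_vocab and vec_vocab stand for the word lists already parsed from the two files:

```python
# Hypothetical workaround sketch: when the .bin vocabulary has extra entries,
# drop anything starting with '__label__' before comparing against the .vec
# vocabulary. 'bin_vocab' and 'vec_vocab' are placeholder lists of words.
LABEL_PREFIX = '__label__'

def reconcile(bin_vocab, vec_vocab):
    filtered = [w for w in bin_vocab if not w.startswith(LABEL_PREFIX)]
    assert len(filtered) == len(vec_vocab), "vocab mismatch even after dropping label entries"
    return filtered
```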


cpuhrsch commented Jul 4, 2017

Hello @prakhar2b,

This issue has recently been resolved. We updated the models and .vec files to resolve this mismatch. Now all the vectors in both the bin+text and text versions should match. Please feel free to reopen at any time if that is not the case.

Thanks,
Christian
