
FastText native VS original, different outputs #1940

Closed
menshikh-iv opened this issue Feb 28, 2018 · 12 comments
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills

Comments

@menshikh-iv
Contributor

Intro

As a user reported on the mailing list, a pre-trained model gives different results with the gensim code and with the original Facebook code.

How to reproduce

  1. Install Facebook FastText
    wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
    unzip v0.1.0.zip
    cd fastText-0.1.0
    make
  2. Download pre-trained vectors from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md (I pick English, only as an example); you need both the bin and text links
  3. Unpack archive with vectors
  4. Try to retrieve the vectors with gensim
    from gensim.models import FastText
    
    m = FastText.load_fasttext_format("wiki.en.bin")  # load the .bin, which contains the ngram vectors
    print(m["hello"])  # in-vocabulary word
    print(m["someundefinedword"])  # out-of-vocabulary word
  5. Repeat it with original FastText implementation
    ./fasttext  print-word-vectors ../wiki.en.bin
    hello
    someundefinedword
  6. Compare the vectors from (4) and (5)
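For the comparison in the last step, an elementwise tolerance check is more reliable than eyeballing printed floats. A minimal sketch assuming numpy, with toy vectors standing in for the real 300-d outputs (the helper name `vectors_match` is hypothetical):

```python
import numpy as np

def vectors_match(v1, v2, atol=1e-5):
    """Return True if two word vectors agree elementwise within tolerance."""
    v1 = np.asarray(v1, dtype=np.float32)
    v2 = np.asarray(v2, dtype=np.float32)
    return v1.shape == v2.shape and np.allclose(v1, v2, atol=atol)

# Toy stand-ins for the gensim vector vs. the Facebook binary's output
a = [0.1, -0.2, 0.3]
b = [0.1, -0.2, 0.3]
c = [0.1, -0.2, 0.9]
print(vectors_match(a, b))  # True
print(vectors_match(a, c))  # False
```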

Expected Results

The vectors for "hello" and "someundefinedword" are exactly the same from gensim and from Facebook's implementation

Actual result

Exactly the same vectors for "hello", but different vectors for "someundefinedword"

CC: @manneshiva

@menshikh-iv menshikh-iv added bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills labels Feb 28, 2018
@manneshiva
Contributor

manneshiva commented Mar 1, 2018

This behavior is expected, as pointed out by this comment in the unit test file. The vector for an OOV word in gensim is likely to differ slightly from the vector the original fastText implementation produces for the same OOV word. This is because the gensim code discards unused ngram vectors (to save memory), while the original implementation keeps all the buckets (and hence all ngrams). So it is possible that an OOV word contains a few ngrams whose vectors are missing after the discarding. Such a case is highly unlikely (depending on the bucket size and vocab size) after this PR #1916 (merged after the creation of this issue).
P.S.: The current code does not explicitly store the ngrams anymore, but only the hashes. It also discards unused hashes.
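The ngrams in question are character n-grams with `<`/`>` word-boundary markers. A short sketch (the helper `char_ngrams` is hypothetical; the defaults mirror fastText's `minn=3`, `maxn=6`) shows how an OOV word can contain n-grams that no vocabulary word does, which are exactly the vectors the pruning removed:

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams the way fastText computes them, with < > boundaries."""
    w = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams

vocab_grams = set(char_ngrams("hello"))   # pretend the vocab is just "hello"
oov_grams = set(char_ngrams("helzo"))     # an OOV word
# n-grams of the OOV word that no vocabulary word contains -- their
# (randomly initialized, never trained) vectors are what gets discarded
print(oov_grams - vocab_grams)
```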

@menshikh-iv
Contributor Author

Thanks for the detailed explanation @manneshiva 👍

@gojomo
Collaborator

gojomo commented Mar 5, 2018

That we might throw out never-encountered (and thus never-trained) n-grams might be an acceptable optimization, for the case where a model is fully trained inside gensim. (After all, if needing arbitrary random untrained vectors for those n-grams later, they can be created later.)

But, to discard them from an externally-trained, loaded, static model, and thus get different vectors than the original FastText in what should be a completely deterministic process, strikes me as a deviation from expected behavior, and thus a bug, despite any explanation. Your thoughts, @piskvorky?

@piskvorky
Owner

piskvorky commented Mar 5, 2018

In general we want to stick to the original, yes.

@manneshiva what does "unused" mean for externally loaded models? How much do we gain by modifying the static model's default behaviour there?

I don't have enough intuition about the trade-offs, the pros and cons.

@phdowling
Contributor

I guess the expected behavior is that, if I load a model, all the vectors in that model are actually used in the end. I think deleting parts of a loaded binary model is not ideal, especially because it doesn't seem to be clearly pointed out anywhere. So the cons seem relatively clear - you are loading a model that may have been evaluated to have a certain quality, but under gensim you are no longer guaranteed that the quality will hold up. The pros are hard to judge for me, I'm not sure how hard it would be to adapt the code so this doesn't happen.

And like @piskvorky I'm also interested where the determination of "unused" subword vectors is actually made, since there's no training corpus in this setting.

Are there currently plans to change this?

@manneshiva
Contributor

manneshiva commented Apr 5, 2018

  1. In order to reduce the size of the ngrams matrix, the original fastText code (by Facebook) does not store a unique vector for each ngram, but maps each ngram to a bucket using a hash function. The size of the ngrams matrix is thus limited to bucket_size x vector_size. Due to this, there is no guarantee that all the vectors corresponding to the buckets are used/trained. The actual number of vectors (from the ngrams matrix) that are trained/updated depends on the bucket size (default: 2 million) and the ngrams encountered in the training corpus.

  2. Since the only ngram vectors that are actually trained are those of the ngrams of in-vocabulary words, "unused" buckets here refer to the hashes (bucket indices) that do not correspond to the hash of any ngram of an in-vocab word.

  3. There has been some discussion around this trade-off between reducing memory usage of the model (without compromising the quality of word vectors) vs deviating from "deterministic/expected behavior". The decision to discard "unused" ngram buckets was a part of the FastText wrapper in Gensim based on the memory profiling result seen in this comment. Also look at the discussions/memory profiling results in this issue.

  4. IMO, it is best to provide the user with an option/parameter to either discard unused vectors (without loss in quality of word vectors) or not.
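For reference, the bucket index for an n-gram comes from an FNV-1a hash of its UTF-8 bytes, taken modulo the bucket count. A Python sketch of that mapping (caveat: fastText's C++ casts each byte to signed `int8_t`, so this matches only for ASCII input; non-ASCII is where the unicode-handling differences of #2059 come in):

```python
def ft_hash(s):
    """FNV-1a hash over UTF-8 bytes, as fastText's dictionary code uses it."""
    h = 2166136261          # FNV-1a 32-bit offset basis
    for b in s.encode("utf-8"):
        h ^= b              # fastText XORs a sign-extended int8_t here;
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, kept in 32 bits
    return h

bucket = 2_000_000          # fastText's default bucket count
for gram in ("<he", "hel", "llo>"):
    print(gram, "-> bucket", ft_hash(gram) % bucket)
```

Two different n-grams can land in the same bucket, which is why a bucket counts as "used" if any in-vocab n-gram hashes to it.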

@piskvorky
Owner

piskvorky commented Apr 5, 2018

Ah, if it's just untrained garbage, we should definitely discard that. If someone relies on an exact reproducibility from random/untrained vectors, their app is broken anyway. That's not the kind of compatibility we care about.

@manneshiva I assume the determination of used/unused is straightforward? Or is any guessing/heuristics involved, any room for error?

@gojomo
Collaborator

gojomo commented Apr 5, 2018

I would think that if some other FastText implementation (like the original) saves out 'untrained garbage' in its serialized model, and loads-and-uses such noise in its subsequent OOV calculations after re-loading, so that it affects (reproducible) evaluations of the frozen model, then we're not really format compatible if we make the independent decision to discard that noise. And, we'll get a continuing tail of "what's up with this?" questions/bug-reports from that decision to be arbitrarily different in how we load a (frozen, completed, original-tool) model.

@jayantj
Contributor

jayantj commented Apr 6, 2018

Along with reproducibility issues, another point of discussion over this was that there is a memory/speed trade-off involved (during loading). Quoting from a previous comment -

"For relatively small vocab sizes (~200k), the steady-state memory usage is 1.1 GB lower than it would be if we chose to keep all ngram vectors as is. (for 300-d vectors). This is at the cost of significantly increased loading time.

Conversely, for large vocab sizes (like for Wikipedia models), we don't reduce memory usage, while also causing much higher load times. (as @gojomo rightly pointed out)

In case the common use case is indeed loading large models, it might make sense to store ngram vectors as is, without trying to discard any unused ones."

For this, Shiva proposed this solution here, which, IMO, gives us the best of both worlds, while also providing exactly the same behaviour by default as the original FastText, which should reduce the follow-up questions/bug reports we get about this. The solution as @manneshiva described it -

"I feel we should give the user an option -- discard_unused_ngrams to save memory, which by default could be False. Since the memory saved for small vocab models is significant (owing to a fewer number of total used ngrams), this should be helpful for a user trying to load a small vocab model with limited RAM."

@manneshiva @piskvorky @gojomo @menshikh-iv do you see any potential issues with this approach? If not, I think we should go ahead with it.

@gojomo has raised a valid concern that, in theory, our heuristic for determining unused vectors could fail if the serialized model has had its vocabulary trimmed after training and before serializing. However, FastText doesn't do anything of the sort (and neither does any other model in my experience), so I think it's a reasonable assumption to make. A note/info-level log in the code could be useful.

@menshikh-iv
Contributor Author

menshikh-iv commented Apr 6, 2018

I agree with #1261 (comment) and @jayantj (looks like the best variant).

We also definitely need to update the documentation in this place (to avoid user confusion).

@Alice-Ke

@manneshiva @gojomo Hi, thanks for your answers. I am encountering a similar problem to the author's. I am confused about where these 'never-encountered (and thus never-trained) n-grams' come from. Are they initialized as a random vector for each bucket at the beginning? Thanks!

@leezu

leezu commented Aug 28, 2018

@Alice-Ke yes, all ngram vectors are initialized randomly before training. The gensim implementation throws away the ones that are not used by any of the known words. However, if you then look up the vector for a previously unknown word which contains some such ngram, gensim will return wrong results, as it threw away the respective ngram vector. Further differences likely come from possibly incorrect handling of unicode characters: #2059
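Concretely, an OOV word's vector is built by averaging the vectors of its n-grams, so dropping even one n-gram vector changes the average and hence the returned vector. A toy sketch with a random stand-in table (all names and the 4-d vectors here are illustrative, not gensim's API):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
# toy n-gram vector table: ngram -> its (trained) vector
ngram_vecs = {g: rng.normal(size=dim) for g in ("<so", "som", "ome", "me>")}

def oov_vector(grams, table):
    """OOV lookup: average the vectors of the n-grams present in the table."""
    hits = [table[g] for g in grams if g in table]
    return np.mean(hits, axis=0) if hits else np.zeros(dim)

grams = ["<so", "som", "ome", "me>"]
full = oov_vector(grams, ngram_vecs)

pruned = dict(ngram_vecs)
del pruned["ome"]               # simulate gensim discarding an "unused" bucket
trimmed = oov_vector(grams, pruned)

print(np.allclose(full, trimmed))  # averages over 4 vs. 3 vectors differ
```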
