
FastText native VS original, different outputs #1940

Closed
menshikh-iv opened this issue Feb 28, 2018 · 12 comments
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills

Comments

@menshikh-iv
Contributor

Intro

As a user reported on the mailing list, a pre-trained model gives different results with the gensim code and with the original Facebook code.

How to reproduce

  1. Install Facebook FastText
    wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
    unzip v0.1.0.zip
    cd fastText-0.1.0
    make
  2. Download pre-trained vectors from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md (I pick English, only as an example); you need both the bin and text links
  3. Unpack archive with vectors
  4. Try to retrieve the vectors with gensim
    from gensim.models import FastText
    
    m = FastText.load_fasttext_format("wiki.en.bin")  # load the .bin, which contains the ngram vectors
    print(m["hello"])  # in-vocabulary word
    print(m["someundefinedword"])  # out-of-vocabulary word
  5. Repeat it with original FastText implementation
    ./fasttext  print-word-vectors ../wiki.en.bin
    hello
    someundefinedword
  6. Compare the vectors from (4) and (5)
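For the comparison in the last step, an elementwise tolerance check is more reliable than eyeballing printed floats. A minimal sketch assuming numpy, with toy vectors standing in for the real 300-d outputs (the helper name `vectors_match` is hypothetical):

```python
import numpy as np

def vectors_match(v1, v2, atol=1e-5):
    """Return True if two word vectors agree elementwise within tolerance."""
    v1 = np.asarray(v1, dtype=np.float32)
    v2 = np.asarray(v2, dtype=np.float32)
    return v1.shape == v2.shape and np.allclose(v1, v2, atol=atol)

# Toy stand-ins for the gensim vector vs. the Facebook binary's output
a = [0.1, -0.2, 0.3]
b = [0.1, -0.2, 0.3]
c = [0.1, -0.2, 0.9]
print(vectors_match(a, b))  # True
print(vectors_match(a, c))  # False
```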

Expected Results

The vectors for "hello" and "someundefinedword" are exactly the same from gensim and from Facebook's implementation

Actual result

Exactly the same vectors for "hello", but different vectors for "someundefinedword"

CC: @manneshiva

@menshikh-iv menshikh-iv added bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills labels Feb 28, 2018
@manneshiva
Contributor

manneshiva commented Mar 1, 2018

This behavior is expected, as pointed out by this comment in the unit test file. The vector for an OOV word in gensim is likely to differ slightly from the vector the original fastText implementation produces for the same OOV word. This is because the gensim code discards unused ngram vectors (to save memory), while the original implementation keeps all the buckets (and hence all ngrams). So it is possible that an OOV word contains a few ngrams whose vectors are missing after the discarding. Such a case is highly unlikely (depending on the bucket size and vocab size) after this PR #1916 (merged after the creation of this issue).
P.S.: The current code does not explicitly store the ngrams anymore, but only the hashes. It also discards unused hashes.
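The ngrams in question are character n-grams with `<`/`>` word-boundary markers. A short sketch (the helper `char_ngrams` is hypothetical; the defaults mirror fastText's `minn=3`, `maxn=6`) shows how an OOV word can contain n-grams that no vocabulary word does, which are exactly the vectors the pruning removed:

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams the way fastText computes them, with < > boundaries."""
    w = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams

vocab_grams = set(char_ngrams("hello"))   # pretend the vocab is just "hello"
oov_grams = set(char_ngrams("helzo"))     # an OOV word
# n-grams of the OOV word that no vocabulary word contains -- their
# (randomly initialized, never trained) vectors are what gets discarded
print(oov_grams - vocab_grams)
```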

@menshikh-iv
Contributor Author

Thanks for the detailed explanation @manneshiva 👍

@gojomo
Collaborator

gojomo commented Mar 5, 2018

That we might throw out never-encountered (and thus never-trained) n-grams might be an acceptable optimization, for the case where a model is fully trained inside gensim. (After all, if needing arbitrary random untrained vectors for those n-grams later, they can be created later.)

But, to discard them from an externally-trained, loaded, static model, and thus get different vectors than the original FastText in what should be a completely deterministic process, strikes me as a deviation from expected behavior, and thus a bug, despite any explanation. Your thoughts, @piskvorky?

@piskvorky
Owner

piskvorky commented Mar 5, 2018

In general we want to stick to the original, yes.

@manneshiva what does "unused" mean for externally loaded models? How much do we gain by modifying the static model's default behaviour there?

I don't have enough intuition about the trade-offs, the pros and cons.

@phdowling
Contributor

I guess the expected behavior is that, if I load a model, all the vectors in that model are actually used in the end. I think deleting parts of a loaded binary model is not ideal, especially because it doesn't seem to be clearly pointed out anywhere. So the cons seem relatively clear - you are loading a model that may have been evaluated to have a certain quality, but under gensim you are no longer guaranteed that the quality will hold up. The pros are hard to judge for me, I'm not sure how hard it would be to adapt the code so this doesn't happen.

And like @piskvorky I'm also interested where the determination of "unused" subword vectors is actually made, since there's no training corpus in this setting.

Are there currently plans to change this?

@manneshiva
Contributor

manneshiva commented Apr 5, 2018

  1. In order to reduce the size of the ngrams matrix, the original fastText code (by Facebook) does not store a unique vector for each ngram, but maps each ngram to a bucket using a hash function. The size of the ngrams matrix is thus limited to bucket_size x vector_size. Due to this, there is no guarantee that all the vectors corresponding to the buckets are used/trained. The actual number of vectors (from the ngrams matrix) that are trained/updated depends on the bucket size (default: 2 million) and the ngrams encountered in the training corpus.

  2. Since the only ngram vectors that are actually trained are those of the ngrams of in-vocabulary words, "unused" buckets here refer to the hashes (bucket indices) that do not correspond to the hash of any ngram of an in-vocab word.

  3. There has been some discussion around this trade-off between reducing memory usage of the model (without compromising the quality of word vectors) vs deviating from "deterministic/expected behavior". The decision to discard "unused" ngram buckets was a part of the FastText wrapper in Gensim based on the memory profiling result seen in this comment. Also look at the discussions/memory profiling results in this issue.

  4. IMO, it is best to provide the user with an option/parameter to either discard unused vectors (without loss in quality of word vectors) or not.
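For reference, the bucket index for an n-gram comes from an FNV-1a hash of its UTF-8 bytes, taken modulo the bucket count. A Python sketch of that mapping (caveat: fastText's C++ casts each byte to signed `int8_t`, so this matches only for ASCII input; non-ASCII is where the unicode-handling differences of #2059 come in):

```python
def ft_hash(s):
    """FNV-1a hash over UTF-8 bytes, as fastText's dictionary code uses it."""
    h = 2166136261          # FNV-1a 32-bit offset basis
    for b in s.encode("utf-8"):
        h ^= b              # fastText XORs a sign-extended int8_t here;
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, kept in 32 bits
    return h

bucket = 2_000_000          # fastText's default bucket count
for gram in ("<he", "hel", "llo>"):
    print(gram, "-> bucket", ft_hash(gram) % bucket)
```

Two different n-grams can land in the same bucket, which is why a bucket counts as "used" if any in-vocab n-gram hashes to it.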

@piskvorky
Owner

piskvorky commented Apr 5, 2018

Ah, if it's just untrained garbage, we should definitely discard that. If someone relies on an exact reproducibility from random/untrained vectors, their app is broken anyway. That's not the kind of compatibility we care about.

@manneshiva I assume the determination of used/unused is straightforward? Or is any guessing/heuristics involved, any room for error?

@gojomo
Collaborator

gojomo commented Apr 5, 2018

I would think that if some other FastText implementation (like the original) saves out 'untrained garbage' in its serialized model, and loads-and-uses such noise in its subsequent OOV calculations after re-loading, so that it affects (reproducible) evaluations of the frozen model, then we're not really format compatible if we make the independent decision to discard that noise. And, we'll get a continuing tail of "what's up with this?" questions/bug-reports from that decision to be arbitrarily different in how we load a (frozen, completed, original-tool) model.

@jayantj
Contributor

jayantj commented Apr 6, 2018

Along with reproducibility issues, another point of discussion over this was that there is a memory/speed trade-off involved (during loading). Quoting from a previous comment -

"For relatively small vocab sizes (~200k), the steady-state memory usage is 1.1 GB lower than it would be if we chose to keep all ngram vectors as is. (for 300-d vectors). This is at the cost of significantly increased loading time.

Conversely, for large vocab sizes (like for Wikipedia models), we don't reduce memory usage, while also causing much higher load times. (as @gojomo rightly pointed out)

In case the common use case is indeed loading large models, it might make sense to store ngram vectors as is, without trying to discard any unused ones."

For this, Shiva proposed this solution here, which, IMO, gives us the best of both worlds, while also providing exactly the same behaviour by default as the original FastText, which should reduce the follow-up questions/bug reports we get about this. The solution as @manneshiva described it -

"I feel we should give the user an option -- discard_unused_ngrams to save memory, which by default could be False. Since the memory saved for small vocab models is significant (owing to a fewer number of total used ngrams), this should be helpful for a user trying to load a small vocab model with limited RAM."

@manneshiva @piskvorky @gojomo @menshikh-iv do you see any potential issues with this approach? If not, I think we should go ahead with it.

@gojomo has raised a valid concern that, in theory, our heuristic for determining unused vectors could fail if the serialized model has had its vocabulary trimmed after training and before serializing. However, FastText doesn't do anything of the sort (and neither does any other model in my experience), so I think it's a reasonable assumption to make. A note/info-level log in the code could be useful.

@menshikh-iv
Contributor Author

menshikh-iv commented Apr 6, 2018

I agree with #1261 (comment) and @jayantj (looks like the best variant).

We also definitely need to update the documentation in this place (to avoid user confusion).

@Alice-Ke

@manneshiva @gojomo Hi, thanks for your answers. I am encountering a similar problem to the author's. I am confused about where these 'never-encountered (and thus never-trained) n-grams' come from. Are they initialized as a random vector for each bucket at the beginning? Thanks!

@leezu

leezu commented Aug 28, 2018

@Alice-Ke yes, all ngram vectors are initialized randomly before training. The gensim implementation throws away the ones that are not used by any of the known words. However, if you then look up the vector for a previously unknown word which contains some such ngram, gensim will return wrong results, as it threw away the respective ngram vector. Further differences likely come from possibly incorrect handling of unicode characters: #2059
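Concretely, an OOV word's vector is built by averaging the vectors of its n-grams, so dropping even one n-gram vector changes the average and hence the returned vector. A toy sketch with a random stand-in table (all names and the 4-d vectors here are illustrative, not gensim's API):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
# toy n-gram vector table: ngram -> its (trained) vector
ngram_vecs = {g: rng.normal(size=dim) for g in ("<so", "som", "ome", "me>")}

def oov_vector(grams, table):
    """OOV lookup: average the vectors of the n-grams present in the table."""
    hits = [table[g] for g in grams if g in table]
    return np.mean(hits, axis=0) if hits else np.zeros(dim)

grams = ["<so", "som", "ome", "me>"]
full = oov_vector(grams, ngram_vecs)

pruned = dict(ngram_vecs)
del pruned["ome"]               # simulate gensim discarding an "unused" bucket
trimmed = oov_vector(grams, pruned)

print(np.allclose(full, trimmed))  # averages over 4 vs. 3 vectors differ
```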
