FastText memory usage greatly exceeds value returned by estimate_memory
#1824
Comments
Thanks for the report, @jbaiter! The problem happens because this method isn't overridden in the subclass (but it must be). @manneshiva, can you fix this (looks simple)?
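For illustration, a minimal sketch of what such an override could look like, assuming the gensim 3.x attribute names `self.bucket` and `self.vector_size` (the actual fix may well differ):

```python
def estimate_memory(self, vocab_size=None, report=None):
    """Extend Word2Vec's estimate with the cost of the ngram bucket matrix."""
    vocab_size = vocab_size or len(self.wv.vocab)
    report = super(FastText, self).estimate_memory(vocab_size, report)
    # Each hash bucket holds a full float32 embedding row, allocated
    # up front whether or not any corpus ngram hashes to it.
    report['syn0_ngrams'] = self.bucket * self.vector_size * 4
    report['total'] = sum(v for k, v in report.items() if k != 'total')
    return report
```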
I attempted to implement it here: jbaiter@4a3bbca. However, I think that implementing the method is only one step; there's a lot of opportunity to reduce the memory overhead of the current implementation.
I started some naive performance optimizations in a branch, but I don't think I have a complete enough picture of the implementation yet to be confident in those changes.
@jbaiter I like both your proposals.
@jayantj, any comments?
Thank you, I'll try to put together a working pull request over the weekend :-)
That function is indeed no longer used; I wrote it for an earlier version of the memory estimation.
Not sure if directly related, but I am trying to load a pre-trained fastText model (3.9G bin, 1.6G vec) on a Google Colab VM (12 GB memory limit), and the model eats over 11 GB of RAM (warning displayed), after which I can do nothing with it (any call to the model kills the runtime). Using Gensim 3.3.
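For context, loading such a pretrained model in gensim 3.x looks roughly like this (the file name is a placeholder); the `.bin` file contains the full ngram bucket matrix, which is what dominates RAM on load:

```python
from gensim.models import FastText

# Loads the full model, including the (bucket, dim) ngram matrix.
model = FastText.load_fasttext_format('wiki.en.bin')  # placeholder path
vector = model.wv['example']  # lookups combine word and ngram vectors
```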
@menshikh-iv @jbaiter @manneshiva what's the status here? It looks like a rather critical feature/bug.
@piskvorky I got stuck implementing my optimizations, since I became unsure about the bucketing mechanism used for the ngrams. Specifically, from my understanding, it seems that with bucketing every ngram should have an embedding in a bucket, even if that ngram never actually occurred in the original corpus. It would be great if someone more familiar with the code base could look over my changes and offer some guidance/critiques. Should I open a PR for that, even if the code currently does not pass the tests? See my changes here:
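For readers unfamiliar with the scheme: fastText does not store one vector per distinct ngram; it hashes each ngram into one of `bucket` slots of a fixed matrix, so every row must be allocated whether or not any corpus ngram lands on it. A minimal sketch of the hashing, following the FNV-1a variant in the original fastText C++ code (illustrative, not gensim's exact implementation):

```python
def ft_hash(ngram):
    """FNV-1a-style 32-bit hash used by fastText to index ngram buckets."""
    h = 2166136261
    for b in ngram.encode('utf-8'):
        if b >= 128:
            b -= 256                      # the C++ code casts each byte to int8_t
        h = (h ^ (b & 0xFFFFFFFF)) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF   # emulate uint32 overflow
    return h

bucket = 2_000_000                        # fastText's default bucket count
row = ft_hash('<whe') % bucket            # this ngram's row in the (bucket, dim) matrix
```

Because a hash can land anywhere in `[0, bucket)`, the full `bucket x dim` matrix has to exist up front, which is exactly why ngrams that never occurred still "have" an embedding.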
@jbaiter Of course, feel free to open a PR; we'll help you (with the tests too).
I submitted my WIP PR here: #1916 |
Tried the case above on Gensim 3.4 and it worked. Great work. Thank you all. |
Description
When using `gensim.models.fasttext.FastText`, the actual memory usage is much higher (>2x) than predicted by `FastText.estimate_memory`.

My usage scenario is to generate 300-dimensional word embeddings using SkipGram training with window size 8. My corpus has ~55,000,000 documents with ~4,144,457 word types across ~20,000,000,000 tokens. The machine has 16GB of memory, 15GB of which are available to the Gensim process, as well as 16GB of swap space.

The estimated memory usage is ~11.2GB (see below), which is identical to the size estimated for the Word2Vec model with the same parameters. Training with Word2Vec works flawlessly and uses almost exactly as much memory as estimated.
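As a back-of-the-envelope check (not the exact gensim code), the inherited `Word2Vec.estimate_memory` arrives at a figure like that roughly as follows; the ~500 bytes per vocabulary entry is gensim's own heuristic for the vocab dict, and the two big matrices are plain float32 arithmetic:

```python
vocab_size, dim = 4_144_457, 300

vectors = vocab_size * dim * 4      # syn0 (input vectors), float32: ~4.97 GB
syn1neg = vocab_size * dim * 4      # output weights for negative sampling: ~4.97 GB
vocab   = vocab_size * 500          # heuristic per-entry dict overhead: ~2.07 GB

print((vectors + syn1neg + vocab) / 1024**3)  # -> ~11.2 (GiB)
```

None of these terms grows with the ngram buckets, which is why the estimate matches Word2Vec exactly and undershoots FastText badly.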
It seems that `FastText` does not implement its own `estimate_memory` method, but inherits it from the `Word2Vec` class, yielding unreliable values, as can be seen below. The critical section where the most memory is used seems to be this part in `FastText.init_ngrams`:
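A rough paraphrase of the kind of allocation involved (a sketch, not the verbatim gensim source):

```python
import numpy as np

bucket, vector_size = 2_000_000, 300  # fastText default buckets, the reported dim

# One float32 row per hash bucket, allocated up front regardless of how
# many distinct ngrams the corpus actually contains; training adds
# companion arrays (e.g. per-row learning locks) on top of this.
syn0_ngrams = np.zeros((bucket, vector_size), dtype=np.float32)
print(syn0_ngrams.nbytes / 1024**3)   # -> ~2.2 GiB for this matrix alone
```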
:Steps/Code/Corpus to Reproduce
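A reproduction along the lines described above would look roughly like this (the corpus path and iterator are placeholders, not from the original report):

```python
from gensim.models.fasttext import FastText
from gensim.models.word2vec import LineSentence

# Placeholder corpus: one preprocessed document per line.
sentences = LineSentence('/data/corpus.txt')

# The reported configuration: SkipGram, 300 dimensions, window size 8.
model = FastText(sg=1, size=300, window=8)
model.build_vocab(sentences)  # logs the (misleading) estimate_memory figures
model.train(sentences, total_examples=model.corpus_count, epochs=5)
```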
Expected Results
Should finish training without running out of memory.
Actual Results
Runs out of memory.
Versions