
FastText memory usage greatly exceeds value returned by estimate_memory #1824

Closed
jbaiter opened this issue Jan 3, 2018 · 10 comments
Labels
bug (issue describes a bug) · difficulty easy (easy issue: requires a small fix)

Comments

@jbaiter
Contributor

jbaiter commented Jan 3, 2018

Description

When using gensim.models.fasttext.FastText, the actual memory usage is much higher (>2x) than predicted by FastText.estimate_memory.
My usage scenario is generating 300-dimensional word embeddings with skip-gram training and a window size of 8. My corpus has ~55,000,000 documents with ~4,144,457 word types across ~20,000,000,000 tokens. The machine has 16GB of memory, 15GB of which is available to the Gensim process, plus 16GB of swap space.

The estimated memory usage is ~11.2GB (see below), which is identical to the size estimated for the Word2Vec model with the same parameters. Training with Word2Vec works flawlessly and uses almost exactly as much memory as estimated.

It seems that FastText does not implement its own estimate_memory method but inherits it from the Word2Vec class, which yields unreliable values, as seen below. The critical section where most of the memory is used appears to be this part of FastText.init_ngrams:

all_ngrams = []
for w, v in self.wv.vocab.items():
    self.wv.ngrams_word[w] = compute_ngrams(w, self.min_n, self.max_n)
    all_ngrams += self.wv.ngrams_word[w]
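For a sense of scale, here is a minimal sketch of FastText-style character n-gram extraction (following the fastText convention of wrapping the word in `<`/`>`; this is an illustration, not gensim's actual compute_ngrams):

```python
def char_ngrams(word, min_n, max_n):
    # Wrap the word in boundary markers, as fastText does, then emit
    # every character n-gram with length between min_n and max_n.
    extended = "<%s>" % word
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams

# A 5-character word already yields 14 n-grams for min_n=3, max_n=6;
# accumulating all of them for ~4.1M word types in one flat Python list
# is what makes all_ngrams so expensive.
print(len(char_ngrams("where", 3, 6)))  # 14
```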

Steps/Code/Corpus to Reproduce

from gensim.models import fasttext

model = fasttext.FastText(size=300, sg=1, window=8, min_count=50, workers=8, iter=5)

# Word frequencies loaded from a finite state transducer on disk, i.e. no memory usage
freqs = load_frequencies()
vocab_size = sum(1 for typ, cnt in freqs.items() if cnt >= 50)
model.estimate_memory(vocab_size=vocab_size, report=True)
# { 'syn0': 4973348400,
#   'syn1neg': 4973348400,
#   'vocab': 2072228500,
#   'total': 12018925300 }
# I.e. ~11.2GB, well within the available memory

model.build_vocab_from_freq(freqs, corpus_count=54878750)
# Memory usage is at ~7GB now, identical to Word2Vec

model.init_ngrams()
# ... Killed by OOM killer after swap space has run out
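For reference, the inherited estimate is easy to reproduce by hand. A back-of-envelope sketch, assuming float32 vectors and Word2Vec's rough 500 bytes of bookkeeping per vocab entry:

```python
vocab_size = 4144457  # word types with count >= 50
dim = 300

syn0 = vocab_size * dim * 4     # one float32 vector per word
syn1neg = vocab_size * dim * 4  # negative-sampling output layer
vocab = vocab_size * 500        # rough per-entry bookkeeping cost
total = syn0 + syn1neg + vocab

print(total)  # 12018925300 -- the report above, with no ngram term at all
```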

Expected Results

Should finish training without running out of memory.

Actual Results

Runs out of memory.

Versions

Linux-4.10.0-28-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.13.3
SciPy 1.0.0
gensim 3.2.0
FAST_VERSION (fasttext) 1
FAST_VERSION (word2vec) 1
@menshikh-iv
Contributor

Thanks for the report, @jbaiter!

The problem happens because this method isn't overridden in the subclass (but it should be). @manneshiva, can you fix this? It looks simple.

@menshikh-iv added the bug and difficulty easy labels on Jan 8, 2018
@jbaiter
Contributor Author

jbaiter commented Jan 9, 2018

I attempted to implement it here: jbaiter@4a3bbca

However, I think that implementing the method is only one step. There's a lot of opportunity to reduce the memory overhead of the current implementation:

  • all_ngrams in the above code snippet uses a lot of memory; is such a huge temporary data structure really necessary?
  • Why keep a word -> [ngrams] mapping around for every word in the vocabulary at all? It only seems to be used to look up the ngram bucket, which could be computed on the fly from the vocabulary word alone.

I started some naive performance optimizations in a branch, but I don't think I have a complete enough picture of the implementation yet to be confident in those changes.
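For reference, the on-the-fly lookup suggested in the second bullet is cheap, because fastText maps an n-gram to its bucket with a simple 32-bit FNV-1a hash. A sketch (the constants follow the reference fastText implementation; hashing the UTF-8 bytes is an assumption about the exact variant gensim uses):

```python
def ft_hash(ngram):
    # 32-bit FNV-1a over the UTF-8 bytes of the n-gram
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def ngram_bucket(ngram, num_buckets=2000000):
    # The bucket index is recomputable from the n-gram alone, so no
    # word -> [ngrams] mapping has to be kept resident in memory.
    return ft_hash(ngram) % num_buckets
```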

@manneshiva
Contributor

@jbaiter I like both your proposals. all_ngrams is not really needed and can indeed be a huge temporary memory overhead. I also agree with discarding the word -> [ngrams] mappings and calculating the ngrams on the fly. Considering that you have Cythonised the compute_ngrams function, this shouldn't drastically affect the model's training time. I have gone through your code and it looks good to me, except for a couple of minor issues:

  1. You might have missed syn0_vocab and ngrams in estimate_memory (also in unittest).
  2. Not sure where you would be using compute_num_ngrams.

@jayantj any comments?
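Putting these pieces together, a FastText-aware estimator would add terms for the whole-word input vectors and the bucketed n-gram matrix. A hypothetical sketch (the names, defaults, and per-word n-gram average here are assumptions, not gensim's actual API):

```python
def estimate_fasttext_memory(vocab_size, dim=300, buckets=2000000,
                             avg_ngrams_per_word=15):
    # Hypothetical back-of-envelope estimator; avg_ngrams_per_word in
    # particular depends on word lengths and the min_n/max_n settings.
    report = {}
    report['vocab'] = vocab_size * 500            # per-entry bookkeeping
    report['syn0_vocab'] = vocab_size * dim * 4   # input vectors for whole words
    report['syn1neg'] = vocab_size * dim * 4      # output layer
    report['syn0_ngrams'] = buckets * dim * 4     # one vector per ngram bucket
    report['ngram_refs'] = vocab_size * avg_ngrams_per_word * 8  # bucket indices
    report['total'] = sum(report.values())
    return report

# With the default 2M buckets and dim=300, the ngram matrix alone adds
# ~2.4 GB -- the term the inherited Word2Vec estimate misses entirely.
```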

@jbaiter
Contributor Author

jbaiter commented Jan 11, 2018

You might have missed syn0_vocab and ngrams in estimate_memory (also in unittest).

Thank you, I'll try to put together a working pull request over the weekend :-)

Not sure where you would be using compute_num_ngrams.

That function is indeed no longer used; I wrote it for an earlier version of the memory estimation.

@abedkhooli

Not sure if this is directly related, but I am trying to load a pre-trained fastText model (3.9G .bin, 1.6G .vec) on a Google Colab VM (12 GB memory limit), and the model eats over 11 GB of RAM (a warning is displayed), after which I can do nothing with it (any call to the model kills the runtime). Using Gensim 3.3.
model = FastText.load_fasttext_format('wiki.ar')
Question: roughly how much memory would it take to load that model?
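A rough back-of-envelope, assuming the published wiki models' defaults (dim=300, 2 million n-gram buckets) and a guessed ~600k-word vocabulary for wiki.ar:

```python
dim = 300
buckets = 2000000   # default bucket count for the published wiki models
nwords = 600000     # assumed wiki.ar vocabulary size -- a guess

floats = (nwords + buckets) * dim  # input vectors (words + ngram buckets)
gib = floats * 4 / 2**30           # float32
print(round(gib, 2))  # ~2.9 GiB before loading overhead, copies, or training state
```

Peak usage while loading is typically a multiple of this baseline, which is consistent with the ~11 GB observed above.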

@piskvorky
Owner

@menshikh-iv @jbaiter @manneshiva what's the status here? It looks like a rather critical feature/bug.

@jbaiter
Contributor Author

jbaiter commented Feb 19, 2018

@piskvorky I got stuck implementing my optimizations because I became unsure about the bucketing mechanism used for the ngrams. As I understand it, with bucketing every ngram should have an embedding in some bucket, even if that ngram never actually occurred in the original corpus.

It would be great if someone more familiar with the code base could look over my changes and offer some guidance/critique. Should I open a PR for that, even if the code as it currently stands does not pass the tests?

See my changes here:
develop...jbaiter:fasttext-optimization

@menshikh-iv
Contributor

@jbaiter of course, feel free to open a PR; we'll help you (with tests too).
CC: @manneshiva

@jbaiter
Contributor Author

jbaiter commented Feb 19, 2018

I submitted my WIP PR here: #1916

@abedkhooli

Tried the case above on Gensim 3.4 and it worked. Great work. Thank you all.
