
FastText memory usage greatly exceeds value returned by estimate_memory #1824

Closed
jbaiter opened this issue Jan 3, 2018 · 10 comments
Labels
bug (issue describes a bug) · difficulty easy (easy issue: requires a small fix)

Comments

@jbaiter
Contributor

jbaiter commented Jan 3, 2018

Description

When using gensim.models.fasttext.FastText, the actual memory usage is much higher (>2x) than predicted by FastText.estimate_memory.
My usage scenario is generating 300-dimensional word embeddings with skip-gram training and a window size of 8. My corpus has ~55,000,000 documents with ~4,144,457 word types across ~20,000,000,000 tokens. The machine has 16GB of memory, 15GB of which is available to the Gensim process, plus 16GB of swap space.

The estimated memory usage is ~11.2GB (see below), which is identical to the size estimated for the Word2Vec model with the same parameters. Training with Word2Vec works flawlessly and uses almost exactly as much memory as estimated.

It seems that FastText does not implement its own estimate_memory method but inherits it from the Word2Vec class, which yields unreliable values, as seen below. The critical section where most of the memory is used appears to be this part of FastText.init_ngrams:

all_ngrams = []
for w, v in self.wv.vocab.items():
    self.wv.ngrams_word[w] = compute_ngrams(w, self.min_n, self.max_n)
    all_ngrams += self.wv.ngrams_word[w]
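For a sense of scale, here is a minimal sketch of FastText-style character n-gram extraction (following the fastText convention of wrapping the word in `<`/`>`; this is an illustration, not gensim's actual compute_ngrams):

```python
def char_ngrams(word, min_n, max_n):
    # Wrap the word in boundary markers, as fastText does, then emit
    # every character n-gram with length between min_n and max_n.
    extended = "<%s>" % word
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams

# A 5-character word already yields 14 n-grams for min_n=3, max_n=6;
# accumulating all of them for ~4.1M word types in one flat Python list
# is what makes all_ngrams so expensive.
print(len(char_ngrams("where", 3, 6)))  # 14
```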

Steps/Code/Corpus to Reproduce

from gensim.models import fasttext

model = fasttext.FastText(size=300, sg=1, window=8, min_count=50, workers=8, iter=5)

# Word frequencies loaded from a finite state transducer on disk, i.e. no memory usage
freqs = load_frequencies()
vocab_size = sum(1 for typ, cnt in freqs.items() if cnt >= 50)
model.estimate_memory(vocab_size=vocab_size, report=True)
# { 'syn0': 4973348400,
#   'syn1neg': 4973348400,
#   'vocab': 2072228500,
#   'total': 12018925300 }
# I.e. ~11.2GB, well within the available memory

model.build_vocab_from_freq(freqs, corpus_count=54878750)
# Memory usage is at ~7GB now, identical to Word2Vec

model.init_ngrams()
# ... Killed by OOM killer after swap space has run out
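For reference, the inherited estimate is easy to reproduce by hand. A back-of-envelope sketch, assuming float32 vectors and Word2Vec's rough 500 bytes of bookkeeping per vocab entry:

```python
vocab_size = 4144457  # word types with count >= 50
dim = 300

syn0 = vocab_size * dim * 4     # one float32 vector per word
syn1neg = vocab_size * dim * 4  # negative-sampling output layer
vocab = vocab_size * 500        # rough per-entry bookkeeping cost
total = syn0 + syn1neg + vocab

print(total)  # 12018925300 -- the report above, with no ngram term at all
```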

Expected Results

Should finish training without running out of memory.

Actual Results

Runs out of memory.

Versions

Linux-4.10.0-28-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.13.3
SciPy 1.0.0
gensim 3.2.0
FAST_VERSION (fasttext) 1
FAST_VERSION (word2vec) 1
@menshikh-iv
Contributor

Thanks for the report, @jbaiter!

The problem happens because this method isn't overridden in the subclass (but it should be). @manneshiva, can you fix this? It looks simple.

@menshikh-iv added the bug and difficulty easy labels on Jan 8, 2018
@jbaiter
Contributor Author

jbaiter commented Jan 9, 2018

I attempted to implement it here: jbaiter@4a3bbca

However, I think that implementing the method is only one step. There's a lot of opportunity to reduce the memory overhead of the current implementation:

  • all_ngrams in the above code snippet uses a lot of memory; is such a huge temporary data structure really necessary?
  • Why keep a word -> [ngrams] mapping around for every word in the vocabulary at all? It only seems to be used to look up the ngram bucket, which could be computed on the fly from the vocabulary word alone.

I started some naive performance optimizations in a branch, but I don't think I have a complete enough picture of the implementation yet to be confident in those changes.
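For reference, the on-the-fly lookup suggested in the second bullet is cheap, because fastText maps an n-gram to its bucket with a simple 32-bit FNV-1a hash. A sketch (the constants follow the reference fastText implementation; hashing the UTF-8 bytes is an assumption about the exact variant gensim uses):

```python
def ft_hash(ngram):
    # 32-bit FNV-1a over the UTF-8 bytes of the n-gram
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def ngram_bucket(ngram, num_buckets=2000000):
    # The bucket index is recomputable from the n-gram alone, so no
    # word -> [ngrams] mapping has to be kept resident in memory.
    return ft_hash(ngram) % num_buckets
```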

@manneshiva
Contributor

@jbaiter I like both your proposals. all_ngrams is not really needed and can indeed be a huge temporary memory overhead. I also agree with discarding the word -> [ngrams] mappings and calculating the ngrams on the fly. Considering that you have Cythonised the compute_ngrams function, this shouldn't drastically affect the model's training time. I have gone through your code and it looks good to me, except for a couple of minor issues:

  1. You might have missed syn0_vocab and ngrams in estimate_memory (also in unittest).
  2. Not sure where you would be using compute_num_ngrams.

@jayantj any comments?
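Putting these pieces together, a FastText-aware estimator would add terms for the whole-word input vectors and the bucketed n-gram matrix. A hypothetical sketch (the names, defaults, and per-word n-gram average here are assumptions, not gensim's actual API):

```python
def estimate_fasttext_memory(vocab_size, dim=300, buckets=2000000,
                             avg_ngrams_per_word=15):
    # Hypothetical back-of-envelope estimator; avg_ngrams_per_word in
    # particular depends on word lengths and the min_n/max_n settings.
    report = {}
    report['vocab'] = vocab_size * 500            # per-entry bookkeeping
    report['syn0_vocab'] = vocab_size * dim * 4   # input vectors for whole words
    report['syn1neg'] = vocab_size * dim * 4      # output layer
    report['syn0_ngrams'] = buckets * dim * 4     # one vector per ngram bucket
    report['ngram_refs'] = vocab_size * avg_ngrams_per_word * 8  # bucket indices
    report['total'] = sum(report.values())
    return report

# With the default 2M buckets and dim=300, the ngram matrix alone adds
# ~2.4 GB -- the term the inherited Word2Vec estimate misses entirely.
```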

@jbaiter
Contributor Author

jbaiter commented Jan 11, 2018

You might have missed syn0_vocab and ngrams in estimate_memory (also in unittest).

Thank you, I'll try to put together a working pull request over the weekend :-)

Not sure where you would be using compute_num_ngrams.

That function is indeed no longer used; I wrote it for an earlier version of the memory estimation.

@abedkhooli

Not sure if this is directly related, but I am trying to load a pre-trained fastText model (3.9G .bin, 1.6G .vec) on a Google Colab VM (12 GB memory limit), and the model eats over 11 GB of RAM (a warning is displayed), after which I can do nothing with it (any call to the model kills the runtime). Using Gensim 3.3.
model = FastText.load_fasttext_format('wiki.ar')
Question: roughly how much memory would it take to load that model?
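A rough back-of-envelope, assuming the published wiki models' defaults (dim=300, 2 million n-gram buckets) and a guessed ~600k-word vocabulary for wiki.ar:

```python
dim = 300
buckets = 2000000   # default bucket count for the published wiki models
nwords = 600000     # assumed wiki.ar vocabulary size -- a guess

floats = (nwords + buckets) * dim  # input vectors (words + ngram buckets)
gib = floats * 4 / 2**30           # float32
print(round(gib, 2))  # ~2.9 GiB before loading overhead, copies, or training state
```

Peak usage while loading is typically a multiple of this baseline, which is consistent with the ~11 GB observed above.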

@piskvorky
Owner

@menshikh-iv @jbaiter @manneshiva what's the status here? It looks like a rather critical feature/bug.

@jbaiter
Contributor Author

jbaiter commented Feb 19, 2018

@piskvorky I got stuck implementing my optimizations because I became unsure about the bucketing mechanism used for the ngrams. As I understand it, with bucketing every ngram should have an embedding in some bucket, even if that ngram never actually occurred in the original corpus.

It would be great if someone more familiar with the code base could look over my changes and offer some guidance/critique. Should I open a PR for that, even if the code as it currently stands does not pass the tests?

See my changes here:
develop...jbaiter:fasttext-optimization

@menshikh-iv
Contributor

@jbaiter of course, feel free to open a PR; we'll help you (with tests too).
CC: @manneshiva

@jbaiter
Contributor Author

jbaiter commented Feb 19, 2018

I submitted my WIP PR here: #1916

@abedkhooli

Tried the case above on Gensim 3.4 and it worked. Great work. Thank you all.
