AuthorTopicModel memory issue #1947
Labels
bug: Issue described a bug
difficulty medium: Medium issue: required good gensim understanding & python skills
performance: Issue related to performance (in HW meaning)
Intro
Recently, I have often received negative feedback about ATM.
The main reason is memory issues (excessive memory consumption); related mailing list threads (latest):
I decided to figure out what was going on.
Investigation
I ran ATM on data provided by the author of https://groups.google.com/forum/#!searchin/gensim/author|sort:date/gensim/gG7aiNI1v-Y/SWPMuP8BAwAJ (I can't publish the data yet; I'm waiting for permission from its owner).
Basic stats of data:
author2doc mapping: 106133
author2doc mapping: 73248
I ran it with a debugger and found that the largest memory consumption happens here:
https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/models/atmodel.py#L680-L684
I stopped it when the process was already consuming 8GB of RAM. Some useful statistics are presented in the table:
len(author2doc.keys()) | author2doc.keys().index(_) | len(train_corpus_idx)
train_corpus_idx is the biggest memory consumer. Here, we essentially load the whole corpus into memory (and this isn't "online" or "batch" processing).
By a simple calculation, when the loop finishes, the process will consume ~232GB of RAM.
This is definitely unacceptable and makes the model unusable even for some learning tasks (I'm not even talking about "real" tasks).
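To illustrate the kind of growth described above, here is a minimal sketch. It is not a copy of the linked gensim code: it assumes the problematic pattern is a list that gets extended with document IDs for every author and is only deduplicated afterwards. The toy `author2doc` data and both helper functions are hypothetical and exist only for this illustration.

```python
# Toy stand-in for an author -> document-ID mapping (hypothetical data).
author2doc = {
    "alice": [0, 1, 2],
    "bob": [1, 2, 3],
    "carol": [2, 3, 4],
}


def collect_with_list(author2doc):
    """Extend one list with every author's IDs; deduplicate only at the end."""
    idx = []
    for doc_ids in author2doc.values():
        idx.extend(doc_ids)  # duplicates pile up until the final set() call
    peak = len(idx)  # the list is largest right before deduplication
    return sorted(set(idx)), peak


def collect_with_set(author2doc):
    """Deduplicate while collecting, so the container holds only unique IDs."""
    idx = set()
    peak = 0
    for doc_ids in author2doc.values():
        idx.update(doc_ids)
        peak = max(peak, len(idx))
    return sorted(idx), peak


list_result, list_peak = collect_with_list(author2doc)
set_result, set_peak = collect_with_set(author2doc)
print(list_result == set_result, list_peak, set_peak)
```

Both variants return the same unique document IDs, but the set-based one never holds more entries than there are distinct documents. Whether this is the right fix for ATM depends on what the linked lines actually do, so treat it as a starting point for discussion only.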
@olavurmortensen, can you look into this problem? This is supercritical.
Related PR - #893.