LSI worker getting "stuck" #2317
Comments
Hi @robguinness, can you increase the logging level of lsi_worker to DEBUG (in gensim.models.lsi_worker, the line that sets the logging level)?
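For reference, a minimal sketch of what that change amounts to, assuming the worker is the stock `gensim.models.lsi_worker` entry point (which configures logging in its `main()`); the format string below is just the usual gensim convention, not copied from this setup:

```python
# Sketch: raise the LSI worker's log level from INFO to DEBUG.
# If you launch the worker from your own wrapper script, configuring the root
# logger before the worker starts is enough.
import logging

logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.DEBUG,  # the worker's default is logging.INFO
)
```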
Will do ASAP. It might be a day or two before I can get back to this.
Just a small update: I have restarted the modeling process with logging level now set to DEBUG. I'll update when/if it gets stuck again.
I hit the locked up situation again this morning. Here is a log from the LSI worker that is locked up:
It has been in this state for nearly 2 hours now.
Hi @robguinness, that seems to be some issue with the BLAS library you are using. Gensim depends on those to do QR, via scipy (see here). What's your BLAS?

```python
import numpy, scipy
print(numpy.show_config())
print(scipy.show_config())
```
Ok, yes, it could be there is some problem with our BLAS setup. Here is the output:
Hm, I wonder why numpy and scipy show different compilation settings. When you CTRL+C interrupt the stuck program -- what is the traceback? Which Gensim line exactly did it get stuck on? Are there any memory issues on the machine? (IIRC, this is one of the most memory-intensive spots in LSI.)
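Not from the thread, but one low-effort way to capture that traceback without killing the worker is the standard-library `faulthandler` module; a sketch, assuming you can add a couple of lines to the worker's startup code:

```python
# Sketch: register a signal handler so a stuck worker can dump a live Python
# traceback on demand with `kill -USR1 <pid>`, instead of CTRL+C-ing it.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```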
I agree, it doesn't look stuck to me; it looks like a slow QR (because the input matrix is pretty big, I guess).
Nah, QR should take seconds, not hours. This looks stuck. But I'm wondering where exactly, and why. It could be some OpenBlas bug, or the machine went OOM and killed some critical processes…
IMHO, I don't think it was OOM. The machine still had quite a lot of memory available when this happened. I have now compiled numpy and scipy against ATLAS and restarted the process. My understanding is that OpenBlas has some issues, so I hope ATLAS will be more stable.
I don't have the exact traceback available, but from the main script, it was on the `lsi_model.add_documents(...)` call. Also, regarding the OOM possibility, this line was inside a try/except MemoryError. Not sure that would trigger, though, if the lsi_worker process itself caused the OOM.
Can you send the exact worker traceback next time it gets stuck? Not sure what you mean by the try/except MemoryError -- where is that?
Sure thing.
In my code. It looks like this:

```python
try:
    lsi_model.add_documents(corpus[i:current_index])
except MemoryError as error:
    print_memory_usage()
    print(error)
    exc_type, exc_value, exc_traceback = sys.exc_info()
    traceback.print_tb(exc_traceback)
```
An update... I am now re-running the process, now using libatlas as the underlying BLAS. The workers aren't getting stuck, which is good, but there seems to be another issue: the memory usage of the workers is steadily increasing, which seems a bit odd to me. At least, I've never seen this happen before. The workers should release their memory after each job, correct?
Yes, they do release memory after each job. The Gensim LSI code is pure Python, so that cannot leak. But it could be some artefact of Python memory management, or some other dependency leaking. How much of an increase are you seeing (absolute + time derivative)? Does it stop after a while, or keep growing steadily?
The process has been running for about 6.5 hours, and the workers are now consuming about 24-25 GB each. The growth seems pretty steady, and my guess is it will keep rising until throwing an OOM error. I will add some instrumentation to print more precise memory stats to the log, since right now I am just going based on watching htop. I would assume the memory leak is in libatlas, since this did not happen before switching from OpenBlas to libatlas. Any recommendations on which BLAS to use (Linux, please)?
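A sketch of the kind of instrumentation meant here, using only the standard library (`ru_maxrss` is reported in kilobytes on Linux); the `log_memory_usage` helper name is illustrative, not something from this thread:

```python
# Sketch: log the worker's peak resident set size, e.g. after each job.
import logging
import resource

logger = logging.getLogger(__name__)

def log_memory_usage(label):
    # ru_maxrss is in kilobytes on Linux.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    logger.info("%s: peak RSS = %.1f GB", label, peak_kb / (1024 ** 2))
```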
I use OpenBlas, or if not available, ATLAS. I'm wondering if it's something with your particular setup or HW, as opposed to the library itself. It's really strange. @menshikh-iv were there any changes to LSI on our side lately? Any cythonization, refactoring, …?
@piskvorky in 2018 - only cleanup (improved documentation, better six usage, etc.), no optimizations or refactoring.
Ok, I'm going to try updating libatlas, and if it still leaks, then I will try switching to OpenBlas. I will also try to build OpenBlas from source on this machine. I'll keep you updated. BTW, should I open up a different Github issue, since the memory leak seems to be a completely different issue than the original one? (I could be wrong about that, but my gut tells me they are unrelated.)
No, let's keep it in one place. I'm still not sure what the actual issue is; the BLAS leak is just a hypothesis (very strange, these libs are well-tested and widely used). @robguinness how dense is your corpus?
Sure, here are some stats:

n_docs: 7138401

So it's pretty sparse, and I wouldn't expect any steep spikes in density. I've upgraded the system to Python 3.7.2, but the workers still seem to use more and more memory as the process runs. So I suspect a leak somewhere, but I'm not sure where either.
Thanks. What's the maximum nnz in one doc? Also, what is the type of your corpus?
Here is the max, as well as a few other statistics concerning nnz that may (or may not) be useful:

max: 17687

I'm using ShardedCorpus.
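For anyone reproducing these numbers, a sketch of how such nnz statistics can be computed by streaming over a corpus in standard Gensim BoW format (one pass, nothing kept in memory beyond the list of counts):

```python
# Sketch: per-document non-zero counts for a streamed BoW corpus.
nnz_per_doc = [len(doc) for doc in corpus]  # each doc is a list of (feature_id, weight) tuples

print("n_docs:", len(nnz_per_doc))
print("max nnz:", max(nnz_per_doc))
print("mean nnz:", sum(nnz_per_doc) / len(nnz_per_doc))
```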
OK, thanks. That looks fine. I'm not very familiar with ShardedCorpus -- can you try converting your corpus to MmCorpus and training from that instead?

```python
MmCorpus.serialize("/tmp/corpus.mm", corpus)
```

(you can also compress the resulting file to save disk space)
Do you mean convert the sharded corpus to an MmCorpus, or regenerate the corpus from scratch as an MmCorpus? I naively tried the first option, i.e.:
But this fails with the following output:
If it is not possible to convert a ShardedCorpus to an MmCorpus, then I can regenerate it, but it will take a while.
I meant convert it (not re-generate). Reading the ShardedCorpus docs, apparently iterating over it works differently than other Gensim corpora. To get standard behaviour, you'll need to set its `gensim` flag to True:

```python
# Convert ShardedCorpus into MatrixMarket format.
sharded_corpus = ShardedCorpus.load(corpus_path)
sharded_corpus.gensim = True
MmCorpus.serialize("/data/tmp/corpus.mm", sharded_corpus)
# Optionally, manually compress /data/tmp/corpus.mm to /data/tmp/corpus.mm.gz

# Load the converted MatrixMarket corpus, keep working with that from now on.
mm_corpus = MmCorpus("/data/tmp/corpus.mm.gz")
```

I suspect this might also be related to your memory woes: the ShardedCorpus iterator returns vectors of a different type by default (numpy arrays), whereas all models expect the vectors in standard Gensim format. Let me know if setting the `gensim` flag helps.

@robguinness Why are you using ShardedCorpus in the first place?
I actually tried that already, and it still gave the same error. I will try again to be doubly sure. Regarding the choice of ShardedCorpus, that decision was made before I joined the company, but I think it was probably due to the large size of the corpus.
I just checked from our code, and the …
@robguinness check what ShardedCorpus generates -- it should be docs in a BoW format, I think.
@robguinness what does … ?
(I'm not really sure what the argument … refers to.) I'm currently trying a small hack to load the entire corpus into memory as a list:

```python
corpus_all = []
for c_gen in corpus:
    for c in c_gen:
        corpus_all.append(c)
```

Not sure if it will fit though...
Oh, I'm an idiot. Of course, line …
I've copy-pasted the exact function calls from the code to make it easier to orient yourself in the code. Thanks for a great answer with a slice of shell -- it's very helpful.
MmWriter can receive (according to the code) an iterator of docs if you don't care about or don't have metadata in the corpus, so you can minimize memory consumption and avoid loading everything into memory.
@robguinness that looks like a bug in ShardedCorpus; the result ought to be a sparse vector in Gensim format: a list of 2-tuples. Can you give me … ?
I managed to create an MmCorpus and use that as the input to the LsiModel, but the same leaky behavior occurs. After each job, the amount of memory used by a worker increases by roughly the same amount. I think the problem is really that the workers are not releasing the memory, as they should.
@robguinness That doesn't look right, and will not work. Can you send your output for … ?
@piskvorky bug in the …
Last night I added some memory usage logging via the `tracemalloc` module.
Here are lines 252-256 of `lsimodel.py`:

```python
if self.u is None:
    # we are empty => result of merge is the other projection, whatever it is
    self.u = other.u.copy()
    self.s = other.s.copy()
    return
```

I am wondering if perhaps the …

P.S. I have read elsewhere that, while classical memory leaks are not possible within Python, they can occur in Python applications when using C extensions, OR memory can grow unexpectedly (but technically not a memory leak) due to things like circular references. See, for example: …
@robguinness and this is with MmCorpus, right? Because while the iteration output for MmCorpus is correct, the output for ShardedCorpus is broken. I don't understand how ShardedCorpus even worked for you; that should have been throwing exceptions right away. Weird. If you're seeing the leak with MmCorpus too, the only way forward is to debug in detail. Does the issue happen even in serial mode (no distributed workers)? Or try an older version such as 0.13.4 (LSI didn't significantly change in years), to rule out hidden errors due to some recent refactoring.
The above output was using ShardedCorpus with all ~7 million docs, but I saw the same behavior using MmCorpus with a smaller number of docs (since the full set won't fit into memory). That's really strange about ShardedCorpus, because we have used it in our code for the last few years without seeing this particular issue.
In serial mode there are no workers, right? So I wouldn't see quite the same signature. But I can do some tests to see whether the memory of the main process progressively grows. I can try downgrading gensim, too, but I'm really starting to suspect that the problem is somewhere below gensim... e.g. numpy, scipy, or BLAS. You really mean 0.13.4 of gensim? That is quite old indeed! ;-)
MmCorpus doesn't load the full document set into memory, so that's no reason to use ShardedCorpus. I'd propose you avoid ShardedCorpus; not sure what's going on there… maybe some recent refactoring… I'm not familiar with that module. But I'd expect our unit tests to catch such a fundamental problem, strange. Either way, can you try training with MmCorpus? @robguinness How much gain do you see from using distributed LSI? How many workers / how many hours, compared to a simple serial run? And yes, 0.13.4 is old… but LSI/SVD is even older :-) Let's start by ditching ShardedCorpus first though.
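A sketch of the suggested sanity check -- a serial (non-distributed) run straight off the converted MmCorpus; the dictionary path and `num_topics=500` are placeholders, not values from this thread:

```python
# Sketch: serial LSI training on the MatrixMarket corpus, no dispatcher/workers involved.
from gensim.corpora import Dictionary, MmCorpus
from gensim.models import LsiModel

dictionary = Dictionary.load("/data/tmp/dictionary.dict")  # placeholder path
mm_corpus = MmCorpus("/data/tmp/corpus.mm")                # streamed, not loaded into RAM
lsi = LsiModel(corpus=mm_corpus, id2word=dictionary, num_topics=500, distributed=False)
```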
Here is some output from a run using MmCorpus (still with gensim 3.6.0; I'll try downgrading next). The process has been running for about an hour, and you can see the number of objects and the overall size increase with each job (see … ).
If I let it run for a couple more hours, I'm quite sure it will keep increasing. I will kill it now, though, and try downgrading to 0.13.4 to see if this issue goes away.
Quite a bit of gain, but I don't have exact numbers handy. In this run I am using 8 workers, but depending on the other parameters (BOW dimension, number of topics, etc.), I have used as few as 4 and as many as 16.
FYI, downgrading to gensim 0.13.4 also causes downgrades to numpy-1.16.0 and scipy-1.2.0. Just a note that if the problem goes away, it might be hard to trace down the cause, since we won't know if it was a bug in the more recent releases of these packages.
I got an error when trying to run under gensim 0.13.4:
It seems that the phraser that was built with gensim 3.6.0 is not compatible with 0.13.4. I would have to regenerate it, which for the whole 7 million documents will take a while. So please give me a few days.
@robguinness I think the problem is not in gensim itself, but in libatlas. Of course, copying a big array causes large allocations, but tracemalloc shows Python allocations only, so if the memory leak is in another library we will not see it.
@horpto I'm not sure I completely follow you. The memory usage of Python itself is growing in the worker, as can be seen from the tracemalloc output above.
What do these … ? What does … ?

@robguinness Before downgrading, I'd urge you to try training with MmCorpus (it's streamed, fast, and doesn't load all docs into RAM). No ShardedCorpus. Especially if you can reproduce the issue in serial LSI mode too -- that simplifies testing and replication tremendously.
@robguinness ok, you're right. I've tested tracemalloc on this example:

```python
import tracemalloc as tm

def test():
    return [i*i for i in range(1_000_000)]

tm.start()
snap1 = tm.take_snapshot()
a = test()
snap2 = tm.take_snapshot()
top_stats = snap2.statistics('lineno')
print("[ Top 10 ]")
for stat in top_stats[:10]:
    print(stat)
print("DIFF")
top_stats = snap2.compare_to(snap1, 'lineno')
for stat in top_stats[:10]:
    print(stat)

a = test()
snap3 = tm.take_snapshot()
top_stats = snap3.statistics('lineno')
print("[ Top 10 ]")
for stat in top_stats[:10]:
    print(stat)
print("DIFF")
top_stats = snap3.compare_to(snap2, 'lineno')
for stat in top_stats[:10]:
    print(stat)
print("DIFF2")
top_stats = snap3.compare_to(snap1, 'lineno')
for stat in top_stats[:10]:
    print(stat)
```
If you turn off distributed mode, will the memory issue stay? (I think yes, it stays.)
I'm out sick today, so I'll probably get back to troubleshooting on Monday. One small comment for @piskvorky... you are right, Phraser is completely unrelated to this. We were unnecessarily loading the serialized Phraser object in our code. I can just get rid of that line completely. Thanks for the help and suggestions from everyone.
Hi all, I have tentatively tracked the memory leak down to scipy/numpy. I'm using an older release of both libraries now, and the problem seems to have gone away. I will try to reproduce the issue and narrow down the cause more conclusively, but at the moment we need to get this model built, so the machine is unavailable.
No problem. Make sure to replace …
@robguinness How's it going? Were you able to pin down the source of the problem? Please let us know. I'm tempted to close this because it seems to be unrelated to gensim directly.
We eventually figured out (somewhat conclusively) that the reason had to do with the fact that we were building scipy/numpy off the master branches of those repositories. In the case of both scipy and numpy, the master branch is actually the latest development branch and rather unstable. Once we built scipy and numpy from official release branches (tags), the problem went away. So my best guess is that the problem was upstream in either numpy or scipy.
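A quick way to verify which builds ended up installed (development builds from master typically carry a `.dev0+...` suffix in their version string); just a sketch:

```python
# Sketch: confirm numpy/scipy are official releases rather than dev builds from master.
import numpy
import scipy

print("numpy:", numpy.__version__)  # e.g. "1.15.2"; a master build looks like "1.16.0.dev0+..."
print("scipy:", scipy.__version__)
```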
Thanks @robguinness. Closing here.
Description
When building an LsiModel in distributed mode, one of the workers gets "stuck" while orthonormalizing the action matrix. This stalls the whole process of building the model, as the dispatcher hangs on "reached the end of input; now waiting for all remaining jobs to finish".
Steps/Code/Corpus to Reproduce
The LSI dispatcher and workers are initialized in a separate bash script. I have tried with the number of LSI workers set to 16 and to 8. (A sketch of the distributed build call is shown below.)
Gensim version: 3.6.0
Pyro4 version: 4.63
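For context, a sketch of the distributed build itself; the dispatcher and worker processes are assumed to be already running (launched from the bash script mentioned above, which is not shown), and the paths and `num_topics` are placeholders rather than values from this issue:

```python
# Sketch: kick off a distributed LSI build once lsi_dispatcher and the lsi_worker
# processes are up; gensim reaches them over Pyro4.
from gensim.corpora import Dictionary, MmCorpus
from gensim.models import LsiModel

dictionary = Dictionary.load("/data/tmp/dictionary.dict")  # placeholder path
corpus = MmCorpus("/data/tmp/corpus.mm")                   # placeholder path
lsi_model = LsiModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=500,     # placeholder value
    distributed=True,   # hand chunks of the corpus out to the workers
)
```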
Expected Results
The process should run to completion.
Actual Results
Main script output:
Output of LSI worker that is stuck:
CPU for that LSI worker has been ~100% for >24 hours.
Versions
Linux-4.10.0-38-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.15.2
SciPy 1.1.0
gensim 3.6.0
FAST_VERSION 1