Streaming Data Memory Growth (reprise) #5083
Comments
Thanks for the report! First, I found a memory leak due to an extraneous iterator that should be fixed by #5101. Second, on the usage side, see the relevant code at lines 807 to 835 in 2281c47.

In general, the vocab size should be limited to basically its original size. If you know for sure that you won't need to inspect older docs, you can periodically reduce the vocab.

You will see that the vocab grows slightly beyond 10K because the vocab currently allows very short lexemes to be added at any point, but this levels off relatively quickly. You can easily save an intermediate copy of the vocab first if you need it. Also be aware that this will still have memory problems with multiprocessing and an infinite text generator. I'm not quite sure yet why.
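The code snippet the maintainer shared is not captured in this copy of the thread. As a hedged reconstruction only: assuming spaCy v2.x, where `Language.pipe` accepts an experimental `cleanup` flag, and using a made-up text generator and model name, the kind of loop described might look like this:

```python
import random

import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: any pretrained English model

def text_stream():
    """Yield an endless stream of texts built from pseudo-random 'words'."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    while True:
        yield " ".join(
            "".join(random.choices(letters, k=random.randint(3, 8)))
            for _ in range(20)
        )

# With cleanup=True, strings no longer referenced by recent docs are
# periodically freed, so len(nlp.vocab.strings) should level off near its
# original size (plus a slow trickle of very short lexemes).
for i, doc in enumerate(nlp.pipe(text_stream(), batch_size=100, cleanup=True)):
    if i % 1000 == 0:
        print(i, len(nlp.vocab.strings))
    if i >= 20_000:
        break
```

As noted above, a setup like this can still show memory growth when combined with multiprocessing and an infinite generator.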
@adrianeboyd thank you for your detailed response! The missing implementation you identified explains the growth I was seeing. Your code snippet works as expected, and I appreciate the more realistic string generation 😃
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
How to reproduce the behaviour
According to #1424, `nlp.vocab.strings` in the snippet below should not increase in length beyond 10k. Yet when I run the code, `nlp.vocab.strings` increases without bound, well beyond 10k.

I'm running spaCy in a web app to process streaming data and have encountered a memory leak, similar to #285. I believe this issue is causing it.
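The snippet referred to above is not included in this copy of the issue. A minimal sketch of the kind of reproduction described, with the model name and text generator being assumptions, could be:

```python
import random

import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: any English model

def text_stream():
    """Simulate streaming data with pseudo-random 'words'."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    while True:
        yield " ".join(
            "".join(random.choices(letters, k=random.randint(3, 8)))
            for _ in range(20)
        )

# Expected per #1424: len(nlp.vocab.strings) stays around 10k.
# Observed: it keeps growing for as long as the stream runs.
for i, doc in enumerate(nlp.pipe(text_stream())):
    if i % 1000 == 0:
        print(i, len(nlp.vocab.strings))
```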
Your Environment