Streaming Data Memory Growth (reprise) #5083

Closed
dancsalo opened this issue Mar 2, 2020 · 3 comments
Labels
perf / memory Performance: memory use

Comments


dancsalo commented Mar 2, 2020

How to reproduce the behaviour

According to #1424, nlp.vocab.strings in the snippet below should not grow past roughly 10k entries. Yet when I run the code, nlp.vocab.strings increases without bound, well beyond 10k.

from __future__ import unicode_literals
import spacy
import numpy.random

def generate_strings():
    # Infinite stream of unique numeric strings.
    while True:
        yield str(numpy.random.random())

def main():
    nlp = spacy.blank('xx')
    nlp.tokenizer.infix_finditer = None
    for i, doc in enumerate(nlp.pipe(generate_strings())):
        if not i % 1000:
            # Per #1424, this count should plateau around 10k, but it keeps growing.
            print(i, len(nlp.vocab.strings))

if __name__ == '__main__':
    main()

I'm running spaCy in a web app to process streaming data and have encountered a memory leak, similar to #285. I believe this issue is causing it.

Your Environment

  • spaCy version: 2.2.3
  • Platform: Darwin-17.7.0-x86_64-i386-64bit
  • Python version: 3.7.5
svlandeg added the perf / memory label Mar 2, 2020
adrianeboyd (Contributor) commented Mar 4, 2020

Thanks for the report!

First, I found a memory leak due to an extraneous iterator that should be fixed by #5101.

Second, on the StringStore: you're right that the StringStore is allowed to keep growing. #1424 shows an older version of this code; here's the current version, which doesn't seem to work:

spaCy/spacy/language.py, lines 807 to 835 at 2281c47:

# Track weakrefs of "recent" documents, so that we can see when they
# expire from memory. When they do, we know we don't need old strings.
# This way, we avoid maintaining an unbounded growth in string entries
# in the string store.
recent_refs = weakref.WeakSet()
old_refs = weakref.WeakSet()
# Keep track of the original string data, so that if we flush old strings,
# we can recover the original ones. However, we only want to do this if we're
# really adding strings, to save up-front costs.
original_strings_data = None
nr_seen = 0
for doc in docs:
    yield doc
    if cleanup:
        recent_refs.add(doc)
        if nr_seen < 10000:
            old_refs.add(doc)
            nr_seen += 1
        elif len(old_refs) == 0:
            old_refs, recent_refs = recent_refs, old_refs
            if original_strings_data is None:
                original_strings_data = list(self.vocab.strings)
            else:
                keys, strings = self.vocab.strings._cleanup_stale_strings(
                    original_strings_data
                )
                self.vocab._reset_cache(keys, strings)
                self.tokenizer._reset_cache(keys)
            nr_seen = 0

This is checking every 10K docs for older strings, not limiting the size to 10K strings. (Edit: sorry, you were right, I misread!) There are a couple of problems, the main one being that self.vocab._reset_cache(keys, strings) isn't implemented, so you can't run this out of the box at all. When I tried to run it without the vocab cache resetting, it was very slow and sometimes hung, so I wouldn't recommend trying to use this right now. (I think it was implemented for a much older version of spaCy.)
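For context, the cleanup path quoted above is only reached if you opt in through the cleanup argument of nlp.pipe, which defaults to False in spaCy 2.2. A minimal sketch of how it would be invoked, shown only for illustration since the path is currently broken:

import spacy

nlp = spacy.blank('xx')
texts = (str(i) for i in range(100000))
# cleanup=True routes nlp.pipe through the stale-string logic quoted above;
# with Vocab._reset_cache missing, this is currently slow, broken, or both.
for doc in nlp.pipe(texts, cleanup=True):
    pass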

In general, the vocab size should be limited to roughly max(size of the vocab for a loaded model, 10K), but the StringStore is allowed to grow unbounded, since it's impossible to know which docs processed with this pipeline you may have saved and will want to inspect in the future.

If you know for sure that you won't need to inspect older docs, you can periodically reduce the StringStore back to the minimum, which is: all strings required by the model + all items in the vocab. Here's an updated version of the script that keeps track of the minimum strings, with slightly more realistic test documents:

import spacy
import random
import string

def generate_strings(as_tuples=False):
    while True:
        # 20-character strings of letters, digits, and frequent spaces,
        # so most "texts" contain several short tokens.
        s = ''.join([random.choice("          " + string.ascii_letters + string.digits) for n in range(20)])
        yield s

def main():
    nlp = spacy.blank('xx')
    # Strings the pipeline always needs: everything currently in the StringStore
    # plus the orth strings of all lexemes in the vocab.
    minimal_strings = set(nlp.vocab.strings) | set([nlp.vocab.strings[lex.orth] for lex in nlp.vocab])
    current_vocab_size = len(nlp.vocab)
    for i, doc in enumerate(nlp.pipe(generate_strings())):
        if not i % 10000:
            print(i, len(nlp.vocab), len(nlp.vocab.strings), doc.text)
            if len(nlp.vocab) > current_vocab_size:
                # The vocab picked up new lexemes; keep their strings across the reset.
                minimal_strings.update([nlp.vocab.strings[lex.orth] for lex in nlp.vocab])
                current_vocab_size = len(nlp.vocab)

            # Drop everything else from the StringStore.
            nlp.vocab.strings._reset_and_load(minimal_strings)

if __name__ == '__main__':
    main()

You will see that the vocab grows slightly beyond 10K because the vocab currently allows very short lexemes to be added at any point, but this levels off relatively quickly.

You can easily save an intermediate StringStore to disk if you want to be able to process older docs; it's really nothing more than a list of strings. Or you can decide which strings count as stale for your task and remove only those from the StringStore, rather than always stripping it down to the minimally required strings. The basic idea in #1424 is reasonable, so it may be possible to fix it without too much effort, but I haven't looked into the details.
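As a rough sketch of the save-and-restore idea (assuming StringStore.to_disk and from_disk; the file name here is just an example):

import spacy
from spacy.strings import StringStore

nlp = spacy.blank('xx')

# Snapshot the current string table before pruning it, so docs processed
# up to this point stay inspectable later.
nlp.vocab.strings.to_disk('strings_snapshot.json')

# ... prune with nlp.vocab.strings._reset_and_load(minimal_strings) as above ...

# Later, merge the snapshot back in before revisiting an older doc:
saved_strings = StringStore().from_disk('strings_snapshot.json')
for s in saved_strings:
    nlp.vocab.strings.add(s)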

Also be aware that this will still have memory problems with multiprocessing and an infinite text generator. I'm not quite sure why yet, although it may also be related to the use of itertools.tee(). I think if you call .pipe() with n_process >= 1 on finite batches, you won't run into this problem.
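For example, a workaround along those lines might look like the sketch below; the batch size and n_process values are arbitrary, and the point is only that each call to nlp.pipe sees a finite list rather than an endless generator:

import itertools
import random
import string
import spacy

def generate_strings():
    while True:
        yield ''.join(random.choice(' ' + string.ascii_letters + string.digits) for n in range(20))

def finite_batches(iterable, batch_size=10000):
    # Slice an endless iterator into finite lists that nlp.pipe can consume.
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

nlp = spacy.blank('xx')
for batch in finite_batches(generate_strings()):
    for doc in nlp.pipe(batch, n_process=2):
        pass  # process doc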

dancsalo (Author) commented

@adrianeboyd thank you for your detailed response!

The missing implementation of self.vocab._reset_cache(keys, strings) explains my confusion. I'm also happy to know that nlp.vocab.strings._reset_and_load() exists; I had imagined a function like that existed, and it should do the trick for our use case. We are using spaCy in a streaming fashion and are not concerned with processing older docs.

Your code snippet works as expected, and I appreciate the more realistic string generation 😃


lock bot commented Apr 15, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Apr 15, 2020