Streaming Data Memory Growth (reprise) #5083

Closed
dancsalo opened this issue Mar 2, 2020 · 3 comments
Labels
perf / memory Performance: memory use

Comments


dancsalo commented Mar 2, 2020

How to reproduce the behaviour

According to #1424, nlp.vocab.strings in the snippet below should not grow past roughly 10k entries. Yet when I run the code, nlp.vocab.strings increases without bound, well beyond 10k.

from __future__ import unicode_literals
import spacy
import numpy.random

def generate_strings():
    # Infinite stream of unique numeric strings.
    while True:
        yield str(numpy.random.random())

def main():
    nlp = spacy.blank('xx')
    nlp.tokenizer.infix_finditer = None
    for i, doc in enumerate(nlp.pipe(generate_strings())):
        if not i % 1000:
            # Per #1424, this count should plateau around 10k, but it keeps growing.
            print(i, len(nlp.vocab.strings))

if __name__ == '__main__':
    main()

I'm running spaCy in a web app to process streaming data and have encountered a memory leak, similar to #285. I believe this issue is causing it.

Your Environment

  • spaCy version: 2.2.3
  • Platform: Darwin-17.7.0-x86_64-i386-64bit
  • Python version: 3.7.5
svlandeg added the perf / memory label Mar 2, 2020
adrianeboyd (Contributor) commented Mar 4, 2020

Thanks for the report!

First, I found a memory leak due to an extraneous iterator that should be fixed by #5101.

Second, on the StringStore: you're right that the StringStore is allowed to keep growing. #1424 shows an older version of this code; here's the current version, which doesn't seem to work:

spaCy/spacy/language.py, lines 807 to 835 at 2281c47:

# Track weakrefs of "recent" documents, so that we can see when they
# expire from memory. When they do, we know we don't need old strings.
# This way, we avoid maintaining an unbounded growth in string entries
# in the string store.
recent_refs = weakref.WeakSet()
old_refs = weakref.WeakSet()
# Keep track of the original string data, so that if we flush old strings,
# we can recover the original ones. However, we only want to do this if we're
# really adding strings, to save up-front costs.
original_strings_data = None
nr_seen = 0
for doc in docs:
    yield doc
    if cleanup:
        recent_refs.add(doc)
        if nr_seen < 10000:
            old_refs.add(doc)
            nr_seen += 1
        elif len(old_refs) == 0:
            old_refs, recent_refs = recent_refs, old_refs
            if original_strings_data is None:
                original_strings_data = list(self.vocab.strings)
            else:
                keys, strings = self.vocab.strings._cleanup_stale_strings(
                    original_strings_data
                )
                self.vocab._reset_cache(keys, strings)
                self.tokenizer._reset_cache(keys)
            nr_seen = 0

This is checking every 10K docs for older strings, not limiting the size to 10K strings. (Edit: sorry, you were right, I misread!) There are a couple of problems, the main one being that self.vocab._reset_cache(keys, strings) isn't implemented, so you can't run this out of the box at all. When I tried to run it without the vocab cache resetting, it was very slow and sometimes hung, so I wouldn't recommend trying to use this right now. (I think it was implemented for a much older version of spaCy.)
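For context, the cleanup path quoted above is only reached if you opt in through the cleanup argument of nlp.pipe, which defaults to False in spaCy 2.2. A minimal sketch of how it would be invoked, shown only for illustration since the path is currently broken:

import spacy

nlp = spacy.blank('xx')
texts = (str(i) for i in range(100000))
# cleanup=True routes nlp.pipe through the stale-string logic quoted above;
# with Vocab._reset_cache missing, this is currently slow, broken, or both.
for doc in nlp.pipe(texts, cleanup=True):
    pass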

In general, the vocab size should be limited to roughly max(size of the vocab for a loaded model, 10K), but the StringStore is allowed to grow unbounded, since it's impossible to know which docs processed with this pipeline you may have saved and will want to inspect in the future.

If you know for sure that you won't need to inspect older docs, you can periodically reduce the StringStore back to the minimum, which is: all strings required by the model + all items in the vocab. Here's an updated version of the script that keeps track of the minimum strings, with slightly more realistic test documents:

import spacy
import random
import string

def generate_strings(as_tuples=False):
    while True:
        # 20-character strings of letters, digits, and frequent spaces,
        # so most "texts" contain several short tokens.
        s = ''.join([random.choice("          " + string.ascii_letters + string.digits) for n in range(20)])
        yield s

def main():
    nlp = spacy.blank('xx')
    # Strings the pipeline always needs: everything currently in the StringStore
    # plus the orth strings of all lexemes in the vocab.
    minimal_strings = set(nlp.vocab.strings) | set([nlp.vocab.strings[lex.orth] for lex in nlp.vocab])
    current_vocab_size = len(nlp.vocab)
    for i, doc in enumerate(nlp.pipe(generate_strings())):
        if not i % 10000:
            print(i, len(nlp.vocab), len(nlp.vocab.strings), doc.text)
            if len(nlp.vocab) > current_vocab_size:
                # The vocab picked up new lexemes; keep their strings across the reset.
                minimal_strings.update([nlp.vocab.strings[lex.orth] for lex in nlp.vocab])
                current_vocab_size = len(nlp.vocab)

            # Drop everything else from the StringStore.
            nlp.vocab.strings._reset_and_load(minimal_strings)

if __name__ == '__main__':
    main()

You will see that the vocab grows slightly beyond 10K because the vocab currently allows very short lexemes to be added at any point, but this levels off relatively quickly.

You can easily save an intermediate StringStore to disk if you want to be able to process older docs; it's really nothing more than a list of strings. Or you can decide which strings count as stale for your task and remove only those from the StringStore, rather than always stripping it down to the minimally required strings. The basic idea in #1424 is reasonable, so it may be possible to fix it without too much effort, but I haven't looked into the details.
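As a rough sketch of the save-and-restore idea (assuming StringStore.to_disk and from_disk; the file name here is just an example):

import spacy
from spacy.strings import StringStore

nlp = spacy.blank('xx')

# Snapshot the current string table before pruning it, so docs processed
# up to this point stay inspectable later.
nlp.vocab.strings.to_disk('strings_snapshot.json')

# ... prune with nlp.vocab.strings._reset_and_load(minimal_strings) as above ...

# Later, merge the snapshot back in before revisiting an older doc:
saved_strings = StringStore().from_disk('strings_snapshot.json')
for s in saved_strings:
    nlp.vocab.strings.add(s)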

Also be aware that this will still have memory problems with multiprocessing and an infinite text generator. I'm not quite sure why yet, although it may also be related to the use of itertools.tee(). I think if you call .pipe() with n_process >= 1 on finite batches, you won't run into this problem.
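For example, a workaround along those lines might look like the sketch below; the batch size and n_process values are arbitrary, and the point is only that each call to nlp.pipe sees a finite list rather than an endless generator:

import itertools
import random
import string
import spacy

def generate_strings():
    while True:
        yield ''.join(random.choice(' ' + string.ascii_letters + string.digits) for n in range(20))

def finite_batches(iterable, batch_size=10000):
    # Slice an endless iterator into finite lists that nlp.pipe can consume.
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

nlp = spacy.blank('xx')
for batch in finite_batches(generate_strings()):
    for doc in nlp.pipe(batch, n_process=2):
        pass  # process doc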

dancsalo (Author) commented

@adrianeboyd thank you for your detailed response!

The missing implementation of self.vocab._reset_cache(keys, strings) explains my confusion. I'm also happy to know that nlp.vocab.strings._reset_and_load() exists; I had imagined a function like that existed, and it should do the trick for our use case. We are using spaCy in a streaming fashion and are not concerned with processing older docs.

Your code snippet works as expected, and I appreciate the more realistic string generation 😃


lock bot commented Apr 15, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Apr 15, 2020