
💫 Fix streaming data memory growth (!!) #1424

Merged
honnibal merged 9 commits into develop from feature/streaming-data-memory-growth on Oct 16, 2017

Conversation

@honnibal (Member) commented Oct 16, 2017

This patch follows through on the changes to the StringStore introduced in v2, and fixes by far the longest-standing open bug: #285 Streaming Data Memory Growth

The solution implemented allows a weakref to be created on Doc objects, so that we can tell when it's safe to flush out old strings. A rolling buffer is created using two StringStore objects. One StringStore contains the set of original strings plus the strings from recent documents. The other is the store currently attached to the Vocab object. The only difference between the two StringStore objects is in the strings created by increasingly old documents. Once all of those old documents have passed out of scope and been deallocated, we know it's safe to flush the obsolete strings.
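
To make the mechanism concrete, here is a minimal, self-contained sketch of the rolling-buffer idea, using plain Python sets in place of the StringStore. The class name RollingStringBuffer and its methods are hypothetical illustrations, not the names used in this patch:

import weakref

class RollingStringBuffer(object):
    """Sketch of the two-store rolling buffer. `original` holds the
    strings we always keep; weakrefs track the Docs handed out since
    the last flush. Once every one of them has been garbage-collected,
    strings added in that window can no longer be in use."""

    def __init__(self, original_strings):
        self.original = set(original_strings)
        self.current = set(original_strings)  # stands in for the Vocab's StringStore
        self.doc_refs = []  # weak references to recently produced Docs

    def add(self, string, doc):
        self.current.add(string)
        self.doc_refs.append(weakref.ref(doc))

    def maybe_flush(self):
        # A dead weakref returns None when called. If every recent Doc
        # is dead, reset the current store to the original strings.
        if all(ref() is None for ref in self.doc_refs):
            self.current = set(self.original)
            self.doc_refs = []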

This logic is only applied in the Language.pipe() method, and requires no API changes. However, it does require breaking the former reference cycle between the Doc and Token objects, which had been created by caching the Token instances within the Doc. I think breaking this cycle is a Good Thing, independent of the StringStore changes. We don't need to cache those objects --- almost nothing is done in Token.__init__, so there should be no problem with recreating the object. If we like, we could use a weakref on Token, but it's simpler to just not cache.
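
As a quick illustration of the no-caching behaviour (assuming a spaCy build that includes this patch), indexing the same position twice yields two distinct Token objects, so no Token-to-Doc cycle is ever created:

import spacy

nlp = spacy.blank('xx')
doc = nlp('hello world')

# With the Token cache removed, each indexing operation constructs a
# fresh Token wrapper over the underlying C data, so object identity
# differs while the token contents stay equal.
assert doc[0] is not doc[0]
assert doc[0].text == doc[0].text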

The resolution was checked with this script, which verifies that the string store does not grow as data streams through:

from __future__ import unicode_literals
import spacy
import numpy.random

def generate_strings():
    # An endless stream of unique strings: without the fix, every one of
    # these would be interned in the StringStore forever.
    while True:
        yield str(numpy.random.random())

def main():
    nlp = spacy.blank('xx')
    # Disable infix splitting, so the random numbers aren't broken up
    # into extra tokens.
    nlp.tokenizer.infix_finditer = None
    for i, doc in enumerate(nlp.pipe(generate_strings())):
        if not i % 1000:
            # With the fix, this count should stay bounded rather than
            # growing with i.
            print(i, len(nlp.vocab.strings))

if __name__ == '__main__':
    main()

Types of change:

  • Bug fix (non-breaking change fixing an issue)

Checklist:

  • My change requires a change to spaCy's documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@honnibal changed the title from "Fix streaming data memory growth (!!)" to "💫 Fix streaming data memory growth (!!)" on Oct 16, 2017
@ines added the bug and 🌙 nightly labels on Oct 16, 2017
@honnibal merged commit fc797a5 into develop on Oct 16, 2017
@ines deleted the feature/streaming-data-memory-growth branch on Oct 25, 2017
@ashaffer commented

Hey, thanks for fixing this. Is this going to be backported into v1 though?
