
💫 Fix streaming data memory growth (!!) #1424

Merged
honnibal merged 9 commits into develop from feature/streaming-data-memory-growth on Oct 16, 2017

Conversation

@honnibal (Member) commented Oct 16, 2017

This patch follows through on the changes to the StringStore introduced in v2, and fixes by far the longest-standing open bug: #285 Streaming Data Memory Growth

The solution implemented allows a weakref to be created on Doc objects, so that we can tell when it's safe to flush out old strings. A rolling buffer is created using two StringStore objects. One StringStore contains the set of original strings plus the strings from recent documents. The other is the store currently attached to the Vocab object. The only difference between the two StringStore objects is in the strings created by increasingly old documents. Once all of those old documents have passed out of scope and been deallocated, we know it's safe to flush the obsolete strings.
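
To make the mechanism concrete, here is a minimal, self-contained sketch of the rolling-buffer idea, using plain Python sets in place of the StringStore. The class name RollingStringBuffer and its methods are hypothetical illustrations, not the names used in this patch:

import weakref

class RollingStringBuffer(object):
    """Sketch of the two-store rolling buffer. `original` holds the
    strings we always keep; weakrefs track the Docs handed out since
    the last flush. Once every one of them has been garbage-collected,
    strings added in that window can no longer be in use."""

    def __init__(self, original_strings):
        self.original = set(original_strings)
        self.current = set(original_strings)  # stands in for the Vocab's StringStore
        self.doc_refs = []  # weak references to recently produced Docs

    def add(self, string, doc):
        self.current.add(string)
        self.doc_refs.append(weakref.ref(doc))

    def maybe_flush(self):
        # A dead weakref returns None when called. If every recent Doc
        # is dead, reset the current store to the original strings.
        if all(ref() is None for ref in self.doc_refs):
            self.current = set(self.original)
            self.doc_refs = []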

This logic is only applied in the Language.pipe() method, and requires no API changes. However, it does require breaking the former reference cycle between the Doc and Token objects, which had been created by caching the Token instances within the Doc. I think breaking this cycle is a Good Thing, independent of the StringStore changes. We don't need to cache those objects --- almost nothing is done in Token.__init__, so there should be no problem with recreating the object. If we like, we could use a weakref on Token, but it's simpler to just not cache.
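
As a quick illustration of the no-caching behaviour (assuming a spaCy build that includes this patch), indexing the same position twice yields two distinct Token objects, so no Token-to-Doc cycle is ever created:

import spacy

nlp = spacy.blank('xx')
doc = nlp('hello world')

# With the Token cache removed, each indexing operation constructs a
# fresh Token wrapper over the underlying C data, so object identity
# differs while the token contents stay equal.
assert doc[0] is not doc[0]
assert doc[0].text == doc[0].text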

The resolution was checked with this script, which verifies that the string store does not grow as data streams through:

from __future__ import unicode_literals
import spacy
import numpy.random

def generate_strings():
    # An endless stream of unique strings: without the fix, every one of
    # these would be interned in the StringStore forever.
    while True:
        yield str(numpy.random.random())

def main():
    nlp = spacy.blank('xx')
    # Disable infix splitting, so the random numbers aren't broken up
    # into extra tokens.
    nlp.tokenizer.infix_finditer = None
    for i, doc in enumerate(nlp.pipe(generate_strings())):
        if not i % 1000:
            # With the fix, this count should stay bounded rather than
            # growing with i.
            print(i, len(nlp.vocab.strings))

if __name__ == '__main__':
    main()

Types of change:

  • Bug fix (non-breaking change fixing an issue)

Checklist:

  • My change requires a change to spaCy's documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@honnibal changed the title from "Fix streaming data memory growth (!!)" to "💫 Fix streaming data memory growth (!!)" on Oct 16, 2017
@ines added the bug and 🌙 nightly labels on Oct 16, 2017
@honnibal merged commit fc797a5 into develop on Oct 16, 2017
@ines deleted the feature/streaming-data-memory-growth branch on Oct 25, 2017
@ashaffer commented

Hey, thanks for fixing this. Is this going to be backported into v1 though?
