Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to add ability to clean up StringStore in pipe #1552

Merged
merged 6 commits into from
Nov 11, 2017

Conversation

rdmrcv
Copy link
Contributor

@rdmrcv rdmrcv commented Nov 11, 2017

Description

I try to fix #1506 by allowing StringStore to clean up old strings when using it to iterate through many documents in a pipe. I try to do it with little effort on memory or speed, but I might not see all effect of my changes.
Because that method is written as the feature only for pipe method of the Language class — the method marked as internal by the underscore.

Types of change

It's bug fix.

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@explosion-bot
Copy link
Collaborator

Hi @ligser, thanks for your pull request! 👍 It looks like you haven't filled in the spaCy Contributor Agreement (SCA) yet. The agrement ensures that we can use your contribution across the project. Once you've filled in the template, put it in the .github/contributors directory and add it to this pull request. If your pull request targets a branch that's not master, for example develop, make sure to submit the Contributor Agreement to the master branch. Thanks a lot!

@honnibal
Copy link
Member

This looks very promising. Thank you for taking the time to do this efficiently.

@honnibal
Copy link
Member

Very good, thanks!

@honnibal honnibal merged commit 86d3730 into explosion:master Nov 11, 2017
@ines ines added the bug Bugs and behaviour differing from documentation label Nov 12, 2017
@rdmrcv
Copy link
Contributor Author

rdmrcv commented Nov 14, 2017

@honnibal I realize that here I broke swapping of recent and old refs. I afraid that change can cause the error: if nr_seen raise upper than 10000 and old_refs is not empty at some iterations we can lose some of the refs that not recorded to old_refs. When the old_refs become empty we initiate clean-up and remove strings from N-1th iteration (because of hits — N iteration still here). If one of the docs that we lose from old_refs still not cleaned-up after N+1th iteration and we try to get string data from it — we should get the crash.
I try to create the test case for that — but cannot reproduce that behavior. Maybe I miss something?

@rdmrcv
Copy link
Contributor Author

rdmrcv commented Nov 14, 2017

I understand. My fix not cleans up strings — just keys. 😐

@rdmrcv
Copy link
Contributor Author

rdmrcv commented Nov 14, 2017

I made a new pull request to fix it. Sorry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bugs and behaviour differing from documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KeyError during stream processing
4 participants