KeyError during stream processing #1506
Comments
I ran some experiments and reproduced a similar error in a test case. I'm very new to Python, so the code is ugly and slow, but it triggers an error that looks the same (with a different stack trace):
Info about spaCy
Stacktrace:
Also when I revert from the |
I'm having a similar issue when accessing |
@andharris Do you still have an example text by any chance?
@ines Here's a reproducible script:

    import spacy
    import thinc.extra.datasets


    def main():
        nlp = spacy.blank('en')
        data, _ = thinc.extra.datasets.imdb()
        corpus = (i[0] for i in data)
        docs = nlp.pipe(corpus)
        lemmas = [[token.lemma_ for token in doc] for doc in docs]
        print("Parsed lemmas for {} docs in corpus".format(len(lemmas)))


    if __name__ == '__main__':
        main()

Info:
Stacktrace:
Interestingly if I replace |
I also tried to work around this case, and it looks like it is working...
I tried not to track strings manually and instead just sweep them using the low-level data. Maybe that:
does not track all strings? (It looks like it does not track lemmas at all.) Or maybe there is something wrong with my code that I did not see, and it just does not clean up?
@ligser: You're exactly right. It's not adding the lemmas or other new strings, just the word text. Periodically we need to do:
Currently we're getting What if we had:
This seems like it should work.
If I understand you correctly, this code does that:
But that does not work. If I subtract the previous set from the current one, I lose strings that are present in both (created in the previous batch and still used in the current one), which I shouldn't do. I'll try to think a little more about which strings can be wiped out; it looks like I took things too literally.
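To make the problem with the subtraction concrete, here is a minimal sketch using plain Python sets. The two sets are made-up stand-ins for the strings seen in two consecutive batches; they are an assumption for illustration and not the real contents of spaCy's StringStore.

```python
# Hypothetical string sets for two consecutive batches (illustration only,
# not spaCy's actual StringStore contents).
previous_batch = {"the", "movie", "be", "good"}
current_batch = {"the", "film", "be", "bad"}

# Naive idea: keep only the strings that are new in the current batch.
naive_keep = current_batch - previous_batch

# Strings the current batch still needs but the naive subtraction throws away.
lost = current_batch - naive_keep

print(sorted(naive_keep))  # ['bad', 'film']
print(sorted(lost))        # ['be', 'the']
```

The strings in `lost` were created in the previous batch and are still used in the current one, so looking them up after the cleanup would fail.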
It looks like, at that level, pipe simply cannot decide which strings are fresh and which are obsolete.
I think another version of the StringStore class could be used (PipedStringStore, for example) that holds two separate stores, «old» and «new», and knows about iterations.
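A rough sketch of how such a two-store design could behave, in pure Python. Everything here is hypothetical: the class name `TwoGenerationStore`, its methods, and the use of Python's built-in `hash` as the key are assumptions for illustration, and none of it is spaCy's real StringStore API.

```python
class TwoGenerationStore:
    """Toy two-generation string table (hypothetical, not a spaCy class)."""

    def __init__(self):
        self.old = {}  # strings last touched during the previous iteration
        self.new = {}  # strings added or touched during the current iteration

    def add(self, string):
        # Python's built-in hash stands in for a real string hash here.
        key = hash(string)
        self.new[key] = string
        return key

    def __getitem__(self, key):
        if key in self.new:
            return self.new[key]
        # A hit in the old generation means the string is still in use,
        # so promote it. A miss raises KeyError: the string was dropped.
        string = self.old[key]
        self.new[key] = string
        return string

    def next_iteration(self):
        # Whatever was not touched during the last iteration is discarded.
        self.old = self.new
        self.new = {}


store = TwoGenerationStore()
key_used = store.add("run")
key_stale = store.add("walk")

store.next_iteration()
store[key_used]          # touched again, so it is promoted to the new store
store.next_iteration()

print(store[key_used])   # still available
try:
    store[key_stale]
    stale_dropped = False
except KeyError:
    stale_dropped = True
print(stale_dropped)
```

With this scheme a string survives as long as it is touched at least once per iteration, which avoids the "present in both" problem of a plain subtraction. The cost is that up to two iterations' worth of strings stay in memory at once.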
@ligser writes on Gitter:
I think this is likely due to the solution added in spaCy 2 to address the streaming data memory growth.