
KeyError during stream processing #1506

Closed · honnibal opened this issue Nov 7, 2017 · 13 comments · Fixed by #1552
Labels
bug Bugs and behaviour differing from documentation

Comments

@honnibal
Member

honnibal commented Nov 7, 2017

@ligser writes on Gitter:

Hello all! I'm having some trouble with spacy-nightly==2.0.0rc1 (a18 shows the same behaviour) and the en_core_web_lg model. When I run nlp.pipe with a generator of texts, I get the exception KeyError: 4405041669077156115.
The exception is raised after a certain number of texts (around 10,000 on average).
The stacktrace looks like this:
    nlp.pipe((c.content_text for c in texts), batch_size=24, n_threads=8)
  File "doc.pyx", line 375, in spacy.tokens.doc.Doc.text.__get__
  File "doc.pyx", line 232, in __iter__ 
  File "token.pyx", line 178, in spacy.tokens.token.Token.text_with_ws.__get__
  File "strings.pyx", line 116, in spacy.strings.StringStore.__getitem__
KeyError: 4405041669077156115
That looks like a bug in the StringStore cleanup or something related (maybe a shared string store being cleaned up by one of the threads?).
My code just fetches texts from MySQL, splits them into texts and ids, and does: for id, doc in zip(ids_gen, nlp.pipe(docs_gen, ...)).

I think this is likely due to the solution added in spaCy 2 to address memory growth when streaming data.
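
Roughly, that solution backs up the StringStore, tracks the strings seen in recent docs, and periodically resets the store to the backup plus the recent strings once the old docs have been garbage-collected. Below is a simplified sketch of the idea, not the exact language.py code; the helper name pipe_with_cleanup is made up, and only the raw word texts are tracked as "recent", which is the detail that matters for this bug.

import weakref

def pipe_with_cleanup(nlp, texts, cleanup_period=10000):
    # Weak references: a doc drops out of these sets once the caller stops
    # holding it and it gets garbage-collected.
    recent_refs, old_refs = weakref.WeakSet(), weakref.WeakSet()
    original_strings = list(nlp.vocab.strings)
    recent_strings = set()
    nr_seen = 0
    for text in texts:
        doc = nlp(text)
        yield doc
        recent_refs.add(doc)
        for word in doc:
            # Only the word text is tracked here, not lemmas or other
            # strings created later in the pipeline.
            recent_strings.add(word.text)
        if nr_seen < cleanup_period:
            old_refs.add(doc)
            nr_seen += 1
        elif len(old_refs) == 0:
            # All docs from the "old" generation have expired, so any string
            # that is neither original nor recently seen should be
            # unreachable. Rebuild the store without them.
            old_refs, recent_refs = recent_refs, old_refs
            nlp.vocab.strings._reset_and_load(original_strings + list(recent_strings))
            recent_strings = set()
            nr_seen = 0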

@honnibal honnibal added the bug Bugs and behaviour differing from documentation label Nov 7, 2017
@rdmrcv
Contributor

rdmrcv commented Nov 8, 2017

I ran some experiments and reproduced a similar error as a test case. I'm very new to Python, and the code looks ugly and runs slowly, but it produces an error that looks the same (though with a different stacktrace):

# coding: utf8
from __future__ import unicode_literals

import random
import string

from ...lang.en import English


def test_issue1506():
    nlp = English()

    def random_string_generator(string_length, limit):
        # Yield `limit` random strings of `string_length` characters each.
        for _ in range(limit):
            yield ''.join(
                random.choice(string.digits + string.ascii_letters + '. ')
                for _ in range(string_length))

    # Stream enough texts through nlp.pipe to trigger the periodic
    # StringStore cleanup, then force each Doc's text to be resolved.
    for i, d in enumerate(nlp.pipe(random_string_generator(600, 20007))):
        str(d.text)

Info about spaCy

  • spaCy version: 2.0.2.dev0
  • Platform: Darwin-17.2.0-x86_64-i386-64bit
  • Python version: 3.6.2

Stacktrace:

spacy/language.py:554: in pipe
    for doc in docs:
spacy/language.py:534: in <genexpr>
    docs = (self.make_doc(text) for text in texts)
spacy/language.py:357: in make_doc
    return self.tokenizer(text)
tokenizer.pyx:106: in spacy.tokenizer.Tokenizer.__call__
    ???
tokenizer.pyx:156: in spacy.tokenizer.Tokenizer._tokenize
    ???
tokenizer.pyx:235: in spacy.tokenizer.Tokenizer._attach_tokens
    ???
doc.pyx:547: in spacy.tokens.doc.Doc.push_back
    ???
morphology.pyx:81: in spacy.morphology.Morphology.assign_untagged
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   KeyError: 10868232842057966403

strings.pyx:116: KeyError

@rdmrcv
Contributor

rdmrcv commented Nov 8, 2017

Also, when I revert the language.py changes from PR #1424, that test goes green.

@andharris

I'm having a similar issue when accessing token.lemma_ for some tokens.

@ines
Member

ines commented Nov 9, 2017

@andharris Do you still have an example text by any chance?

@andharris

@ines Here's a reproducible script:

import spacy
import thinc.extra.datasets


def main():
    nlp = spacy.blank('en')
    data, _ = thinc.extra.datasets.imdb()
    corpus = (i[0] for i in data)
    docs = nlp.pipe(corpus)
    lemmas = [[token.lemma_ for token in doc] for doc in docs]
    print("Parsed lemmas for {} docs in corpus".format(len(lemmas)))


if __name__ == '__main__':
    main()

Info:

  • spacy: 2.0.2
  • python: 3.6.2

Stacktrace:

Traceback (most recent call last):
  File "spacy_bug.py", line 15, in <module>
    main()
  File "spacy_bug.py", line 10, in main
    lemmas = [[token.lemma_ for token in doc] for doc in docs]
  File "spacy_bug.py", line 10, in <listcomp>
    lemmas = [[token.lemma_ for token in doc] for doc in docs]
  File ".../venv/lib/python3.6/site-packages/spacy/language.py", line 554, in pipe
    for doc in docs:
  File ".../venv/lib/python3.6/site-packages/spacy/language.py", line 534, in <genexpr>
    docs = (self.make_doc(text) for text in texts)
  File ".../venv/lib/python3.6/site-packages/spacy/language.py", line 357, in make_doc
    return self.tokenizer(text)
  File "tokenizer.pyx", line 106, in spacy.tokenizer.Tokenizer.__call__
  File "tokenizer.pyx", line 156, in spacy.tokenizer.Tokenizer._tokenize
  File "tokenizer.pyx", line 235, in spacy.tokenizer.Tokenizer._attach_tokens
  File "doc.pyx", line 547, in spacy.tokens.doc.Doc.push_back
  File "morphology.pyx", line 81, in spacy.morphology.Morphology.assign_untagged
  File "strings.pyx", line 116, in spacy.strings.StringStore.__getitem__
KeyError: 5846064049184721376

Interestingly, if I replace docs = nlp.pipe(corpus) with docs = (nlp.tokenizer(doc) for doc in corpus), I no longer get the error. Not sure why this works while the other fails, though.

@rdmrcv
Contributor

rdmrcv commented Nov 10, 2017

(nlp.tokenizer(doc) for doc in corpus) doesn't clean up the StringStore; it looks like that cleanup is what causes the error.

@rdmrcv
Contributor

rdmrcv commented Nov 10, 2017

Also, I tried to work around this case, and it looks like it's working...

        original_strings_data = self.vocab.strings.to_bytes()
        nr_seen = 0
        for doc in docs:
            yield doc
            recent_refs.add(doc)
            if nr_seen < 10000:
                old_refs.add(doc)
                nr_seen += 1
            elif len(old_refs) == 0:
                # All the docs in the 'old' set have expired, so the only
                # difference between the backup strings and the current
                # string-store should be obsolete. We therefore swap out the
                # old strings data.
                old_refs, recent_refs = recent_refs, old_refs
                tmp = self.vocab.strings.to_bytes()
                self.vocab.strings.from_bytes(original_strings_data)
                original_strings_data = tmp
                nr_seen = 0

I tried not to track strings manually, and instead swap the whole store using its low-level byte data.

Maybe this:

for word in doc:
    recent_strings.add(word.text)

doesn't track all strings? (It looks like it doesn't track lemmas at all.)

Or maybe I'm missing something in my code, and it just never cleans up?

@honnibal
Member Author

@ligser : You're exactly right. It's not adding the lemmas or other new strings --- just the word text.

Periodically we need to do:

current = original + recent

Currently we're getting recent by just tracking doc.text. It might be best to add something to the StringStore, but I'm worried that this adds more state that can be lost in serialisation, causing confusing results.

What if we had:

recent = current - previous
current, previous = (original + recent), current

This seems like it should work.

@rdmrcv
Contributor

rdmrcv commented Nov 10, 2017

If I understand you correctly, this code does that:

        origin_strings = list(self.vocab.strings)
        previous_strings = list()
        nr_seen = 0
        for doc in docs:
            yield doc
            recent_refs.add(doc)
            if nr_seen < 10000:
                old_refs.add(doc)
                nr_seen += 1
            elif len(old_refs) == 0:
                # All the docs in the 'old' set have expired, so the only
                # difference between the backup strings and the current
                # string-store should be obsolete. We therefore swap out the
                # old strings data.
                old_refs, recent_refs = recent_refs, old_refs
                current_strings = list(self.vocab.strings)
                recent_strings = [item for item in current_strings if item not in previous_strings]
                self.vocab.strings._reset_and_load(recent_strings + origin_strings)
                previous_strings = current_strings
                nr_seen = 0

@rdmrcv
Contributor

rdmrcv commented Nov 10, 2017

But that doesn't work.

If I subtract previous from current, I lose strings that are present in both (created during the previous iteration and still used in the current one), but those shouldn't be dropped. I'll try to think a bit more about which strings can safely be wiped out; it looks like I took the idea too literally.
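
To make the problem concrete, here is a tiny illustration with plain Python sets (the string names are made up):

# Hypothetical example: "lemma_a" was created during the previous batch and
# is reused by a doc in the current batch, so it needs to survive the cleanup.
original = {"the", "dog"}
previous = {"the", "dog", "lemma_a"}            # store contents after batch N-1
current = {"the", "dog", "lemma_a", "lemma_b"}  # store contents after batch N

recent = current - previous    # {"lemma_b"}: "lemma_a" is not counted as recent
new_store = original | recent  # {"the", "dog", "lemma_b"}
# Any doc from batch N that still holds the hash of "lemma_a" now hits a KeyError.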

@rdmrcv
Contributor

rdmrcv commented Nov 10, 2017

It looks like at that level, pipe simply cannot decide which strings are fresh and which are obsolete.
In your solution you try to track the truly recent strings. The only problem there is that the list of tracked words is incomplete, because of lemmas and other additions to the StringStore. That solution can work if you know how to track all of the strings.
My solution with the tmp variable just never performed any real cleanup: in the Nth iteration I effectively work with the (N-1)th string store, and it only works by luck.

@rdmrcv
Contributor

rdmrcv commented Nov 10, 2017

I think another version of the StringStore class could be used (a PipedStringStore, for example) that holds two different stores, "old" and "new", and knows about iterations.
When a new iteration starts, it swaps the stores and clears the new one. When a lookup hits the PipedStringStore, it first tries the "new" store; if the key doesn't exist there, it falls back to "old" and copies the value from "old" into "new".
At the end of an iteration it discards the "old" store, along with any values that weren't used during that iteration.
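
A minimal sketch of that idea in plain Python (PipedStringStore and its method names are hypothetical, not an existing spaCy API, and plain hash() only stands in for the 64-bit hashes the real StringStore uses):

class PipedStringStore(object):
    # Two generations of strings: lookups hit "new" first, fall back to "old"
    # and copy the entry forward; rotating generations discards whatever was
    # not used since the previous rotation.

    def __init__(self, strings=()):
        self.old = {}
        self.new = {hash(s): s for s in strings}

    def add(self, string):
        key = hash(string)
        self.new[key] = string
        return key

    def __getitem__(self, key):
        if key in self.new:
            return self.new[key]
        # Fall back to the old generation and copy the entry into "new" so
        # it survives the next rotation.
        string = self.old[key]  # raises KeyError only if truly unknown
        self.new[key] = string
        return string

    def rotate(self):
        # Begin a new iteration: strings not touched since the last rotation
        # stay behind and are dropped here.
        self.old, self.new = self.new, {}

One wrinkle: the original vocab strings would need to be pinned (or re-added on every rotation), otherwise they would eventually rotate out as well.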

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018