
KeyError during stream processing #1506

Closed · honnibal opened this issue Nov 7, 2017 · 13 comments · Fixed by #1552
Labels
bug Bugs and behaviour differing from documentation

Comments

@honnibal
Member

honnibal commented Nov 7, 2017

@ligser writes on Gitter:

Hello all! I'm having some trouble with spacy-nightly==2.0.0rc1 (a18 shows the same behaviour) and the en_core_web_lg model. When I run nlp.pipe with a generator of texts, I get the exception KeyError: 4405041669077156115.
The exception is raised after a certain number of texts (around 10,000 on average).
The stacktrace looks like this:
    nlp.pipe((c.content_text for c in texts), batch_size=24, n_threads=8)
  File "doc.pyx", line 375, in spacy.tokens.doc.Doc.text.__get__
  File "doc.pyx", line 232, in __iter__ 
  File "token.pyx", line 178, in spacy.tokens.token.Token.text_with_ws.__get__
  File "strings.pyx", line 116, in spacy.strings.StringStore.__getitem__
KeyError: 4405041669077156115
That looks like a bug in the StringStore cleanup or something related (maybe a shared string store being cleaned up by one of the threads?).
My code just fetches texts from MySQL, splits them into texts and ids, and does: for id, doc in zip(ids_gen, nlp.pipe(docs_gen, ...)).

I think this is likely due to the solution added in spaCy 2 to address memory growth when streaming data.
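
Roughly, that solution backs up the StringStore, tracks the strings seen in recent docs, and periodically resets the store to the backup plus the recent strings once the old docs have been garbage-collected. Below is a simplified sketch of the idea, not the exact language.py code; the helper name pipe_with_cleanup is made up, and only the raw word texts are tracked as "recent", which is the detail that matters for this bug.

import weakref

def pipe_with_cleanup(nlp, texts, cleanup_period=10000):
    # Weak references: a doc drops out of these sets once the caller stops
    # holding it and it gets garbage-collected.
    recent_refs, old_refs = weakref.WeakSet(), weakref.WeakSet()
    original_strings = list(nlp.vocab.strings)
    recent_strings = set()
    nr_seen = 0
    for text in texts:
        doc = nlp(text)
        yield doc
        recent_refs.add(doc)
        for word in doc:
            # Only the word text is tracked here, not lemmas or other
            # strings created later in the pipeline.
            recent_strings.add(word.text)
        if nr_seen < cleanup_period:
            old_refs.add(doc)
            nr_seen += 1
        elif len(old_refs) == 0:
            # All docs from the "old" generation have expired, so any string
            # that is neither original nor recently seen should be
            # unreachable. Rebuild the store without them.
            old_refs, recent_refs = recent_refs, old_refs
            nlp.vocab.strings._reset_and_load(original_strings + list(recent_strings))
            recent_strings = set()
            nr_seen = 0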

@honnibal honnibal added the bug Bugs and behaviour differing from documentation label Nov 7, 2017
@rdmrcv
Contributor

rdmrcv commented Nov 8, 2017

I ran some experiments and reproduced a similar error as a test case. I'm very new to Python, and the code looks ugly and runs slowly, but it produces an error that looks the same (though with a different stacktrace):

# coding: utf8
from __future__ import unicode_literals

import random
import string

from ...lang.en import English


def test_issue1506():
    nlp = English()

    def random_string_generator(string_length, limit):
        # Yield `limit` random strings of `string_length` characters each.
        for _ in range(limit):
            yield ''.join(
                random.choice(string.digits + string.ascii_letters + '. ')
                for _ in range(string_length))

    # Stream enough texts through nlp.pipe to trigger the periodic
    # StringStore cleanup, then force each Doc's text to be resolved.
    for i, d in enumerate(nlp.pipe(random_string_generator(600, 20007))):
        str(d.text)

Info about spaCy

  • spaCy version: 2.0.2.dev0
  • Platform: Darwin-17.2.0-x86_64-i386-64bit
  • Python version: 3.6.2

Stacktrace:

spacy/language.py:554: in pipe
    for doc in docs:
spacy/language.py:534: in <genexpr>
    docs = (self.make_doc(text) for text in texts)
spacy/language.py:357: in make_doc
    return self.tokenizer(text)
tokenizer.pyx:106: in spacy.tokenizer.Tokenizer.__call__
    ???
tokenizer.pyx:156: in spacy.tokenizer.Tokenizer._tokenize
    ???
tokenizer.pyx:235: in spacy.tokenizer.Tokenizer._attach_tokens
    ???
doc.pyx:547: in spacy.tokens.doc.Doc.push_back
    ???
morphology.pyx:81: in spacy.morphology.Morphology.assign_untagged
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   KeyError: 10868232842057966403

strings.pyx:116: KeyError

@rdmrcv
Contributor

rdmrcv commented Nov 8, 2017

Also, when I revert the language.py changes from PR #1424, that test goes green.

@andharris

I'm having a similar issue when accessing token.lemma_ for some tokens.

@ines
Member

ines commented Nov 9, 2017

@andharris Do you still have an example text by any chance?

@andharris

@ines Here's a reproducible script:

import spacy
import thinc.extra.datasets


def main():
    nlp = spacy.blank('en')
    data, _ = thinc.extra.datasets.imdb()
    corpus = (i[0] for i in data)
    docs = nlp.pipe(corpus)
    lemmas = [[token.lemma_ for token in doc] for doc in docs]
    print("Parsed lemmas for {} docs in corpus".format(len(lemmas)))


if __name__ == '__main__':
    main()

Info:

  • spacy: 2.0.2
  • python: 3.6.2

Stacktrace:

Traceback (most recent call last):
  File "spacy_bug.py", line 15, in <module>
    main()
  File "spacy_bug.py", line 10, in main
    lemmas = [[token.lemma_ for token in doc] for doc in docs]
  File "spacy_bug.py", line 10, in <listcomp>
    lemmas = [[token.lemma_ for token in doc] for doc in docs]
  File ".../venv/lib/python3.6/site-packages/spacy/language.py", line 554, in pipe
    for doc in docs:
  File ".../venv/lib/python3.6/site-packages/spacy/language.py", line 534, in <genexpr>
    docs = (self.make_doc(text) for text in texts)
  File ".../venv/lib/python3.6/site-packages/spacy/language.py", line 357, in make_doc
    return self.tokenizer(text)
  File "tokenizer.pyx", line 106, in spacy.tokenizer.Tokenizer.__call__
  File "tokenizer.pyx", line 156, in spacy.tokenizer.Tokenizer._tokenize
  File "tokenizer.pyx", line 235, in spacy.tokenizer.Tokenizer._attach_tokens
  File "doc.pyx", line 547, in spacy.tokens.doc.Doc.push_back
  File "morphology.pyx", line 81, in spacy.morphology.Morphology.assign_untagged
  File "strings.pyx", line 116, in spacy.strings.StringStore.__getitem__
KeyError: 5846064049184721376

Interestingly, if I replace docs = nlp.pipe(corpus) with docs = (nlp.tokenizer(doc) for doc in corpus), I no longer get the error. Not sure why this works while the other fails, though.

@rdmrcv
Contributor

rdmrcv commented Nov 10, 2017

(nlp.tokenizer(doc) for doc in corpus) doesn't clean up the StringStore; it looks like that cleanup is what causes the error.

@rdmrcv
Contributor

rdmrcv commented Nov 10, 2017

Also, I tried to work around this case, and it looks like it's working...

        original_strings_data = self.vocab.strings.to_bytes()
        nr_seen = 0
        for doc in docs:
            yield doc
            recent_refs.add(doc)
            if nr_seen < 10000:
                old_refs.add(doc)
                nr_seen += 1
            elif len(old_refs) == 0:
                # All the docs in the 'old' set have expired, so the only
                # difference between the backup strings and the current
                # string-store should be obsolete. We therefore swap out the
                # old strings data.
                old_refs, recent_refs = recent_refs, old_refs
                tmp = self.vocab.strings.to_bytes()
                self.vocab.strings.from_bytes(original_strings_data)
                original_strings_data = tmp
                nr_seen = 0

I tried not to track strings manually, and instead swap the whole store using its low-level byte data.

Maybe this:

for word in doc:
    recent_strings.add(word.text)

doesn't track all strings? (It looks like it doesn't track lemmas at all.)

Or maybe I'm missing something in my code, and it just never cleans up?

@honnibal
Member Author

@ligser : You're exactly right. It's not adding the lemmas or other new strings --- just the word text.

Periodically we need to do:

current = original + recent

Currently we're getting recent by just tracking doc.text. It might be best to add something to the StringStore, but I'm worried that this adds more state that can be lost in serialisation, causing confusing results.

What if we had:

recent = current - previous
current, previous = (original + recent), current

This seems like it should work.

@rdmrcv
Contributor

rdmrcv commented Nov 10, 2017

If I understand you correctly, this code does that:

        origin_strings = list(self.vocab.strings)
        previous_strings = list()
        nr_seen = 0
        for doc in docs:
            yield doc
            recent_refs.add(doc)
            if nr_seen < 10000:
                old_refs.add(doc)
                nr_seen += 1
            elif len(old_refs) == 0:
                # All the docs in the 'old' set have expired, so the only
                # difference between the backup strings and the current
                # string-store should be obsolete. We therefore swap out the
                # old strings data.
                old_refs, recent_refs = recent_refs, old_refs
                current_strings = list(self.vocab.strings)
                recent_strings = [item for item in current_strings if item not in previous_strings]
                self.vocab.strings._reset_and_load(recent_strings + origin_strings)
                previous_strings = current_strings
                nr_seen = 0

@rdmrcv
Contributor

rdmrcv commented Nov 10, 2017

But that doesn't work.

If I subtract previous from current, I lose strings that are present in both (created during the previous iteration and still used in the current one), but those shouldn't be dropped. I'll try to think a bit more about which strings can safely be wiped out; it looks like I took the idea too literally.
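
To make the problem concrete, here is a tiny illustration with plain Python sets (the string names are made up):

# Hypothetical example: "lemma_a" was created during the previous batch and
# is reused by a doc in the current batch, so it needs to survive the cleanup.
original = {"the", "dog"}
previous = {"the", "dog", "lemma_a"}            # store contents after batch N-1
current = {"the", "dog", "lemma_a", "lemma_b"}  # store contents after batch N

recent = current - previous    # {"lemma_b"}: "lemma_a" is not counted as recent
new_store = original | recent  # {"the", "dog", "lemma_b"}
# Any doc from batch N that still holds the hash of "lemma_a" now hits a KeyError.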

@rdmrcv
Contributor

rdmrcv commented Nov 10, 2017

It looks like at that level, pipe simply cannot decide which strings are fresh and which are obsolete.
In your solution you try to track the truly recent strings. The only problem there is that the list of tracked words is incomplete, because of lemmas and other additions to the StringStore. That solution can work if you know how to track all of the strings.
My solution with the tmp variable just never performed any real cleanup: in the Nth iteration I effectively work with the (N-1)th string store, and it only works by luck.

@rdmrcv
Contributor

rdmrcv commented Nov 10, 2017

I think another version of the StringStore class could be used (a PipedStringStore, for example) that holds two different stores, "old" and "new", and knows about iterations.
When a new iteration starts, it swaps the stores and clears the new one. When a lookup hits the PipedStringStore, it first tries the "new" store; if the key doesn't exist there, it falls back to "old" and copies the value from "old" into "new".
At the end of an iteration it discards the "old" store, along with any values that weren't used during that iteration.
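
A minimal sketch of that idea in plain Python (PipedStringStore and its method names are hypothetical, not an existing spaCy API, and plain hash() only stands in for the 64-bit hashes the real StringStore uses):

class PipedStringStore(object):
    # Two generations of strings: lookups hit "new" first, fall back to "old"
    # and copy the entry forward; rotating generations discards whatever was
    # not used since the previous rotation.

    def __init__(self, strings=()):
        self.old = {}
        self.new = {hash(s): s for s in strings}

    def add(self, string):
        key = hash(string)
        self.new[key] = string
        return key

    def __getitem__(self, key):
        if key in self.new:
            return self.new[key]
        # Fall back to the old generation and copy the entry into "new" so
        # it survives the next rotation.
        string = self.old[key]  # raises KeyError only if truly unknown
        self.new[key] = string
        return string

    def rotate(self):
        # Begin a new iteration: strings not touched since the last rotation
        # stay behind and are dropped here.
        self.old, self.new = self.new, {}

One wrinkle: the original vocab strings would need to be pinned (or re-added on every rotation), otherwise they would eventually rotate out as well.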

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018